Evaluating Likelihoods
I do a lot of work that revolves around model based parameter estimation. In order to fit models to data one can invoke the maximum likelihood principle, as it has nice statistical properties in terms of bias and variance of the resulting estimators. Due to the fact that I mostly deal with measurement noise and other distributions where it is safe to assume that the central limit theorem holds, I am usually concerned with Gaussian random variables. One of the things we have to do during optimization is actually calculating the likelihood’s current value and possibly (read: certainly) some partial first and secondorder derivatives. However, in cases when the data itself is highdimensional, the model for the mean being algebraically more involved or the covariance matrix nonidentity, one needs to invest some kind of effort into an efficient implementation.
To do this, we are using the Einstein notation together with its realization in code as numpy.einsum
in Python. This allows us to efficiently and flexibly calculate very complex and highdimensional models with ease while also exploiting a large part of the algebraic structure of the model. We will also see how to conveniently derive invocations of numpy.einsum
for partial derivatives that one might need for gradientbased optimization.
The Setup
Say we have data \(y\) that follows an \(\boldsymbol N\)dimensional complexvalued (I am sorry) normal distribution \(Y\) with parametric mean \(s(\theta)\) with \(\theta \in T\) for some real vector space \(T\) and covariance \(\Sigma(\delta)\) for \(\delta \in D\) for some other real vector space \(D\). In this case the negative \(\log\)likelihood function given some data \(y \sim Y\) reads as $$ \lambda(\theta, \delta \vert y) = \log(\det(\Sigma(\delta))) + (y  s(\theta))^H \cdot \Sigma(\delta)^{1} \cdot (y  s(\theta)) $$ Usually, what I try to do is to maximize this function with respect to \(\theta\) and \(\delta\) given a certain \(y\). The structure of the parametric models both for the mean and for the covariance depend on our measurement setup, measured scenario and some physics.
The Mean
Let us focus first on the mapping \(s\). In order to proceed we need to impose some kind of structure onto it. As many physical phenomena are linear, so is ours. As such we can write down \(s: \mathbb{R}^{P \times S} \rightarrow \mathbb{C}^N\) as $$ s(\theta) = \sum\limits_{i = 1}^S \gamma_i \cdot a(\eta_i), $$ where \(\gamma_i \in \mathbb{C}^{\boldsymbol n}\) and \(a : \mathbb{R}^e \rightarrow \mathbb{C}^{\boldsymbol m}\). You can imagine \(\theta\) being a 2Darray, where each column contains the parameters of each summand. Note that we are doing the intuitive thing and parametrize \(\gamma_i\) by its real and imaginary part in \(\theta\) respectively. This is already a nice structure both for our parameters as well as the algebraic structure of \(s\).
You may have noticed the two bold symbols that are used to denote the dimensionality of \(\gamma\) and the values of the mapping \(a\). We need to talk – about multiindices. In this document we might be traveling into the realms of multidimensional arrays. I am not talking about tensors. Tensors encode linear mappings. Here, I am only talking about the datastructure of a multidimensional array. A multiindex is simply an ordered tuple of natural numbers, i.e., \(\boldsymbol i = (i_1, \dots, i_k) \in \mathbb{N}^k\). We are going to use them both for denoting the size of certain things as well as accessing certain values in a multidimensional array. Moving on.
Actually, no let’s stick to this point a little linger. The problem now is that the \(\cdot\) in $$ \gamma_i \cdot a(\eta_i) $$ as used in above equation makes no sense, since the two arrays can have dimensions that are not in agreement with what we usually use the \(\cdot\) for. To fix this, we first need another tool and then we are going to fix above equation.
Einstein Notation
In some sense, we just need to define what the indicated multiplication means. Such multidimensional array operations are basically an old hat for physicists and this is why they developed clever ways of conducting such operations. Let’s look at the well known matrixmatrix product. It reads as
$$
C_{i,\ell} = \sum\limits_{k} A_{i,k} \cdot B_{k,\ell},
$$
and please feel free to imagine the matrices being of compatible sizes – we are not counting crows here.
However, focus on the fact that in some sense the summation symbol is actually very distracting, so let’s get rid of it, such that we end up with
$$
C_{i,\ell} = A_{i,k} \cdot B_{k,\ell}.
$$
Have we lost information or has the equation become ambiguous? The answer is no. We only have to silently assume that first arrays on the side of an equation sharing an index are multiplied along these dimensions.
If additionally an indexing variable shows up on one hand side of an equation and not on the other we are also conducting a summation along this axis.
This a trimmed down flavor of the Einstein notation. ^{1}
One thing we are going to add to this convention is that we also use this for multiindices, since this makes the reading of some expressions much more digestible. Say for instance we have
$$
C_{p_1, p_2, \dots, p_d, f_1, \dots, f_h} = A_{p_1, p_2, \dots, p_d, d_1, \dots, d_e} \cdot B_{d_1, \dots, d_e, f_1, \dots, f_h}
$$
for some bigass arrayoperation. Then, we can juice it up with our multiindices and simply write
$$
C_{\boldsymbol p, \boldsymbol f} = A_{\boldsymbol p, \boldsymbol d} \cdot B_{\boldsymbol d, \boldsymbol f},
$$
if we take great care in defining the multiindices correctly.
However, if we are really sure about what we are doing, we could go one step further and say that we simply write
$$
C_{\boldsymbol c} = A_{\boldsymbol a} \cdot B_{\boldsymbol b},
$$
since we know that the notation takes care of the correct result for us! Now, of course this is some some degree very confusing, especially for humans.
A computer, in contrast, is very good at blindly following orders. And this is where numpy.einsum
^{2} comes into play. It is a function that is designed to apply operations on an arbitrary amount of multidimensional arrays, where the operations are encoded in a special flavor of the Einstein notation.
Say we wish to conduct above operation the tedious way. then we write


..
for an Ellipsis
, but instead imagine we name list elements there one by one.
Notice how closely einsum
follows the convention we just laid out. And since we can do stuff in code we have no problem with predefining the lists and then just doing


einsum
figure out the rest. As of right now this seems excessive given the fact that we wish to just multiply two arrays with each other, but see shortly how such a tool can be very useful.Fixing the Setup
With the Einstein notation and einsum
at our disposal we can now fix several things from above equations. We start with the mean, which we change to
$$
s_{\boldsymbol m}(\theta) = \gamma_{i, \boldsymbol \ell} \cdot a_{\boldsymbol p}(\eta_i),
$$
where we slightly abuse our notation in the sense that \(\eta_i\) is the \(i\)th column of \(\eta\) but does not really participate in the Einstein contraction process – only after evaluating the function \(a\).
Additionally, we can now even also fix the evaluation of the likelihood itself, since this also only makes limited sense if we are dealing with multidimensional data. It is not clear, what Hermitian transposes are, and again how exactly the vectormatricvector product is supposed to be conducted. We should reformulate it to $$ \lambda(\theta, \delta \vert y) = \log(\det(\Sigma(\delta))) + (y  s(\theta))_{\boldsymbol m_1}^\ast \cdot \Sigma(\delta)^{1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot (y  s(\theta))_{\boldsymbol m_2}, $$ where we also correctly account for the way a covariance of a multidimensional Gaussian random variable acts on such arrays. Moreover, we denote with \(z^\ast\) the complex conjugate of \(z \in \mathbb{C}\). Also, note that \(\det\) and the \(\cdot^{1}\) operator applied to the covariance tensor \(\Sigma\) are to be understood as the same operation one would get, if the Gaussian vectors are vectorized and then we would consider those vectors’ positive definite covariance matrix, where the normal linear algebra formulations hold.
Likelihood Evaluation using Einstein Notation
Now let us again focus on the part of the likelihood that involves the parameter \(\theta\). In this case we can drop the parts that are constant with respect to \(\theta\), which gives us $$ \lambda(\theta, \delta \vert y) = (y  s(\theta))^\ast_{\boldsymbol m_1} \cdot \Sigma(\delta)^{1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot (y  s(\theta))_{\boldsymbol m_2}. $$ and after some reworking, we end up with $$ \begin{align*} \lambda(\theta, \delta \vert y) =~& y^\ast_{\boldsymbol m_1} \cdot \Sigma(\delta)^{1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot y_{\boldsymbol m_2} + s(\theta)^\ast_{\boldsymbol m_1} \cdot \Sigma(\delta)^{1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot s(\theta)_{\boldsymbol m_2} \\ & 2 \Re \{ y_{\boldsymbol m_1}^\ast \cdot \Sigma(\delta)^{1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot s(\theta)_{\boldsymbol m_2} \}. \end{align*} $$ You might wonder why this is actually useful, but let us now consider how we can implement this. The second summand in above equation can now be combined with the notation for the mean \(s(\theta)\) and we get $$ s(\theta)_{\boldsymbol m_1} ^\ast \cdot \Sigma(\delta)^{1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot s(\theta)_{\boldsymbol m_2} = \gamma_{i_1, \boldsymbol \ell_1}^\ast \cdot a_{\boldsymbol p_1}(\eta_{i_1})^\ast \cdot \Sigma(\delta)^{1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot \gamma_{i_2, \boldsymbol \ell_2} \cdot a_{\boldsymbol p_2}(\eta_{i_2}). $$ This might look very messy, but if we take a look at a code example things should become much clearer.




Once one has gotten used to juggling these index sets, the whole computation becomes very intuitive and straightforward, since the Einstein notation assigns certain roles to axes and tracks those consistently across the executed products.
Exploiting more Structure
So far, we have only used the linear structure of the model \(s\), but in many cases the socalled atomic function \(a\) also has some structure in the sense that some chunks of the parameter \(\theta\) are actually decoupled and maybe even influence different axes (or sloppily: dimensions) of the observations \(y\). For us, the most relevant case is when it can be expressed as a contraction of several distinct functions. This can be formalized as $$ a(\theta) = a([\eta_1, \dots, \eta_k])_{\boldsymbol{p}} = a_1(\eta_1)_{\boldsymbol{n}_1} \cdot \dotsc \cdot a_k(\eta_k)_{\boldsymbol{n}_k}, $$ which first splits up the atomic function into a product of multiple ones which are chosen in such a way that the parameters can also be split up like this. By doing so, the parameter space becomes a product space and the atomic functions follow in the sense that we now only model their interaction by the specific form of their multiplication. Note that the multiindices \(\boldsymbol{n}_\ell \) do not necessarily need to be distinct. So it could be that two atoms share an index, which means they both contain the same data dimension and are multiplied along this dimension. As such, the corresponding code would look like


np.einsum
expression to evaluate one summand of the likelihood, we would have


Conclusion
We have derived how we can gradually put increasing load onto np.einsum
and less onto our brains. Note that there is a GPU version of this function in the cupy
^{3} package. Additionally, one can optimize even further by using the opt_einsum
^{4} package, that figures out how a certain np.einsum
should actually be executed, since we can exploit associativity to minimize the number of necessary floating point operations.
In a second post we will be evaluating likelihoods even harder by treating the covariance with more care and also by dealing with derivatives and Fisher Information things.
Tags: #Tech