Evaluating Likelihoods

9. März 2024

I do a lot of work that revolves around model based parameter estimation. In order to fit models to data one can invoke the maximum likelihood principle, as it has nice statistical properties in terms of bias and variance of the resulting estimators. Due to the fact that I mostly deal with measurement noise and other distributions where it is safe to assume that the central limit theorem holds, I am usually concerned with Gaussian random variables. One of the things we have to do during optimization is actually calculating the likelihood’s current value and possibly (read: certainly) some partial first- and second-order derivatives. However, in cases when the data itself is high-dimensional, the model for the mean being algebraically more involved or the covariance matrix non-identity, one needs to invest some kind of effort into an efficient implementation.

To do this, we are using the Einstein notation together with its realization in code as numpy.einsum in Python. This allows us to efficiently and flexibly calculate very complex and high-dimensional models with ease while also exploiting a large part of the algebraic structure of the model. We will also see how to conveniently derive invocations of numpy.einsum for partial derivatives that one might need for gradient-based optimization.

The Setup

Say we have data $y$ that follows an $\boldsymbol N$-dimensional complex-valued (I am sorry) normal distribution $Y$ with parametric mean $s(\theta)$ with $\theta \in T$ for some real vector space $T$ and covariance $\Sigma(\delta)$ for $\delta \in D$ for some other real vector space $D$. In this case the negative $\log$-likelihood function given some data $y \sim Y$ reads as $$ \lambda(\theta, \delta \vert y) = \log(\det(\Sigma(\delta))) + (y - s(\theta))^H \cdot \Sigma(\delta)^{-1} \cdot (y - s(\theta)) $$ Usually, what I try to do is to maximize this function with respect to $\theta$ and $\delta$ given a certain $y$. The structure of the parametric models both for the mean and for the covariance depend on our measurement setup, measured scenario and some physics.

The Mean

Let us focus first on the mapping $s$. In order to proceed we need to impose some kind of structure onto it. As many physical phenomena are linear, so is ours. As such we can write down $s: \mathbb{R}^{P \times S} \rightarrow \mathbb{C}^N$ as $$ s(\theta) = \sum\limits_{i = 1}^S \gamma_i \cdot a(\eta_i), $$ where $\gamma_i \in \mathbb{C}^{\boldsymbol n}$ and $a : \mathbb{R}^e \rightarrow \mathbb{C}^{\boldsymbol m}$. You can imagine $\theta$ being a 2D-array, where each column contains the parameters of each summand. Note that we are doing the intuitive thing and parametrize $\gamma_i$ by its real and imaginary part in $\theta$ respectively. This is already a nice structure both for our parameters as well as the algebraic structure of $s$.

You may have noticed the two bold symbols that are used to denote the dimensionality of $\gamma$ and the values of the mapping $a$. We need to talk – about multi-indices. In this document we might be traveling into the realms of multi-dimensional arrays. I am not talking about tensors. Tensors encode linear mappings. Here, I am only talking about the data-structure of a multi-dimensional array. A multi-index is simply an ordered tuple of natural numbers, i.e., $\boldsymbol i = (i_1, \dots, i_k) \in \mathbb{N}^k$. We are going to use them both for denoting the size of certain things as well as accessing certain values in a multi-dimensional array. Moving on.

Actually, no let’s stick to this point a little linger. The problem now is that the $\cdot$ in $$ \gamma_i \cdot a(\eta_i) $$ as used in above equation makes no sense, since the two arrays can have dimensions that are not in agreement with what we usually use the $\cdot$ for. To fix this, we first need another tool and then we are going to fix above equation.

Einstein Notation

In some sense, we just need to define what the indicated multiplication means. Such multi-dimensional array operations are basically an old hat for physicists and this is why they developed clever ways of conducting such operations. Let’s look at the well known matrix-matrix product. It reads as $$ C_{i,\ell} = \sum\limits_{k} A_{i,k} \cdot B_{k,\ell}, $$ and please feel free to imagine the matrices being of compatible sizes – we are not counting crows here. However, focus on the fact that in some sense the summation symbol is actually very distracting, so let’s get rid of it, such that we end up with $$ C_{i,\ell} = A_{i,k} \cdot B_{k,\ell}. $$ Have we lost information or has the equation become ambiguous? The answer is no. We only have to silently assume that first arrays on the side of an equation sharing an index are multiplied along these dimensions. If additionally an indexing variable shows up on one hand side of an equation and not on the other we are also conducting a summation along this axis. This a trimmed down flavor of the Einstein notation. ¹ One thing we are going to add to this convention is that we also use this for multi-indices, since this makes the reading of some expressions much more digestible. Say for instance we have $$ C_{p_1, p_2, \dots, p_d, f_1, \dots, f_h} = A_{p_1, p_2, \dots, p_d, d_1, \dots, d_e} \cdot B_{d_1, \dots, d_e, f_1, \dots, f_h} $$ for some big-ass array-operation. Then, we can juice it up with our multi-indices and simply write $$ C_{\boldsymbol p, \boldsymbol f} = A_{\boldsymbol p, \boldsymbol d} \cdot B_{\boldsymbol d, \boldsymbol f}, $$ if we take great care in defining the multi-indices correctly. However, if we are really sure about what we are doing, we could go one step further and say that we simply write $$ C_{\boldsymbol c} = A_{\boldsymbol a} \cdot B_{\boldsymbol b}, $$ since we know that the notation takes care of the correct result for us! Now, of course this is some some degree very confusing, especially for humans. A computer, in contrast, is very good at blindly following orders. And this is where numpy.einsum² comes into play. It is a function that is designed to apply operations on an arbitrary amount of multi-dimensional arrays, where the operations are encoded in a special flavor of the Einstein notation.

Say we wish to conduct above operation the tedious way. then we write

1
2
3
4
5
C = np.einsum(
    A, [p1,p2,..,pd,d1,..,de], 
    B, [d1,..,de,f1,..,fh], 
    [p1,..,pd,f1,..,fh]
)

and please do not mistake .. for an Ellipsis, but instead imagine we name list elements there one by one. Notice how closely einsum follows the convention we just laid out. And since we can do stuff in code we have no problem with pre-defining the lists and then just doing

1
2
3
4
5
6
7
p = [p1,p2,..,pd]
d = [d1,..,de]
f = [f1,..,fh]
a = p + d
b = d + f,
c = p + f
C = np.einsum(A, a, B, b, c)

which gives us exactly what we want. I would like to stress how much of a fire and forget solution this can be, if you can really think about the operation you want to do and then let einsum figure out the rest. As of right now this seems excessive given the fact that we wish to just multiply two arrays with each other, but see shortly how such a tool can be very useful.

Fixing the Setup

With the Einstein notation and einsum at our disposal we can now fix several things from above equations. We start with the mean, which we change to $$ s_{\boldsymbol m}(\theta) = \gamma_{i, \boldsymbol \ell} \cdot a_{\boldsymbol p}(\eta_i), $$ where we slightly abuse our notation in the sense that $\eta_i$ is the $i$-th column of $\eta$ but does not really participate in the Einstein contraction process – only after evaluating the function $a$.

Additionally, we can now even also fix the evaluation of the likelihood itself, since this also only makes limited sense if we are dealing with multi-dimensional data. It is not clear, what Hermitian transposes are, and again how exactly the vector-matric-vector product is supposed to be conducted. We should reformulate it to $$ \lambda(\theta, \delta \vert y) = \log(\det(\Sigma(\delta))) + (y - s(\theta))_{\boldsymbol m_1}^\ast \cdot \Sigma(\delta)^{-1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot (y - s(\theta))_{\boldsymbol m_2}, $$ where we also correctly account for the way a covariance of a multi-dimensional Gaussian random variable acts on such arrays. Moreover, we denote with $z^\ast$ the complex conjugate of $z \in \mathbb{C}$. Also, note that $\det$ and the $\cdot^{-1}$ operator applied to the covariance tensor $\Sigma$ are to be understood as the same operation one would get, if the Gaussian vectors are vectorized and then we would consider those vectors’ positive definite covariance matrix, where the normal linear algebra formulations hold.

Likelihood Evaluation using Einstein Notation

Now let us again focus on the part of the likelihood that involves the parameter $\theta$. In this case we can drop the parts that are constant with respect to $\theta$, which gives us $$ \lambda(\theta, \delta \vert y) = (y - s(\theta))^\ast_{\boldsymbol m_1} \cdot \Sigma(\delta)^{-1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot (y - s(\theta))_{\boldsymbol m_2}. $$ and after some reworking, we end up with $$ \begin{align*} \lambda(\theta, \delta \vert y) =~& y^\ast_{\boldsymbol m_1} \cdot \Sigma(\delta)^{-1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot y_{\boldsymbol m_2} + s(\theta)^\ast_{\boldsymbol m_1} \cdot \Sigma(\delta)^{-1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot s(\theta)_{\boldsymbol m_2} \\ &- 2 \Re \{ y_{\boldsymbol m_1}^\ast \cdot \Sigma(\delta)^{-1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot s(\theta)_{\boldsymbol m_2} \}. \end{align*} $$ You might wonder why this is actually useful, but let us now consider how we can implement this. The second summand in above equation can now be combined with the notation for the mean $s(\theta)$ and we get $$ s(\theta)_{\boldsymbol m_1} ^\ast \cdot \Sigma(\delta)^{-1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot s(\theta)_{\boldsymbol m_2} = \gamma_{i_1, \boldsymbol \ell_1}^\ast \cdot a_{\boldsymbol p_1}(\eta_{i_1})^\ast \cdot \Sigma(\delta)^{-1}_{\boldsymbol m_1, \boldsymbol m_2} \cdot \gamma_{i_2, \boldsymbol \ell_2} \cdot a_{\boldsymbol p_2}(\eta_{i_2}). $$ This might look very messy, but if we take a look at a code example things should become much clearer.

1
2
3
4
s = np.einsum(gamma, [0,2], a, [0,1,2], [0,1])
llfSummand = np.einsum(
    s.conj(), [0,1], SigmaInv, [0,1,2,3], s, [2,3], []
)

Notice how in above listing the array $s(\theta)$ is used twice with different indices. This corresponds to the above use of $\boldsymbol m_1$ and $\boldsymbol m_2$. The other quadratic term can be turned into the following call.

1
2
3
4
5
6
llfSummand = np.einsum(
    gamma.conj(), [0,4], a.conj(), [0,1,4],
    SigmaInv, [0,1,2,3], 
    gamma, [2,5], a, [2,3,5], 
    []
)

Once one has gotten used to juggling these index sets, the whole computation becomes very intuitive and straightforward, since the Einstein notation assigns certain roles to axes and tracks those consistently across the executed products.

Exploiting more Structure

So far, we have only used the linear structure of the model $s$, but in many cases the so-called atomic function $a$ also has some structure in the sense that some chunks of the parameter $\theta$ are actually decoupled and maybe even influence different axes (or sloppily: dimensions) of the observations $y$. For us, the most relevant case is when it can be expressed as a contraction of several distinct functions. This can be formalized as $$ a(\theta) = a([\eta_1, \dots, \eta_k])_{\boldsymbol{p}} = a_1(\eta_1)_{\boldsymbol{n}_1} \cdot \dotsc \cdot a_k(\eta_k)_{\boldsymbol{n}_k}, $$ which first splits up the atomic function into a product of multiple ones which are chosen in such a way that the parameters can also be split up like this. By doing so, the parameter space becomes a product space and the atomic functions follow in the sense that we now only model their interaction by the specific form of their multiplication. Note that the multi-indices $\boldsymbol{n}_\ell $ do not necessarily need to be distinct. So it could be that two atoms share an index, which means they both contain the same data dimension and are multiplied along this dimension. As such, the corresponding code would look like

1
2
3
4
atoms = np.einsum(
    a1, n1, a2, n2, .., ak, nk 
    p
)

and in a complete np.einsum expression to evaluate one summand of the likelihood, we would have

1
2
3
4
5
6
7
8
llfSummand = np.einsum(
    gamma.conj(), ell1, 
    a1.conj(), n11, a2.conj(), n12, .., ak.conj(), n1k,
    SigmaInv, m1 + m2, 
    gamma.conj(), ell2, 
    a1.conj(), n21, a2.conj(), n22, .., ak.conj(), n2k,
    []
)

where we need to make sure that the index lists for the first set of atoms and $\gamma$ is matching the index list $\boldsymbol{m}_1$ and the second one $\boldsymbol{m}_2$, respectively.

Conclusion

We have derived how we can gradually put increasing load onto np.einsum and less onto our brains. Note that there is a GPU version of this function in the cupy³ package. Additionally, one can optimize even further by using the opt_einsum⁴ package, that figures out how a certain np.einsum should actually be executed, since we can exploit associativity to minimize the number of necessary floating point operations.

In a second post we will be evaluating likelihoods even harder by treating the covariance with more care and also by dealing with derivatives and Fisher Information things.

Tags: #Tech