Testing Derivatives
Introduction
There still might be people out there who are actually so insane to implement functions and their derivatives by hand.
By hand I mean without refraining to some autodiff
functionality.
These features are offered by virtually every serious machine learning framework out there and for obvious reasons.
Well performing Backprop is essential for fast learning procedures.
However, and this might come as a shock to you, not everyone is working in modern statistical learning.
So some of us might, for instance, use functions that are not very common in deep learning, like complex valued functions, or transcendental ones.
As such, we might still do some implementations in numpy
or even C
, but we still wish to see, if our function and its supposed derivative actually match.
The Problem
Say, we have a function \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\) with \(x \mapsto f(x)\) that is smooth, i.e., \(C^{\infty}\), so considering any order of derivative makes sense. Hence, we declare \(Jf : \mathbb{R}^n \rightarrow \mathbb(R)^{n \times m}\) as the Jacobian matrix of \(f\), which is defined as $$ x \mapsto J_i f_j (x) = \frac{\partial}{\partial x_i} f_j(x), 0 \leqslant i \leqslant n, 0 \leqslant j \leqslant m, $$ such that the function \(J_i f_j\) denotes the derivative of the \(i\)th coordinate of \(f\) with respect to the \(j\)th coordinate of its domain. As we wish to talk about code, we also have created implementations of \(f\) and \(J f\), where we are not sure yet that \(J f\) is correctly implemented and we wish to conduct a test.
The Real Problem
Let us take some steps back and assume that \(f : \mathbb{R} \rightarrow \mathbb{R}\) and we will see shortly how to generalize this back to our original case.
In this case, we also have that \(Jf : \mathbb{R} \rightarrow \mathbb{R}\).
Now in order to run a test, we choose a special \(x_0 \in \mathbb{R}\) and evaluate both our implementations to two sequences of values via
$$
y_k = f(x_0 + k \Delta), z_k = Jf(x_0 + k \Delta), 0 \leqslant k \leqslant K1
$$
for some stepsize \(\Delta > 0\) and \(k \in \mathbb{N}\).
Now we remember the geometric interpretation of a derivative as the best local linear approximation of a differentiable function and do
$$
\hat{z}_k = \frac{y_{k+1}  y_k}{\Delta}
$$
and we consider the error
$$
e_0(K, \Delta) = \frac 1{K1} \sum\limits_{k = 0}^{K2} \vert \hat{z}_k  z_k \vert.
$$
In some sense, this means we consider our implementation of \(f\) to be correct and take it for god’s will.
Hence, we can use its finite differences to approximate the analytical expression of our assumed derivative. If \(e_0\) is small we pat ourselves on the back and move on.
Now let us put this into action with some simple numpy
code as follows:


2.458479852518458e07
. Nice. Seems low, right? Is it low enough? If not, when is it low enough? If yes, why? You could imagine, we scale both implementations by a factor of 1e6
via


0.24584798518801107
. Now is this too big? It shouldn’t be (right?!) since the implementation is correct.It get’s worse. Imagine we intentionally cripple our implementation of the derivative via


4.5302802853471876e05
. Still pretty low, although we have introduced some severe ringing into our derivative. Fair enough, we added two orders of magnitude in error, but how would be know without doing the correct implementation first?I hope by now I have convinced you that just doing these forward differences and comparing them with our implementation is not robust enough to be considered trustworthy. So let us keep on thinking.
Approximating a Solution
If you have attended a numerical analysis course or just read the Wikipedia article about approximations of derivatives, you might know the following expression about the relationship of \(z_k\) and \(\hat{z}_k\) $$ \hat{z}_k  z_k = o(\Delta) \rightarrow 0, for \Delta \rightarrow 0, $$ which can also be derived from Taylor’s theorem. There even exist explicit bounds, stating that if \(\hat{z}_k\) is correct we have that $$ \vert \hat{z}_k  z_k \vert \leqslant \frac{\Delta^2}{2} \max_\limits{\xi \in [x, x+\Delta]} \vert J^2 f (\xi)\vert. $$ We did it, right?! We can simply use above right hand side to calculate a threshold for our error \(e_0\), right? Sadly, no. The reason is that the error term depends on \(J^2 f\), which is the second derivative of the function we are currently trying to differentiate. As such, it is somewhat useless, as it depends on a quantity we cannot claim confidently to have access to. Sure, bounds might exist, but we should not construct a test that is depending on such arcane knowledge.
The Solution
But a above formula reveals something else that is very interesting. It tells us how the error scales with respect to \(\Delta\)! Say we have chosen some arbitrary \(\Delta\) to get \(e_0(K, \Delta)\) and then just calculate the error again as \(e_0(K, \Delta/2)\). We should consult above bound to tell us that $$ \frac{e_0(K, \Delta)}{e_0(K, \Delta/2)} \approx 4, $$ which in turn is independent of specific choices of \(\Delta\) or \(K\) of the scaling of the function \(f\).
Let’s put this to a test as follows:


3.999975412707132
. Insane! Now scaling the two functions as before with 1e6
gives 3.9999754124654383
.
Up to floating point messing with us, this is the same number. Dope! So we need not worry about linear factors in our function anymore.But now what about the error we had introduced above? So let us go back to


1.8733898360740509
, which is wildly different from 4
. Now you might say that we can get closer to 4
if we just decrease our error. At some point you might be right, but even doing


1.1583847508006164
, which looks even worse, even though we have two orders of magnitude lower error.Finally, we can also increase the stepsize \(\Delta = 10^{2}\) so by a factor of 100, we still get 3.759148316873366
, so also while covering a much larger area of the domain we remain pretty stable in terms of error ratio.
Higher Dimensions
Let us go back to the original highdimensional problem. If \(f\) has vector valued output, we simply broadcast the test to all output dimensions, which is pretty straightforward, as we can basically treat all the dimensions independently.
Considering multiple dimensions in the domain of \(f : \mathbb{R}^n \rightarrow \mathbb{R}\) we can resort to directional derivatives along lines with random directions \(d \in \mathbb{R}^n\). This essentially defines a new function \(g_d : \mathbb{R} \rightarrow \mathbb{R}\) defined as \(g_d(\lambda) = f(x_0 + \lambda \cdot d)\). Then the derivative \(J g_d\) obeys $$ J g_d (\lambda) = J f(x_0 + \lambda \cdot d) \cdot d, $$ which then can be used in above testing procedure straightforwardly possibly for multiple random directions. Since the correctness of the derivative of \(g_d\) depends on the correctness of \(J f\) we can use this as a test with the same power as above.
Conclusion
Summarizing, it is a bad idea to do direct comparison of approximation errors. This is due to the fact that this error scales linearly with linearly scaling the function under test, it is hard to determine a threshold without further knowledge and what I have not mentioned is that it depends on the approximation quality of the imposed finite difference method used for approximation.
In contrast, calculating multiple errors and studying their ratios is much more stable with respect to the issues above when comparing it to the expected change in error. Hence, I would claim that it greatly improves the robustness of your testing and you should use it.
Schlagwörter: #tech