Blogito, ergo sum.

Testing Derivatives

Introduction

There might still be people out there who are actually insane enough to implement functions and their derivatives by hand. By hand I mean without resorting to some autodiff functionality. These features are offered by virtually every serious machine learning framework out there, and for obvious reasons: well-performing backprop is essential for fast learning procedures. However, and this might come as a shock to you, not everyone is working in modern statistical learning. Some of us might, for instance, use functions that are not very common in deep learning, like complex-valued or transcendental ones. As such, we might still do some implementations in numpy or even C, but we still wish to see whether our function and its supposed derivative actually match.

The Problem

Say we have a function \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\) with \(x \mapsto f(x)\) that is smooth, i.e., \(C^{\infty}\), so considering any order of derivative makes sense. Hence, we declare \(Jf : \mathbb{R}^n \rightarrow \mathbb{R}^{n \times m}\) as the Jacobian matrix of \(f\), which is defined as $$ x \mapsto J_i f_j (x) = \frac{\partial}{\partial x_i} f_j(x), \quad 1 \leqslant i \leqslant n, \; 1 \leqslant j \leqslant m, $$ such that the function \(J_i f_j\) denotes the derivative of the \(j\)-th coordinate of \(f\) with respect to the \(i\)-th coordinate of its domain. As we wish to talk about code, we have also created implementations of \(f\) and \(Jf\); we are not yet sure that \(Jf\) is correctly implemented, and we wish to conduct a test.
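
To make the notation concrete, here is a minimal sketch of what such a pair of implementations might look like in numpy. The function \(g\) and its hand-written Jacobian \(Jg\) are made up purely for illustration and follow the \(n \times m\) indexing convention above:

import numpy as np

# Made-up example pair: g : R^2 -> R^3 and its hand-written Jacobian
# Jg : R^2 -> R^{2 x 3}, with Jg[i, j] = d g_j / d x_i as defined above.
def g(x):
    return np.array([np.sin(x[0]) * x[1],
                     x[0] ** 2,
                     np.exp(x[1])])

def Jg(x):
    return np.array([[np.cos(x[0]) * x[1], 2.0 * x[0], 0.0],
                     [np.sin(x[0]),        0.0,        np.exp(x[1])]])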

The Real Problem

Let us take some steps back and assume that \(f : \mathbb{R} \rightarrow \mathbb{R}\); we will see shortly how to generalize this back to our original case. In this case, we also have \(Jf : \mathbb{R} \rightarrow \mathbb{R}\). Now, in order to run a test, we choose an arbitrary but fixed \(x_0 \in \mathbb{R}\) and evaluate both our implementations to two sequences of values via $$ y_k = f(x_0 + k \Delta), \quad z_k = Jf(x_0 + k \Delta), \quad 0 \leqslant k \leqslant K-1 $$ for some stepsize \(\Delta > 0\). Now we remember the geometric interpretation of the derivative as the best local linear approximation of a differentiable function and compute $$ \hat{z}_k = \frac{y_{k+1} - y_k}{\Delta} $$ and consider the error $$ e_0(K, \Delta) = \frac 1{K-1} \sum\limits_{k = 0}^{K-2} \vert \hat{z}_k - z_k \vert. $$ In some sense, this means we consider our implementation of \(f\) to be correct and take it as gospel. Hence, we can use its finite differences to approximate the analytical expression of our assumed derivative. If \(e_0\) is small, we pat ourselves on the back and move on. Now let us put this into action with some simple numpy code as follows:

import numpy as np

def f(x):
    return np.sin(x)


def Jf(x):
    return np.cos(x)

K = 100
Delta = 1e-4

x = np.linspace(0, K * Delta, K, endpoint=False)
y = f(x)
z = Jf(x)
zhat = (y[1:] - y[:-1]) / Delta  # forward differences of f

# average absolute deviation between the forward differences and the
# implemented derivative (essentially the error e_0 from above)
print(np.sum(np.abs(z[1:] - zhat)) / K)

This prints 2.458479852518458e-07. Nice. Seems low, right? Is it low enough? If not, when is it low enough? If yes, why? Imagine we scale both implementations by a factor of 1e6 via
def f(x):
    return 1e6 * np.sin(x)

def Jf(x):
    return 1e6 * np.cos(x)
Our derivative would still be correct, but the error would be 0.24584798518801107. Now, is this too big? It shouldn't be (right?!), since the implementation is correct.

It gets worse. Imagine we intentionally cripple our implementation of the derivative via

def Jf(x):
    return np.cos(x) + 1e-4 * np.sin(100 * x)
and run the original code from above again, we get 4.5302802853471876e-05. Still pretty low, although we have introduced some severe ringing into our derivative. Fair enough, we added two orders of magnitude in error, but how would we know without doing the correct implementation first?

I hope by now I have convinced you that just doing these forward differences and comparing them with our implementation is not robust enough to be considered trustworthy. So let us keep on thinking.

Approximating a Solution

If you have attended a numerical analysis course or just read the Wikipedia article on approximations of derivatives, you might know the following expression for the relationship between \(z_k\) and \(\hat{z}_k\), $$ \hat{z}_k - z_k = o(\Delta) \rightarrow 0 \quad \text{for } \Delta \rightarrow 0, $$ which can also be derived from Taylor's theorem. There even exist explicit bounds, stating that if \(z_k\) is correct, we have $$ \vert \hat{z}_k - z_k \vert \leqslant \frac{\Delta^2}{2} \max\limits_{\xi \in [x_k, x_k+\Delta]} \vert J^2 f (\xi)\vert, $$ where \(x_k = x_0 + k\Delta\). We did it, right?! We can simply use the above right-hand side to calculate a threshold for our error \(e_0\), right? Sadly, no. The reason is that the error term depends on \(J^2 f\), the second derivative of the very function we are currently trying to differentiate. As such, it is somewhat useless, as it depends on a quantity we cannot confidently claim to have access to. Sure, bounds might exist, but we should not construct a test that depends on such arcane knowledge.
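
To see how much this matters in practice, here is a small sketch (entirely my own construction, not from the post's examples): two functions with perfectly correct derivatives, tested with the exact same \(\Delta\), produce very different raw errors, simply because their second derivatives differ.

import numpy as np

# Both derivatives below are exactly correct, yet the raw forward-difference
# error differs wildly, because it scales with the size of the second derivative.
K = 100
Delta = 1e-4
x = np.linspace(0, K * Delta, K, endpoint=False)

for omega in (1.0, 100.0):
    y = np.sin(omega * x)            # f(x) = sin(omega * x)
    z = omega * np.cos(omega * x)    # its exact derivative
    zhat = (y[1:] - y[:-1]) / Delta
    print(omega, np.sum(np.abs(z[1:] - zhat)) / K)

You should see a much larger raw error reported for the more oscillatory second function, even though both derivatives are exactly correct, so a single fixed threshold on \(e_0\) cannot serve both cases.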

The Solution

But the above formula reveals something else that is very interesting: it tells us how the error scales with respect to \(\Delta\)! Say we have chosen some arbitrary \(\Delta\) to get \(e_0(K, \Delta)\), and then we calculate the error again as \(e_0(K, \Delta/2)\). The above bound tells us that $$ \frac{e_0(K, \Delta)}{e_0(K, \Delta/2)} \approx 4, $$ which in turn is independent of the specific choices of \(\Delta\) or \(K\), and of the scaling of the function \(f\).

Let’s put this to a test as follows:

import numpy as np

def f(x):
    return np.sin(x)

def Jf(x):
    return np.cos(x)

K = 100
Delta = 1e-4

x = np.linspace(0, K * Delta, K, endpoint=False)
y = f(x)
z = Jf(x)
zhat = (y[1:] - y[:-1]) / Delta
e1 = np.sum(np.abs(z[1:] - zhat)) / K  # error at stepsize Delta

Delta = 0.5 * Delta  # halve the stepsize and repeat the experiment

x = np.linspace(0, K * Delta, K, endpoint=False)
y = f(x)
z = Jf(x)
zhat = (y[1:] - y[:-1]) / Delta
e2 = np.sum(np.abs(z[1:] - zhat)) / K  # error at stepsize Delta / 2

print(e1 / e2)  # ratio of the two errors

This outputs 3.999975412707132. Insane! Now scaling the two functions as before with 1e6 gives 3.9999754124654383. Up to floating point messing with us, this is the same number. Dope! So we need not worry about linear factors in our function anymore.

But now what about the error we had introduced above? So let us go back to

def Jf(x):
    return np.cos(x) + 1e-4 * np.sin(100 * x)
and we get 1.8733898360740509, which is wildly different from 4. Now you might say that we can get closer to 4 if we just decrease our error. At some point you might be right, but even doing
def Jf(x):
    return np.cos(x) + 1e-6 * np.sin(100 * x)
still yields 1.1583847508006164, which looks even worse, even though the introduced error is two orders of magnitude smaller.

Finally, we can also increase the stepsize to \(\Delta = 10^{-2}\), i.e., by a factor of 100, and we still get 3.759148316873366. So even while covering a much larger part of the domain, we remain pretty stable in terms of the error ratio.

Higher Dimensions

Let us go back to the original high-dimensional problem. If \(f\) has vector-valued output, we simply broadcast the test to all output dimensions, which is pretty straightforward, as we can basically treat all the dimensions independently.

Considering multiple dimensions in the domain of \(f : \mathbb{R}^n \rightarrow \mathbb{R}\), we can resort to directional derivatives along lines with random directions \(d \in \mathbb{R}^n\). This essentially defines a new function \(g_d : \mathbb{R} \rightarrow \mathbb{R}\) via \(g_d(\lambda) = f(x_0 + \lambda \cdot d)\). Then the derivative \(J g_d\) obeys $$ J g_d (\lambda) = J f(x_0 + \lambda \cdot d) \cdot d, $$ which can then be plugged straightforwardly into the above testing procedure, possibly for multiple random directions. Since the correctness of the derivative of \(g_d\) depends on the correctness of \(J f\), we can use this as a test with the same power as above.
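
As a sketch of how both ideas could be combined in code (the helper name directional_error and the test pair \(f\), \(Jf\) below are my own, purely illustrative choices), one can restrict \(f\) to a random line, apply the forward-difference error from above to every output dimension at once, and look at the error ratio per direction:

import numpy as np

rng = np.random.default_rng(0)

def directional_error(f, Jf, x0, d, Delta, K=100):
    # e_0 for the restriction g_d(lam) = f(x0 + lam * d); its derivative is
    # J g_d(lam) = d @ Jf(x0 + lam * d) with the n x m Jacobian convention.
    lam = np.linspace(0, K * Delta, K, endpoint=False)
    y = np.stack([f(x0 + l * d) for l in lam])        # shape (K, m)
    z = np.stack([d @ Jf(x0 + l * d) for l in lam])   # shape (K, m)
    zhat = (y[1:] - y[:-1]) / Delta                   # forward differences
    return np.sum(np.abs(z[1:] - zhat)) / K

# A small made-up test pair f : R^2 -> R^2 with its Jacobian stored as (n, m).
def f(x):
    return np.array([np.sin(x[0]) * x[1], np.exp(x[1])])

def Jf(x):
    return np.array([[np.cos(x[0]) * x[1], 0.0],
                     [np.sin(x[0]),        np.exp(x[1])]])

x0 = rng.standard_normal(2)
Delta = 1e-4
for _ in range(3):                                    # a few random directions
    d = rng.standard_normal(2)
    e1 = directional_error(f, Jf, x0, d, Delta)
    e2 = directional_error(f, Jf, x0, d, Delta / 2)
    print(e1 / e2)                                    # error ratio for this direction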

Conclusion

Summarizing, it is a bad idea to directly compare approximation errors against a threshold: the error scales linearly when the function under test is scaled linearly, it is hard to determine a threshold without further knowledge, and, something I have not even mentioned, it also depends on the approximation quality of the finite difference method used.

In contrast, calculating multiple errors and comparing their ratio against the expected change in error is much more stable with respect to the issues above. Hence, I would claim that it greatly improves the robustness of your testing, and you should use it.
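
If you want to adopt this, here is a minimal sketch of how the ratio check could be packaged as a reusable helper; the function names, the tolerance, and the sin/cos usage below are my own choices mirroring the example in this post, not an established API:

import numpy as np

def error_ratio(f, Jf, x0=0.0, Delta=1e-4, K=100):
    # Ratio e_0(K, Delta) / e_0(K, Delta / 2) of forward-difference errors,
    # computed exactly as in the snippets above.
    def e(delta):
        x = x0 + np.linspace(0, K * delta, K, endpoint=False)
        y, z = f(x), Jf(x)
        zhat = (y[1:] - y[:-1]) / delta
        return np.sum(np.abs(z[1:] - zhat)) / K
    return e(Delta) / e(Delta / 2)

# Hypothetical usage in a test suite; the expected ratio and tolerance mirror
# the sin/cos example from this post and may need tuning for other functions.
def test_Jf_matches_f():
    assert abs(error_ratio(np.sin, np.cos) - 4.0) < 0.5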

Tags: #Tech