Predictive Statistics and Machine Learning aim at building models with parameters such that the final output/prediction is as close as possible to the actual value. This implies the optimization of an objective function, which might be either minimized (like loss functions) or maximized (like Maximum Likelihood function).

The idea behind optimization routine is starting from an initial random value of our target function’s variable, and then updating this value according to a given rule. The iteration stops when the distance between two values is approaching zero, or maybe after a maximum number of iterations arbitrarily decided.

In this article, I’m going to provide an intuitive explanation of the Newton method, proposing a geometric interpretation. The assumption behind this method is that our target function f(x), the one we want to optimize, is twice differentiable and f”(x) is not equal to zero. Here, we will be working with a smooth and regular polynomial:

import numpy as np import math import matplotlib.pyplot as plt def f(x): return x**3-6*x**2+4*x+2 x = np.linspace(-1, 1) fig, ax = plt.subplots() ax.plot(x, f(x), label='target') ax.grid()

Now, the idea of Newton method is that, in order to find the optimum (in our case, the maximum) of our target, we have to start by approximating our curve in a random starting point with a second order Taylor extension, then compute the maximum of that extension (that means, taking its first derivative and setting it equal to zero). The maximum of the quadratic approximation will be the new value of our x. Then, with an iterative procedure, this is repeated and the resulting series of x, xn, tends to converge towards the maximum of the target, x*, as n tends towards infinite.

However, let’s proceed step by step, and let’s first compute the Taylor expansion of our function at a given point x0:

Now, we want the first derivative of this quantity to be equal to 0, since we want to compute the optimum of this new curve. More specifically, we want to differentiate with respect to (x-x0), which is the increment needed to reach the maximum of the approximation. Hence:

This formula can be generalized with the following updating rule:

Let’s see it geometrically:

As you can see, as xn approaches x*, the distance between two x values approaches towards zero. As anticipated, since it might be possible that xn and x* never perfectly match, one might impose a stopping criterion when, for example, the updating factor is less than an arbitrarily small epsilon.

Now let’s implement it in Python, using as target the function we already defined. To proceed with Newton method, we have to define three further elements: first and second order derivative, and the quadratic approximation:

def fprime(x): return 3*x**2-12*x+4 def fsecond(x): return 6*x - 12 def quadratic_approx(x, x0, f, fprime, fsecond): return f(x0)+fprime(x0)*(x-x0)+0.5*fsecond(x0)*(x-x0)**2

Let’s first have a look at the plot of the quadratic approximation in x0=0:

x = np.linspace(-1, 1) fig, ax = plt.subplots() ax.plot(x, f(x), label='target') ax.grid() ax.plot(x, quadratic_approx(x, 0, f, fprime, fsecond), color='red', label='quadratic approximation') ax.set_ylim([-2,3]) ax.axhline(y=0, color='k') ax.axvline(x=0, color='k') plt.legend()

Now let’s define our Newton method:

def newton(x0, fprime, fsecond, maxiter=100, eps=0.0001): x=x0 for i in range(maxiter): xnew=x-(fprime(x)/fsecond(x)) if xnew-x<eps: return xnew print('converged') break x = xnew return x

Let’s plot the result with the x_star:

x_star=newton(0, fprime, fsecond) fig, ax = plt.subplots() ax.plot(x, f(x), label='target') ax.grid() ax.plot(x, quadratic_approx(x, x_star , f, fprime, fsecond), color='red', label='quaratic approximation') ax.set_ylim([-2,3]) ax.axhline(y=0, color='k') ax.axvline(x=0, color='k') ax.axvline(x = x_star, color='green') plt.legend()

As you can see, now the maximum of the linear approximation coincides (approximately) with the maximum of our target. Of course, the iteration didn’t last a lot, since we started from a x0 pretty close to the real maximum of the function. Nevertheless, this is the exact procedure whatever the starting point is.

Newton method is extremely powerful, yet it can be applied only under the assumption of twice differentiability, differently from other optimization methods as gradient descent (which only requires first differentiability). Hence, you might want to use this method only in case of ‘well-behaving’ function, like our polynomial of the example above.

I hope this geometric and intuitive interpretation of Newton method was clear, thanks for reading!