# Curve Fitting with Maximum Likelihood

*[Context: Bishop’s Pattern Recognition and Machine Learning, section 1.2.5]*

As a way of demonstrating how probabilistic methods can be used to solve pattern recognition problems, Bishop explains how to use the idea of maximum likelihood to fit polynomials to a data set.

The problem is this: we have a set of ordered pairs \((x_i, t_i)\), where \(t_i = f(x_i)\) for some function \(f\), which we don’t know. It would be nice if we could approximate the unknown function, possibly because we want to know it’s value at some future point, beyond the largest \(x_i\) known to us. This is possible if we know/assume one thing about the process generating the values in \(\textbf{t}\).

Whatever the generating process actually is, there is a large chance the values we are seeing are not the pure outputs of that process. There is probably some random noise affecting our data. If we assume knowledge of how the values affecting our data are distributed, we can come up with an expression representing the likelihood of seeing the data.

Now, as much as I would like to be extremely thorough and explain the full background, I don’t think that’s the best use of my time. I think Bishop does a good job of this, and I won’t say anything about the introduction to this problem or the pros and cons of this technique that you can’t get from his book.

What you *will* get from this post that you won’t get from the book is a full,
detailed derivation of the matrix equation whose solution gives the coefficients
of the fitted polynomial. And you’ll get a short python script I wrote which
fits polynomials to data sets using this method.

This is one of those cases where the author says “and now we just minimize this function for \(\textbf{w}\).” I think there are many people who want to read this book but who wouldn’t know how to go about that. And so I feel good trying to fill that gap.

## The log likelihood

The initial expression for the likelihood of the data set is

$$ \begin{equation} \begin{split} p(\textbf{t} | \textbf{x}, \textbf{w}, \beta^{-1} ) & = \prod_{n=0}^{N} \mathcal{N}(t_n|y(x_n, \textbf{w}), \beta^{-1}) \newline & = \prod_{n=0}^N \frac{1}{(2\pi\beta^{-1})^{\frac{1}{2}}} \exp \bigg\{ -\frac{1}{2\beta^{-1}} \cdot \big[t_n - y(x_n, \textbf{w}) \big]^2 \bigg\}, \end{split} \end{equation} $$

but doing differential calculus with products of functions sucks big
time,^{1} so we take the natural log of the expression. This leads to
sums of individual functions, which will prove simpler to work with during the
maximization step. The derivation of the log likelihood follows, and only
involves logarithm
identities and
summation identities.

$$ \begin{equation} \begin{split} \ln \Big[ p(\textbf{t} | \textbf{x}, \textbf{w}, \beta^{-1} ) \Big] & = \ln \Big[ \prod_{n=0}^{N} \mathcal{N}(t_n|y(x_n, \textbf{w}), \beta^{-1}) \Big] \newline & = \sum_n^N \ln \Bigg[ \frac{1}{(2\pi\beta^{-1})^{\frac{1}{2}}} \exp \bigg\{ -\frac{1}{2\beta^{-1}} \cdot \big[t_n - y(x_n, \textbf{w}) \big]^2 \bigg\} \Bigg] \newline & = \sum_n^N \Bigg[ - \frac{1}{2\beta^{-1}} \big[t_n - y(x_n, \textbf{w}) \big]^2 - \ln \bigg[ (2\pi\beta^{-1})^{\frac{1}{2}} \bigg] \Bigg] \newline & = \sum_n^N \Bigg[ - \frac{1}{2\beta^{-1}} \big[t_n - y(x_n, \textbf{w}) \big]^2 - \frac{1}{2} \ln(2\pi) + \frac{1}{2} \ln(\beta) \Bigg] \newline & = \sum_n^N \Bigg[ - \frac{\beta}{2} \big[t_n - y(x_n, \textbf{w}) \big]^2 \Bigg] - \frac{N}{2} \ln(2\pi) + \frac{N}{2} \ln(\beta) \newline & = - \frac{\beta}{2} \sum_n^N \big[t_n - y(x_n, \textbf{w}) \big]^2 - \frac{N}{2} \ln(2\pi) + \frac{N}{2} \ln(\beta) \newline \end{split} \end{equation} $$

## Maximizing the likelihood

Our goal is to find the vector \(\textbf{w}_{ML}\) that, when plugged in for
\(\textbf{w}\) in the log expression, results in the highest possible value.
The polynomial whose coefficients are \(\textbf{w}_{ML}\) is the polynomial
that *most likely* correctly approximates the process that generated our data.
There are many ways to maximize functions, but this case is simple enough that
we can use the technique of setting the derivative equal to zero. This must be
carried out for each \(w_i \in \textbf{w}\), but as we will see, the
derivatives follow a nice pattern that can be generalized and written down once
and for all.

### Generalized Derivative

First we derive an expression that fits the derivative for all the \(w_j\).

$$ \begin{equation} \begin{split} \frac{\partial}{\partial w_j} \ln \Big[ p(\textbf{t} | \textbf{x}, \textbf{w}, \beta^{-1} ) \Big] & = \frac{\partial}{\partial w_j} \Big[ -\frac{\beta}{2} \sum_n^N \big\{ t_n - (w_0 + w_1x_n + \dots + w_mx_n^m) \big\}^2 \Big] \newline & = -\frac{\beta}{2} \sum_n^N \frac{\partial}{\partial w_j} \Big[ t_n - (w_0 + w_1x_n + \dots + w_mx_n^m) \Big]^2 \newline & = -\frac{\beta}{2} \sum_n^N 2 \Big[ t_n - (w_0 + w_1x_n + \dots + w_mx_n^m) \Big] \frac{\partial}{\partial w_j} \Big[ t_n - (w_0 + w_1x_n + \dots + w_mx_n^m) \Big] \newline & = -\beta \sum_n^N \Big[ t_n - (w_0 + w_1x_n + \dots + w_mx_n^m) \Big] x_n^j \end{split} \end{equation} $$

### The Matrix Equation

Next we set the derivative equal to zero, and distribute the summation. This lets us move the term involving \(t_n\) to the other side of the equal sign, as well as pull out the vector of unknowns. The result is a matrix equation in the form \(\textbf{A}\textbf{x} = \textbf{b}\), which can be solved by a matrix inversion.

$$ \begin{equation} \begin{split} -\beta \sum_n^N \Big[ t_n - (w_0 + w_1x_n + \dots + w_mx_n^m) \Big] x_n^j & = 0 \newline -\beta \sum_n^N \Big[ t_nx_n^j - (w_0 + w_1x_n + \dots + w_mx_n^m)x_n^j \Big] & = 0 \newline -\beta \sum_n^N \Big[ t_nx_n^j - w_0x_n^j - w_1x_n^{j+1} - \dots - w_mx_n^{m+j} \Big] & = 0 \newline -\beta \sum_n^N t_nx_n^j + \beta\sum_n^N w_0x_n^j + \beta\sum_n^N w_1x_n^{j+1} + \dots + \beta\sum_n^N w_mx_n^{j+m} & = 0 \newline \beta\sum_n^N w_0x_n^j + \beta\sum_n^N w_1x_n^{j+1} + \dots + \beta\sum_n^N w_mx_n^{j+m} & = \beta \sum_n^N t_nx_n^j \newline \end{split} \end{equation} $$

$$ \begin{gather} \begin{bmatrix} \sum 1 & \sum x_n & \dots & \sum x_n^m \newline \sum x_n & \sum x_n^2 & \dots & \sum x_n^{m+1} \newline \vdots & \vdots & \ddots & \vdots \newline \sum x_n^N & \sum x_n^{N+1} & \dots & \sum x_n^{N+m} \end{bmatrix} \begin{bmatrix} w_0 \newline w_1 \newline \vdots \newline w_m \end{bmatrix} = \begin{bmatrix} \sum t_n \newline \sum x_nt_n \newline \vdots \newline \sum x_n^Nt_n \end{bmatrix} \end{gather} $$

## Fitting Curves In the Real World

As with most math I learn, I don’t feel like I’ve completed the process until I
am able to implement the idea as a computer program. And seeing sensible results
gives me more evidence that my understanding is correct.^{2} To that
end, I have written a short python program that uses the maximum likelihood
technique to fit a polynomial hiding behind Gaussian noise.

```
#!/usr/bin/env python3
import matplotlib.pyplot as plt
import numpy as np
def ml_fit(x, t, d):
N = len(x)
A = np.empty([N, d])
T = np.empty([N, 1])
for i in range(N):
for j in range(d):
A[i][j] = sum([xn**(i+j) for xn in x])
T[i] = sum([xn**i * tn for xn, tn in zip(x, t)])
return np.linalg.lstsq(A, T, rcond=None)
def polyeval(c, x):
return sum([c[i]*x**i for i in range(len(c))])
cheb = np.polynomial.chebyshev.Chebyshev((0,0,0,0,0,1))
cheb_coef = np.polynomial.chebyshev.cheb2poly(cheb.coef)
N = 300
noise = np.random.normal(0, .5, N)
xs = np.linspace(-1, 1, N)
ts = [polyeval(cheb_coef, xs[i]) + noise[i] for i in range(N)]
d = 6
w = ml_fit(xs, ts, d)[0]
fitted, = plt.plot(xs, polyeval(w, xs), color='r', label='Fitted Curve')
underlying, = plt.plot(xs, polyeval(cheb_coef, xs), color='g', label='Chebyshev Curve')
data = plt.scatter(xs, ts, s=5, label='Input Data')
plt.legend(handles=[fitted, underlying, data])
plt.title('Fitting w/ {} data points'.format(N))
plt.show()
```

The plots below show the results of using the script to fit a fifth degree Chebyshev polynomial on the interval [-1, 1], using successively more data points.

1: We here at The Pulsar Cafe appreciate the value of explanations which control anticipation, and so would never actually discard something because it “sucks big time.” But I like dropping those sorts of phrases in with technical explanations. The real reason you might want to avoid differential calculus with products of functions is that the chain rule can cause the number of functions you are dealing with in your identities to increase to an unwieldy amount. Especially if doing analysis by hand.

2: On the flipside, there have been several times that incorrect results from a
program have shown me my understanding was *not* correct, even when I thought it
was.