4. Lagrange Polynomials and Their Problems
The focus of today will be on interpolating polynomials. We’ll review some of the theory behind them (e.g. error analysis), and we’ll look at the various different ways of computing interpolating polynomials.
We have the following general setup: there are some points \(x_0,\ldots, x_n\) on the real number line, and there is some data \(y_0,\ldots, y_n\). One should think of this as describing the \(n+1\) points \(\left( x_0,y_0 \right), \ldots, \left( x_n, y_n \right)\) on the graph of some unknown function, and one is interested in “connecting the dots” systematically, hopefully producing an error bound on how far off the true function could possibly be. In other words, \(y_k = f\left( x_k \right)\) for some mysterious \(f\), and we wish to approximate \(f(x)\) near these data points.
From linear algebra (or from the fact that a nonzero polynomial of degree at most \(n\) has at most \(n\) roots), there exists exactly one polynomial of degree at most \(n\) passing through these \(n+1\) points. One way to write this polynomial is the so-called Lagrange form: \[P(x) = \sum _{k=0}^{n} y_k\cdot \left(\prod _{\substack{i = 0,\ldots, n \\ i \neq k}} \frac{x - x_i}{x_k - x_i}\right).\]
One of the benefits of this particular form is that it’s clear that \(P\left( x_k \right) = y_k\) for each \(k=0,\ldots, n\) — all but one of the summands gets zeroed out by the product, and the remaining summand is exactly equal to \(y_k\).
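To make the formula concrete, here is a minimal (and deliberately naive) MATLAB sketch that evaluates the Lagrange form exactly as written; the function name lagrange_eval is my own choice, and this is not how we’ll do things in practice.

function [value] = lagrange_eval(x_data, y_data, x)
    % Evaluates the Lagrange form term by term; x may be a scalar or an array.
    n = length(x_data);
    value = zeros(size(x));
    for k = 1:n
        basis = ones(size(x));  % build the k-th Lagrange basis polynomial
        for i = 1:n
            if i ~= k
                basis = basis .* (x - x_data(i)) / (x_data(k) - x_data(i));
            end
        end
        value = value + y_data(k) * basis;  % add y_k times its basis polynomial
    end
end % of lagrange_eval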
However, evaluating \(P(x)\) in the above form is relatively expensive, especially if one is interpolating many, many data points. Each of the products costs \(2n\) subtractions, \(n\) divisions, and \(n\) multiplications (including the multiplication by \(y_k\)). There are \(n+1\) of these products and \(n\) additional additions in the sum, for a grand total of \(4n\left( n+1 \right)+n=4n^2+5n\) operations.
There are two general methods of evaluating and computing these interpolating polynomials efficiently: Neville’s method and the divided difference method.
Neville’s method is best for when you have a set of many, many different data points (i.e., \(n\) is very large) and you want to use a lower-order interpolating polynomial that minimises the potential error at a single point. The divided difference method is for when you know exactly how many data points you want to interpolate, but you need to evaluate the interpolating polynomial at many different points.
Error Estimates
Before continuing, note that interpolating polynomials are not only used when trying to interpolate an unknown function. Within the coming weeks, you’ll be learning about quadrature (i.e., numerically approximating the integral of a function). Riemann sums are okay, but they suffer from a large error, both from floating-point arithmetic and simply by nature.
So, when attempting to integrate a function, say \(\int _{0}^{1}f(x) dx\) where \(f\) is known, a common strategy is to use an interpolating polynomial for \(f\) called \(P(x)\), then integrate the interpolating polynomial. \(\int _{0}^{1}P(x) dx\) can be computed explicitly relatively easily (more on this later…), and if you use the right number of data points, one can expect \(P(x)\) to be very close to \(f(x)\).
The point is that although error estimates are generally completely useless when interpolating an unknown function, there are still scenarios where we wish to use interpolation as an approximation for a known function. In these cases, we care a lot about what the expected error will be.
Theorem 1.
Let \(f\) be an \((n+1)\)-times continuously differentiable function, and let \(x_0 < x_1 < \cdots < x_n\) be points in its domain. Let \(P(x)\) be the interpolating polynomial for \(f\) at these points. Then, for any \(x\in \left[ x_0, x_n \right]\), there exists some \(\xi \in \left[ x_0,x_n \right]\) such that \[f(x) = P(x) + \frac{f ^{(n+1)}(\xi )}{(n+1)!}\prod _{k=0}^{n}\left( x-x_k \right).\]
This should remind you of the Taylor expansion estimate! One can prove this by spamming the mean value theorem and using induction on \(n\). When \(n=0\), this reduces to the ordinary mean value theorem. For \(n>0\), one can express \(P\) in terms of two lower-order interpolating polynomials; this idea underpins Neville’s method (which we will discuss shortly).
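In the simplest nontrivial case, \(n=1\) (linear interpolation between \(x_0\) and \(x_1\), with spacing \(h = x_1 - x_0\)), the theorem recovers the familiar error bound for linear interpolation: \[\left| f(x) - P(x) \right| = \frac{\left| f''(\xi ) \right|}{2}\left| x-x_0 \right|\left| x-x_1 \right| \leq \frac{h^2}{8}\max _{t\in \left[ x_0,x_1 \right]}\left| f''(t) \right|,\] since \(\left| x-x_0 \right|\left| x-x_1 \right|\) is largest at the midpoint of the interval, where it equals \(h^2 /4\).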
At first glance, it appears that if you have a very large number of points concentrated in a small interval, this remainder term ought to be very, very small, on one hand because most of the \(\left( x-x_k \right)\)’s will be tiny, on the other because of the \((n+1)!\) in the denominator. However, it is not true that the error term approaches zero as “\(n\to\infty\)” in general: the \(n+1\)-th derivative of \(f\) can grow very large very quickly, faster even than the product and the factorial! We’ll see an example of this in action later today. Do not make the mistake of thinking that interpolating polynomials always approach the target function.
In fact, this highlights an important dichotomy that one has to face: one should avoid using too few data points to perform interpolation, as this “loses information”, but one should also avoid using too many data points. In practice, the idea that there is such a thing as “using too many data points” is called “overfitting data”. There are often practical issues that can cause this problem, most prominently the presence of statistical biases in data samples that hypertuned models will pick up on. However, even in the complete absence of any of these issues, there is a theoretical problem as outlined above.
Neville’s Method
Let’s refresh ourselves on where we are right now: we have \(n+1\) points \(x_0 < x_1 < \cdots < x_n\) and a mystery function \(f\) that we wish to interpolate between these points. We know what \(f\left( x_0 \right),\ldots, f\left( x_n \right)\) are, and there is a point \(x\) where we wish to estimate \(f(x)\).
Because of the error issues described in the previous section, we might be hesitant to use all \(n+1\) data points in our interpolating polynomial. But how many data points is “too many” and how many is “too few”? In this case, the best way forward is to brute-force our way through: just compute a bunch of interpolating polynomials of varying degrees, then look for a value that many interpolating polynomials agree on.
As mentioned near the beginning, computing interpolating polynomials can be very expensive, and this brute-force approach only compounds the problem. Neville’s method addresses this by finding a way to recycle the evaluations of two degree-\(n\) interpolating polynomials to produce the evaluation of a degree-\((n+1)\) interpolating polynomial.
Recall that the points \(x_0,\ldots, x_n\) are arranged in increasing order. We will denote by \(Q _{i, j}(x)\) the interpolating polynomial for the points \(\left( x_{i-j}, f\left( x_{i-j} \right) \right), \ldots, \left( x _{i}, f\left( x _{i} \right) \right)\) (i.e., the point at index \(i\) together with the \(j\) points immediately to its left). Neville’s method relies on the following identity:
Proposition 2. Neville's Formula
\[Q _{i,j}(x) = \frac{1}{x_i - x _{i-j}} \left( \left( x-x _{i-j} \right)Q _{i, j-1}(x) - \left( x-x_i \right) Q _{i-1,j-1}(x) \right).\]
Watch the minus sign! Even the textbook has made errors with this. The procedure is more or less inductive and fills out a table column-by-column or row-by-row, such as the one below:
\[ \begin{array}{c | ccccc} & j=0 & j=1 & j=2 & j=3 & j=4 \\ \hline i=0 & Q _{0, 0}(x) &&&& \\ i=1 & Q _{1, 0}(x) & Q _{1, 1}(x) &&& \\ i=2 & Q _{2, 0}(x) & Q _{2, 1}(x) & Q _{2, 2}(x) && \\ i=3 & Q _{3, 0}(x) & Q _{3, 1}(x) & Q _{3, 2}(x) & Q _{3, 3}(x) & \\ i=4 & Q _{4, 0}(x) & Q _{4, 1}(x) & Q _{4, 2}(x) & Q _{4, 3}(x) & Q _{4, 4}(x) \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{array} \]
For the first column, one has \(Q _{i, 0}(x) = f\left( x_i \right)\) for any \(x\). The first column is just a copy of the data you’re trying to interpolate! Then, you can use the formula above to fill in the second column using the data in the first column. Then repeat: use the second column to fill in the third column, etc. We’ll see an example of this in action later when we implement Neville’s method in code.
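As a quick sanity check of the formula, consider the very first entry off the data column, \(Q _{1,1}\). It is built from the two constants \(Q _{0,0}(x) = f\left( x_0 \right)\) and \(Q _{1,0}(x) = f\left( x_1 \right)\), and Neville’s formula gives \[Q _{1,1}(x) = \frac{\left( x-x_0 \right) f\left( x_1 \right) - \left( x-x_1 \right) f\left( x_0 \right)}{x_1 - x_0},\] which is exactly the secant line through \(\left( x_0, f\left( x_0 \right) \right)\) and \(\left( x_1, f\left( x_1 \right) \right)\), i.e., the interpolating polynomial for those two points.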
Divided Differences
Neville’s method is great and all, but it’s still an immense amount of computation just to evaluate an interpolating polynomial at a single point. A degree-\(n\) polynomial in nested form takes only \(n\) multiplications and \(n\) additions to evaluate, and this difference in computational efficiency manifests itself very clearly when you need to evaluate the interpolating polynomial at thousands or millions of different points.
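For instance, the degree-two polynomial \(3x^2 - 2x + 5\) can be nested as \[3x^2 - 2x + 5 = 5 + x\left( -2 + 3x \right),\] which takes exactly two multiplications and two additions to evaluate, versus three multiplications and two additions for the expanded form.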
This is the main objective of the divided differencing method: to find an alternative representation of the interpolating polynomial that’s easier to evaluate in bulk.
The idea is to write the interpolating polynomial in the form \[P(x) = \sum _{k=0}^{n} a_k\cdot \left( \prod _{j=0}^{k-1}\left( x-x_j \right) \right) = a_0 + a_1 \left( x-x_0 \right) + a_2\left( x-x_0 \right)\left( x-x_1 \right) + \cdots,\] where the coefficients \(a_0,\ldots, a_n\) are real numbers, to be determined later. Once these coefficients are known, it is easy to compute \(P(x)\) for many distinct values of \(x\) quickly. One can even nest this polynomial as follows: \[P(x) = a_0 + \left( x-x_0 \right)\left( a_1+\left( x-x_1 \right)\left( a_2+\cdots \right) \right).\] Plugging in \(x=x_0\), and knowing that \(P\left( x_0 \right)=f\left( x_0 \right)\), all of the nonconstant terms disappear, and we’re left with \(a_0=f\left( x_0 \right)\).
Plugging in \(x=x_1\), all but the first two terms disappear, and we have \[f\left( x_1 \right)=P\left( x_1 \right)=a_0 + a_1\left( x_1-x_0 \right).\] This can be rearranged to solve for \(a_1\), yielding the so-called divided difference \[a_1 = \frac{f\left( x_1 \right)-f\left( x_0 \right)}{x_1-x_0}.\]
Remark 3.
This is the underlying principle behind the method of divided differences: we have rewritten the interpolating polynomial in a nested form where the coefficients can be solved for easily. We will momentarily introduce some gnarly notation and cumbersome formulae, but for interpolating polynomials of low degree (say, \(n < 4\)), I find it easier to just solve for the coefficients directly as above, especially when writing things out by hand.
There is a recursive formula for these coefficients in terms of the so-called divided differences. We define \[f \left[ x_i \right] = f\left( x_i \right)\] for any \(i = 0,\ldots, n\), then recursively define \[f \left[ x_i, \ldots, x _{i+j} \right] = \frac{f\left[ x _{i+1}, \ldots, x _{i+j} \right] - f\left[ x_i, \ldots, x _{i+j-1} \right]}{x _{i+j}-x_i}.\] Similar to Neville’s method, one can produce a triangular table for the divided differences as follows: \[\begin{array}{c|cccccc} & j=0 & j=1 & j=2 & j=3 & j=4 & \cdots \\ \hline i=0 & f\left[ x_0 \right] & f\left[ x_0, x_1 \right] & f \left[ x_0,\ldots, x_2 \right] & f \left[ x_0,\ldots, x_3 \right] & f\left[ x_0, \ldots, x_4 \right] & \cdots \\ i=1 & f\left[ x_1 \right] & f\left[ x_1, x_2 \right] & f\left[ x_1,\ldots, x_3 \right] & f\left[ x_1, \ldots, x_4 \right] & \cdots & \\ i=2 & f \left[ x_2 \right] & f \left[ x_2, x_3 \right] & f\left[ x_2,\ldots,x _4 \right] & \cdots && \\ i=3 & f \left[ x_3 \right] & f\left[ x_3, x_4 \right] & \cdots &&& \\ i=4 & f \left[ x_4 \right] & \cdots &&&&\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \end{array}\]
Here’s the magic: just like Neville’s method, the first column is just a copy of the data you’re trying to interpolate, and just like Neville’s method, you can fill in the table column by column using the formula for divided differences. The coefficients \(a_0, \ldots, a_n\) can be read off of the first row of the table instantly: one has \[a_j = f \left[ x_0,\ldots, x_j \right].\]
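Here is a tiny worked example (with made-up data): interpolate the three points \((0,1), (1,2), (2,5)\). The table gives \[f\left[ x_0 \right] = 1,\quad f\left[ x_1 \right] = 2,\quad f\left[ x_2 \right] = 5,\qquad f\left[ x_0,x_1 \right] = \frac{2-1}{1-0} = 1,\quad f\left[ x_1,x_2 \right] = \frac{5-2}{2-1} = 3,\qquad f\left[ x_0,x_1,x_2 \right] = \frac{3-1}{2-0} = 1,\] so the interpolating polynomial is \(P(x) = 1 + 1\cdot \left( x-0 \right) + 1\cdot \left( x-0 \right)\left( x-1 \right) = x^2+1\), which indeed passes through all three points.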
Remark 4.
If you try solving for \(a_2\) algebraically, you’ll more likely end up with the formula \[a_2 = \frac{1}{x_2-x_1} \left( \frac{f\left( x_2 \right)-f\left( x_0 \right)}{x_2-x_0} - \frac{f \left( x_1 \right)-f\left( x_0 \right)}{x_1-x_0} \right),\] which at first glance does not seem to line up with \(a_2 = f\left[ x_0,x_1,x_2 \right]\). However, it turns out that the two quantities are equal. This tabular method for divided differences works because of some algebraic madness behind the scenes that the government doesn’t want to share with us.
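For what it’s worth, the toy data above gives a quick numerical sanity check: the direct formula yields \[a_2 = \frac{1}{2-1}\left( \frac{5-1}{2-0} - \frac{2-1}{1-0} \right) = 2 - 1 = 1,\] which matches \(f\left[ x_0,x_1,x_2 \right] = 1\) from the table.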
Implementation Details and Runge’s Phenomenon
Let’s explore how polynomial interpolation behaves graphically and numerically when we try to interpolate the function \(f(x) = \frac{1}{1+25x^2}\) on the interval \(\left[ -1, 1 \right]\).
Let’s begin with Neville’s method using \(11\) equally spaced points. We’ll first write the function neville, which takes a pair of row vectors x_data = [x0 x1 ... x10] and y_data = [y0 y1 ... y10] and a point x, then spits out the complete table described before. Let’s write the pseudocode for this really quickly:
- INPUT: x_data, y_data, and x.
- Set N to the length of x_data.
- Set table to an \(N\times N\) array of zeroes.
- Set the first column of table to y_data.
- For each j = 2, 3, 4, ..., N, repeat: (j is the column)
    - For each i = j, j + 1, ..., N, repeat: (i is the row)
        - Compute table(i, j) according to Neville's formula.
- OUTPUT: table.
In MATLAB code, it would be as follows:
 1  function [table] = neville(x_data, y_data, x)
 2      N = length(x_data);
 3      table = zeros(N);
 4      table(:, 1) = y_data; % magic, see below!
 5
 6      for j = 2:N
 7          for i = j:N
 8              table(i, j) = (x - x_data(i-j+1)) * table(i, j-1);             % first piece of Neville's formula
 9              table(i, j) = table(i, j) - (x - x_data(i)) * table(i-1, j-1); % minus the second piece
10              table(i, j) = table(i, j) / (x_data(i) - x_data(i-j+1));       % divide by x_i - x_{i-j}
11          end % of inner for loop
12      end % of outer for loop
13  end % of neville
There is a bit of magic in the fourth line of the above code. The syntax table(:, 1) says “select everything in column 1”, and it copies in the vector y_data directly! Similarly, the syntax table(4, :) (for example) would select everything in row 4. The : just says “take everything”. This shorthand is really helpful, and we’ll see it again when we implement the divided difference method!
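If the colon syntax is new to you, a quick experiment at the command line makes it clear (the matrix here is just an arbitrary example):

A = magic(4)  % an arbitrary 4-by-4 matrix
A(:, 1)       % everything in column 1, as a column vector
A(4, :)       % everything in row 4, as a row vector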
Warning 5. Watch for off-by-one errors
I have been following the book’s (disagreeable) indexing notation for \(Q _{i,j}\), which allows \(i\) and \(j\) to be zero. This means that \(x_0\) in the book’s notation is x_data(1) in the MATLAB code. These off-by-one differences can cause significant problems if you’re not careful. Why did I use x_data(i-j+1) instead of x_data(i-j)?
Let’s now put this function to use and see if we can approximate \(f\left( 0.05 \right)\).
f = @(x) 1 ./ (1 + 25 * x.^2);
x_data = linspace(-1, 1, 11);
y_data = f(x_data);
neville(x_data, y_data, 0.05)
When reading the table, there are some things to keep in mind:
- Pay attention to whether neighbouring interpolating polynomials agree or line up well. In this case, you should find that the values in the lower-right corner don’t vary much, so our interpolation is reasonably good.
- Make sure you are paying attention to which points are being used in each interpolating polynomial. In this case, \(x\) falls between \(x_5\) and \(x_6\); it would not make sense to interpolate between \(x_7\) and \(x_8\) (i.e., we should ignore \(Q _{8, 1}\)).
In this case, we should use the entry in the very lower-right corner, which corresponds to \(Q _{10, 10}(0.05)\). This interpolation uses all \(11\) data points, so it’s applicable here. While this entry does not minimise the absolute error, it’s a very good choice if we don’t know what \(f\left( 0.05 \right)\) is exactly.
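In code, one way to pull out that corner entry is with MATLAB’s end keyword (reusing the variables from the snippet above):

Q = neville(x_data, y_data, 0.05);
Q(end, end)  % the lower-right corner, i.e. Q_{10,10}(0.05) in the book's notation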
Now modify the code to estimate \(f\left( 0.95 \right)\). What goes wrong?
This is called Runge’s phenomenon, which more or less says that when interpolating at many equally spaced points, the error in the interpolating polynomial can be very, very large near the endpoints of your dataset!
Let’s switch gears and try to get a graphical view of what’s going on. We’ll plot the following three functions on the same graph:
- the original function \(f(x)\),
- the interpolating polynomial using \(5\) equally spaced data points, and
- the interpolating polynomial using \(11\) equally spaced data points.
Since we’re now graphing the interpolating polynomials, it doesn’t make a lot of sense to use Neville’s method. Instead, we should use the divided differencing method, which produces a list of coefficients that can be reused for many different evaluations at a much lower cost.
We’ll need to write two functions: one to find the coefficients, and one to evaluate the polynomial itself. Here’s the code for the two functions. Try to implement the code yourself first, and make sure you understand what my code is doing!
 1  function [coefs] = fit_dd(x_data, y_data)
 2      N = length(x_data);
 3      table = zeros(N);
 4      table(:, 1) = y_data;
 5
 6      for j = 2:N
 7          for i = 1:(N - j + 1)
 8              table(i, j) = (table(i+1, j-1) - table(i, j-1));         % numerator of the divided difference
 9              table(i, j) = table(i, j) / (x_data(i+j-1) - x_data(i)); % denominator: span of the x-values used
10          end % of inner for loop
11      end % of outer for loop
12
13      coefs = table(1, :); % the coefficients live in the first row
14  end % of fit_dd
Again, you need to be wary of off-by-one errors here. This ugly beast rears its head on lines 7 (in the bounds of the for loop) and 9 (when computing the divided difference). Why did I add 1?
 1  function [value] = eval_dd(coefs, x_data, x)
 2      % uses nesting!
 3      N = length(coefs);
 4      value = coefs(N) * ones(size(x));
 5      for i = (N - 1):-1:1
 6          value = value .* (x - x_data(i)) + coefs(i);
 7      end % of for loop
 8  end % of eval_dd
This implementation of eval_dd is compatible with matrix inputs, which is really helpful for plotting functions. What is the purpose of the ones(size(x))? Why did I use .* instead of * on line 6? How the heck does this even use nesting? I’ll leave the answers to these as an exercise.
The last thing to do, which I’ll leave as an exercise again, is to create a plot of \(f\) and the two interpolating polynomials for \(5\) and \(11\) evenly spaced points on the same set of axes. What happens? How can you use the error formula to explain this phenomenon?
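If you’d like something to compare your attempt against, here is one possible scaffold, assuming the fit_dd and eval_dd functions above are saved on your path; the variable names are my own.

f = @(x) 1 ./ (1 + 25 * x.^2);
xs = linspace(-1, 1, 500);                              % fine grid for plotting

x5  = linspace(-1, 1, 5);    c5  = fit_dd(x5,  f(x5));  % 5-point interpolant
x11 = linspace(-1, 1, 11);   c11 = fit_dd(x11, f(x11)); % 11-point interpolant

plot(xs, f(xs), xs, eval_dd(c5, x5, xs), xs, eval_dd(c11, x11, xs));
legend('f(x)', '5 points', '11 points');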
Remark 6.
Runge’s phenomenon can occur when the data points are not chosen well, and this is reflected by the growing error term. There is a way to circumvent this by concentrating the data points closer to the boundaries of your interval of interest. This is beyond the scope of this class; look into Chebyshev nodes for more about how to “optimally choose” these interpolation points.