
11. Week 10: The Inverse and Implicit Function Theorems


If you’ve taken linear algebra before, you should be familiar with the “change of basis” operation. One can graphically interpret this operation as rotating and scaling the coordinate axes of a vector space; in two dimensions, the square coordinate grid then transforms into a skew parallelogram coordinate grid. In a sense, such an operation transforms the perspective with which one sees space.

Some great theorems of linear algebra say that given a linear transformation, there is a “canonical perspective” to take: rational canonical form, Jordan canonical form, and singular value decomposition come to mind, but algorithms such as Gram-Schmidt and Gaussian elimination also fit into this idea.

Such canonical forms are useful because they encapsulate qualitative data about the algebraic behaviour of a linear transformation. For instance, the Jordan canonical form of a matrix describes both the set of eigenvalues of the transformation and the generalised eigenspaces attached to them. The output of Gaussian elimination describes the dimension of the image (and also the null space) of the linear transformation.

In each of these cases, given a matrix $A\in \mathbb{R} ^{m\times n}$ representing a linear transformation $\mathbb{R}^n\to \mathbb{R}^m$ in the standard bases, one can find linear isomorphisms $\Phi : \mathbb{R}^n\to \mathbb{R}^n$ and $\Psi : \mathbb{R}^m\to \mathbb{R}^m$ (identified with their matrices in the standard bases) such that the corresponding matrix $\Psi ^{-1} A \Phi$ is “nice” (diagonal, in Jordan canonical form, in reduced row echelon form, etc.). This is the idea of a “change of coordinates”.
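For instance, one of the simplest “canonical forms” of this type is the rank normal form: if $A\in \mathbb{R}^{m\times n}$ has rank $r$, then one can choose $\Phi$ and $\Psi$ (for example, by row and column reduction) so that $$\Psi ^{-1} A \Phi = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix},$$ where $I_r$ is the $r\times r$ identity block. In this case, the only qualitative data remembered by the canonical form is the single number $r$.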

One might wonder, is there a way to meaningfully adapt the idea of the linear change of coordinates so that $C^1$ maps $\mathbb{R}^n\to \mathbb{R}^m$ have some sort of “canonical form”? What would even constitute a “change of coordinates”?

This is the principle of the implicit and inverse function theorems (and more generally, the rank theorem).

The inverse function theorem says that if $F: \mathbb{R}^n\to \mathbb{R}^n$ is $C^1$ and has invertible derivative $\left. DF\right\rvert_{x_0}$ for some $x_0$, then for some open neighbourhoods $U$ of $x_0$ and $V$ of $F\left( x_0 \right)$, $F : U \to V$ is a bijection with $C^1$ inverse. If $F$ were a linear function, linear algebra just tells us that $F$ is a change of basis itself.
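A standard example to keep in mind is $F(x, y) = \left( e^x \cos y, e^x \sin y \right)$ on $\mathbb{R}^2$. Here $$\left. DF\right\rvert_{(x,y)} = \begin{bmatrix} e^x \cos y & -e^x \sin y \\ e^x \sin y & e^x \cos y \end{bmatrix}, \qquad \det \left. DF\right\rvert_{(x,y)} = e^{2x} \neq 0,$$ so the theorem produces a local $C^1$ inverse around every point. On the other hand, $F(x, y) = F(x, y + 2\pi)$, so $F$ is not globally injective: the conclusion really is only local.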

The implicit function theorem, on the other hand, says that if $F: \mathbb{R}^n\to \mathbb{R}^m$ is $C^1$ and $\left. DF\right\rvert_{x_0}$ is surjective for some $x_0$, then there exist:

- open neighbourhoods $U\subseteq \mathbb{R}^n$ of $x_0$ and $V\subseteq \mathbb{R}^m$ of $F\left( x_0 \right)$,
- $C^1$ changes of coordinates $\Phi : U^\prime \to U$ and $\Psi : V^\prime \to V$, where $U^\prime\subseteq \mathbb{R}^n$ and $V^\prime\subseteq \mathbb{R}^m$ are open,

such that $$\Psi ^{-1} \circ F \circ \Phi \left( x_1,\ldots, x_n \right) = \left( x_1,\ldots, x_m \right).$$

Note the similarity between this expression and that of a linear change of basis! This says that the “canonical form” of such an $F$ is just projection onto the first $m$ coordinates, up to a change of variables. When $F$ is a linear function, one can take $\Phi$ and $\Psi$ to be linear isomorphisms of the domain and codomain and arrive at exactly the same conclusion.
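To make this concrete, here is a small example with $n = 2$ and $m = 1$: take $F(x, y) = x + y^2$, so that $\left. DF\right\rvert_{(x,y)} = \begin{bmatrix} 1 & 2y \end{bmatrix}$ is surjective everywhere. Choosing $\Phi \left( x_1, x_2 \right) = \left( x_1 - x_2^2, x_2 \right)$ (a $C^1$ change of coordinates of $\mathbb{R}^2$, with inverse $(x, y)\mapsto \left( x + y^2, y \right)$) and $\Psi = \operatorname{id}$, one checks that $$\Psi ^{-1}\circ F \circ \Phi \left( x_1, x_2 \right) = \left( x_1 - x_2^2 \right) + x_2^2 = x_1,$$ which is exactly projection onto the first coordinate. The theorem guarantees that such $\Phi$ and $\Psi$ exist locally near any point where the derivative is surjective, even when we cannot write them down by hand.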

I’m writing things out in this somewhat funny and redundant notation/terminology because it makes very explicit the correspondence between linear algebra and the calculus of $C^1$ functions. Almost anything you can do linear-algebraically with their derivatives, you can mimic with these $C^1$ functions themselves.

I should remark that the above version of the implicit function theorem appears rather different from the version Rudin presents, which I’ll repeat here:

Theorem 1. Implicit Function Theorem

If $f : \mathbb{R}^{n}_x \times \mathbb{R}^{m}_y\to \mathbb{R}^n$ is $C^1$ and $\left. D_xf\right\rvert_{(a, b)}$ is surjective for some $(a, b)\in \mathbb{R} ^{n+m}$, then there exist an open neighbourhood $U$ of $(a, b)$, an open neighbourhood $W$ of $b$, and a $C^1$ function $g : W\to \mathbb{R}^n$ such that $g(b) = a$, $\left( g(y), y \right)\in U$, and $f \left( g(y), y \right) = f(a, b)$ for all $y\in W$.
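As a quick illustration with $n = m = 1$: take $f(x, y) = x^2 + y$ and $(a, b) = (1, 0)$, so that $\left. D_xf\right\rvert_{(1,0)} = 2 \neq 0$. Solving $f\left( g(y), y \right) = f(1, 0) = 1$ near this point gives $g(y) = \sqrt{1 - y}$ on, say, $W = (-1, 1)$, which is indeed a $C^1$ function with $g(0) = 1$. The theorem asserts that this sort of “solving for $x$ in terms of $y$” is always possible locally, even when no closed-form expression for $g$ is available.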

These two versions are equivalent! In fact, in the proof of the implicit function theorem, one must first appeal to the inverse function theorem on $\mathbb{R} ^{n+ m}$ before solving for this function $g$. The output of the inverse function theorem is exactly the $\Phi$ described above (or perhaps $\Phi ^{-1}$).

Let us provide an example of this principle:

Lagrange Multipliers

Suppose $f, g : \mathbb{R}^n\to \mathbb{R}$ are both $C^1$ functions, and suppose I want to find $x_*\in \mathbb{R}^n$ such that $$f\left( x_* \right) = \max \left\lbrace f(x) : g(x) = 0 \right\rbrace.$$ In words, we are maximising the function $f$ subject to the constraint that $g(x) = 0$.

We’ll assume that $\nabla g \neq 0$ on this level set of $g$.

As you may know, the Lagrange multiplier theorem states that at the point $x_*$, one has $\nabla f \left( x_* \right) = \lambda \nabla g \left( x_* \right)$ for some constant $\lambda \in \mathbb{R}$. Thus, to solve the constrained optimisation problem, one looks for points in $g ^{-1} (0)$ at which $\nabla f$ and $\nabla g$ are parallel, in addition to the constraint $g(x) = 0$. This usually results in only finitely many points to check.
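For a concrete instance of how this is used, suppose we want to maximise $f(x, y) = x + y$ subject to $g(x, y) = x^2 + y^2 - 1 = 0$. The condition $\nabla f = \lambda \nabla g$ reads $(1, 1) = \lambda (2x, 2y)$, so $x = y = \frac{1}{2\lambda}$; plugging this into the constraint gives $x = y = \pm \frac{1}{\sqrt 2}$. Comparing the two candidate points, the maximum is attained at $x_* = \left( \frac{1}{\sqrt 2}, \frac{1}{\sqrt 2} \right)$, where $\nabla f \left( x_* \right) = (1, 1) = \frac{1}{\sqrt 2}\, \nabla g \left( x_* \right)$.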

There is a geometric heuristic behind this principle: one has the approximation $$f\left( x_* + h \right) \approx f \left( x_* \right) + \nabla f\left( x_* \right)\cdot h,$$ and likewise for $g$. If $h$ is perpendicular to $\nabla g$, then one has $$g \left( x_* \pm h\right) \approx g \left( x_* \right) \pm \nabla g \left( x_* \right)\cdot h = 0.$$ Thus, knowing that $$f \left( x_* \right) \geq f\left( x_* \pm h \right) \approx f \left( x_* \right) \pm \nabla f \left( x_* \right) \cdot h,$$ one must have $\nabla f \left( x_* \right)\cdot h = 0$, i.e. $\nabla f$ is also perpendicular to $h$. Hence $\nabla f$ and $\nabla g$ are perpendicular to the same vectors, and they must be parallel.

However, this geometric heuristic is difficult to make precise. This $\approx$ symbol must be handled delicately, and one in general cannot expect $g \left( x_* + h \right) = 0$ exactly. Thus either $x_*\pm h$ doesn’t stay within the constraint, or the perpendicularity against $\nabla g$ isn’t exact, and neither of these is acceptable. Ultimately, we’re trying to apply a linear-algebraic argument in a $C^1$ setting. Why should we expect this to work?

Here’s a more robust version of this argument: given any vector $v\in \mathbb{R}^n$ such that $\nabla g \left( x_* \right) \cdot v = 0$, there exists a $C^1$ curve $\gamma : (-1, 1)\to \mathbb{R}^n$ such that $g \left( \gamma (t) \right) = 0$ for all $t$, $\gamma (0) = x_*$, and $\gamma ^\prime(0) = v$. Then, $f\circ \gamma (t)$ has a local maximum at $t = 0$, hence $$\left. \frac{d}{dt} f\circ \gamma (t)\right\rvert_{t=0} = \nabla f \left( x_* \right) \cdot \gamma ^\prime (0) = \nabla f \left( x_* \right) \cdot v = 0,$$ and we conclude that $\nabla f$ and $\nabla g$ are parallel by the same argument as before.

This addresses the problem of precision: we have no more of those dreaded $\approx$’s floating around. In fact this argument is addressing exactly the remainder issue by leveraging the chain rule. However, there is one fatal flaw: given $v$ perpendicular to $\nabla g$, how do we know that such a $\gamma$ exists? Constructing this seems somewhat difficult a priori. In fact, we haven’t used the assumptions that $g$ is $C^1$ and $\nabla g$ is nonvanishing on this level set; in the absence of these assumptions, this claim is actually false.
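In simple situations one can write such a curve down by hand. For instance, if $g(x, y) = x^2 + y^2 - 1$ and $x_* = (1, 0)$, then $\nabla g \left( x_* \right) = (2, 0)$, and the vectors perpendicular to it are exactly those of the form $v = (0, s)$; the curve $\gamma (t) = \left( \cos (st), \sin (st) \right)$ stays on the constraint set, has $\gamma (0) = x_*$, and has $\gamma ^\prime (0) = (0, s) = v$. In general, though, we need a theorem that manufactures $\gamma$ for us, and this is exactly where the implicit function theorem enters.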

Let’s now give an actual proof that uses the implicit function theorem to bolster our linear-algebraic heuristics.

Given $x_*$, since $\nabla g \left( x_* \right)$ is nonzero, it’s surjective as a linear map $\mathbb{R}^n\to \mathbb{R}$. Thus, by the implicit function theorem, there exists $U$ an open neighbourhood of $x_*$, $U^\prime$ an open neighbourhood of $0$, and a $C^1$ change of coordinates $\Phi : U^\prime \to U$; $V$ an open neighbourhood of $0$, $V^\prime$ another open neighbourhood of $0$, and $\Psi : V\to V^\prime$ a $C^1$ change of coordinates such that $$g \circ \Phi \left( x_1,\ldots, x_n \right) = \Psi\left(x_1\right).$$ We may assume without loss of generality that $\Phi \left( 0 \right) = x_*$ and $\Psi(0)=0$. By the chain rule, this implies that $$\nabla g \left( x_* \right) \cdot \left. D \Phi \right\rvert_{0} = \Psi^\prime(0)e_1.$$

Note that $\nabla g$ is a row vector and $\left. D \Phi \right\rvert_{0}$ is an $n\times n$ matrix; this $e_1$ is also a row vector.
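To see what this change of coordinates can look like, consider again $g(x, y) = x^2 + y^2 - 1$ with $x_* = (1, 0)$. One valid choice (among many) is the polar-type map $\Phi \left( x_1, x_2 \right) = \left( (1 + x_1)\cos x_2, (1 + x_1)\sin x_2 \right)$ on a small neighbourhood $U^\prime$ of $0$, together with $\Psi \left( x_1 \right) = (1 + x_1)^2 - 1$. Then $\Phi(0) = x_*$, $\Psi(0) = 0$, and $$g\circ \Phi \left( x_1, x_2 \right) = (1 + x_1)^2 - 1 = \Psi\left( x_1 \right),$$ while $\left. D \Phi \right\rvert_{0}$ is the identity matrix, so the chain rule identity above reads $(2, 0)\cdot I = \Psi^\prime(0) e_1 = 2 e_1$, as expected.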

Now let’s consider $f\circ \Phi : U^\prime \to \mathbb{R}$. Rewriting our constrained optimisation problem in these new coordinates, we know that $$f\circ \Phi (0) = \max \left\lbrace f\circ \Phi \left( 0, x_2, \ldots, x_n \right) : \left( 0, x_2,\ldots, x_n \right) \in U^\prime \right\rbrace,$$ since $g \circ \Phi \left( x_1,\ldots, x_n \right) = 0$ if and only if $x_1 = 0$ on $U^\prime$. Thus, we have that $\partial _{x_i} \left( f\circ \Phi \right)(0) = 0$ for all $i = 2,\ldots, n$: $x_i = 0$ is a local maximum when holding all the other variables fixed at $0$! But we also know that $$\nabla \left( f \circ \Phi \right)(0) = \nabla f\left( x_* \right) \left. D \Phi \right\rvert_{0} = \begin{bmatrix} \partial _{x_1} \left( f\circ \Phi \right) & \partial _{x_2} \left( f \circ \Phi \right) & \cdots & \partial _{x_n} \left( f\circ \Phi \right) \end{bmatrix}.$$ Putting these together, we get that $\nabla f \left( x_* \right) \left. D \Phi \right\rvert_{0} = c e_1 = \lambda \nabla g \left( x_* \right) \left. D \Phi \right\rvert_{0}$ for some $c, \lambda \in \mathbb{R}$. It follows by the invertibility of $\left. D \Phi \right\rvert_{0}$ that $\nabla f\left( x_* \right) = \lambda \nabla g\left( x_* \right)$. $\square$
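Continuing the example above: to maximise $f(x, y) = x$ over the unit circle (the maximiser is $x_* = (1, 0)$), note that $f\circ \Phi \left( x_1, x_2 \right) = (1 + x_1)\cos x_2$, whose partial derivatives at $0$ are $\partial_{x_1}\left( f\circ \Phi \right)(0) = 1$ and $\partial_{x_2}\left( f\circ \Phi \right)(0) = 0$. Thus $\nabla\left( f\circ \Phi \right)(0) = e_1$, exactly of the form $c e_1$ predicted by the proof, and since $\nabla g\left( x_* \right) = (2, 0)$ we can read off $\lambda = \frac{1}{2}$: indeed, $\nabla f\left( x_* \right) = (1, 0) = \frac{1}{2}\nabla g\left( x_* \right)$.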

To reiterate, the point of this example is to demonstrate that linear-algebraic methods and heuristics do indeed carry over to $C^1$ functions (with mild assumptions on the derivatives). However, one needs to be careful about how one rephrases the linear-algebraic arguments, and most often the implicit function theorem or the inverse function theorem will be the way to present this link.