Hunter Liu's Website

11. Week 10: The Inverse and Implicit Function Theorems

If you’ve taken linear algebra before, you should be familiar with the “change of basis” operation. One can graphically interpret this operation as rotating and scaling the coordinate axes of a vector space; in two dimensions, the square coordinate grid then transforms into a skew parallelogram coordinate grid. In a sense, these are transforming the perspective with which one sees space.

Some great theorems of linear algebra say that given a linear transformation, there is a “canonical perspective” to take: rational canonical form, Jordan canonical form, and singular value decomposition come to mind, but algorithms such as Gram-Schmidt and Gaussian elimination also fit into this idea.

Such canonical forms are useful because they encapsulate qualitative data about the algebraic behaviour of a linear transformation. For instance, the Jordan canonical form of a matrix describes both the set of eigenvalues of the transformation and the generalised eigenspaces attached to them. The output of Gaussian elimination describes the dimension of the image (and also the null space) of the linear transformation.

In each of these cases, given a matrix \(A\in \mathbb{R} ^{m\times n}\), representing a linear transformation \(\mathbb{R}^n\to \mathbb{R}^m\) in the standard bases, one can find linear isomorphisms \(\Phi : \mathbb{R}^n\to \mathbb{R}^n\) and \(\Psi : \mathbb{R}^m\to \mathbb{R}^m\) (identified with their matrices in the standard bases) such that the corresponding matrix \[\Psi ^{-1} A \Phi \] is “nice” (diagonal, in Jordan canonical form, in reduced row echelon form, etc.). This is the idea of a “change of coordinates”.
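For a small concrete example, take the rank-one matrix \(A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}\). One possible choice of isomorphisms is \(\Phi = \begin{bmatrix} 1 & -2 \\ 0 & 1 \end{bmatrix}\) and \(\Psi = \begin{bmatrix} 1 & 0 \\ 2 & 1 \end{bmatrix}\), for which \[\Psi ^{-1} A \Phi = \begin{bmatrix} 1 & 0 \\ -2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} 1 & -2 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}.\] In the new coordinates, \(A\) is just projection onto the first coordinate, and the canonical form records exactly that \(A\) has rank one.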

One might wonder, is there a way to meaningfully adapt the idea of the linear change of coordinates so that \(C^1\) maps \(\mathbb{R}^n\to \mathbb{R}^m\) have some sort of “canonical form”? What would even constitute a “change of coordinates”?

This is the principle of the implicit and inverse function theorems (and more generally, the rank theorem).

The inverse function theorem says that if \(F: \mathbb{R}^n\to \mathbb{R}^n\) is \(C^1\) and has invertible derivative \(\left. DF\right\rvert_{x_0}\) for some \(x_0\), then there are open neighbourhoods \(U\) of \(x_0\) and \(V\) of \(F\left( x_0 \right)\) such that \(F : U \to V\) is a bijection with \(C^1\) inverse. If \(F\) were a linear function, linear algebra would tell us that \(F\) is a change of basis itself.
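This locality is essential. For instance, take \(F\left( r, \theta \right) = \left( r\cos\theta, r\sin\theta \right)\), whose derivative \[\left. DF \right\rvert_{\left( r, \theta \right)} = \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix}\] has determinant \(r\). At any point with \(r \neq 0\), the theorem produces a local \(C^1\) inverse (the familiar polar coordinates), yet \(F\left( r, \theta + 2\pi \right) = F\left( r, \theta \right)\), so \(F\) has no global inverse.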

The implicit function theorem, on the other hand, says that if \(F: \mathbb{R}^n\to \mathbb{R}^m\) is \(C^1\) and \(\left. DF\right\rvert_{x_0}\) is surjective for some \(x_0\), then there exist open neighbourhoods \(U\) of \(x_0\) and \(V\) of \(F\left( x_0 \right)\), open sets \(U^\prime \subseteq \mathbb{R}^n\) and \(V^\prime \subseteq \mathbb{R}^m\), and \(C^1\) changes of coordinates (i.e. \(C^1\) bijections with \(C^1\) inverses) \(\Phi : U^\prime \to U\) and \(\Psi : V^\prime \to V\) such that \[\Psi ^{-1} \circ F \circ \Phi \left( x_1,\ldots, x_n \right) = \left( x_1,\ldots, x_m \right) \] for all \(\left( x_1,\ldots, x_n \right)\in U^\prime\).

Note the similarity between this expression and that of a linear change of basis! This says that the “canonical form” of such an \(F\) is just projection onto the first \(m\) coordinates, up to a change of variables. When \(F\) is a linear function, one can replace \(\Psi \) and \(\Phi \) with linear isomorphisms of the domain and codomain and arrive at exactly the same conclusion.
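To see this in a concrete case, take \(n = 2\), \(m = 1\), and \(F(x, y) = x^2 + y^2 - 1\) near \(x_0 = (1, 0)\), where \(\left. DF \right\rvert_{(1,0)} = \begin{bmatrix} 2 & 0 \end{bmatrix}\) is surjective. One possible choice of coordinates is \(\Phi (u, v) = \left( \sqrt{1+u} \cos v, \sqrt{1+u} \sin v \right)\) for \((u, v)\) near \((0, 0)\), together with \(\Psi \) the identity; then \(\Phi (0, 0) = (1, 0)\) and \[\Psi ^{-1} \circ F \circ \Phi (u, v) = (1 + u) - 1 = u,\] which is exactly projection onto the first coordinate.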

I’m writing things out in this somewhat funny and redundant notation/terminology because it makes the correspondence between linear algebra and the calculus of \(C^1\) functions very explicit. Almost anything you can do linear-algebraically with their derivatives, you can mimic with these \(C^1\) functions themselves.

I should remark that the above version of the implicit function theorem appears rather different from the version Rudin presents, which I’ll repeat here:

Theorem 1. Implicit Function Theorem

If \(f : \mathbb{R}^{n}_x \times \mathbb{R}^{m}_y\to \mathbb{R}^m\) is \(C^1\) and \(\left. D_xf\right\rvert_{(a, b)}\) is surjective for some \((a, b)\in \mathbb{R} ^{n+m}\), then there exist an open neighbourhood \(U\) of \((a, b)\), an open neighbourhood \(W\) of \(b\), and a \(C^1\) function \(g : W\to \mathbb{R}^n\) with \(g(b) = a\), such that \[f \left( g(y), y \right) = f(a, b)\] for all \(y\in W\).

These two versions are equivalent! In fact, in the proof of the implicit function theorem, one must first appeal to the inverse function theorem on \(\mathbb{R} ^{n+ m}\) before solving for this function \(g\). The output of the inverse function theorem is exactly the \(\Phi \) described above (or perhaps \(\Phi ^{-1}\)).
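As a quick sanity check of this formulation, take \(n = m = 1\), \(f(x, y) = x^2 + y^2\), and \((a, b) = (3, 4)\), so that \(f(a, b) = 25\) and \(\left. D_x f \right\rvert_{(3, 4)} = 6 \neq 0\). The theorem produces (on a suitable neighbourhood \(W\) of \(4\)) the function \[g(y) = \sqrt{25 - y^2},\] which satisfies \(g(4) = 3\) and \(f\left( g(y), y \right) = 25 = f(a, b)\) for all \(y \in W\): near \((3, 4)\), the level set \(\left\lbrace f = 25 \right\rbrace\) is the graph of \(g\).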

Let us provide an example of this principle:

Lagrange Multipliers

Suppose \(f, g : \mathbb{R}^n\to \mathbb{R}\) are both \(C^1\) functions, and suppose I want to find \(x_*\in \mathbb{R}^n\) such that \[f\left( x_* \right) = \max \left\lbrace f(x) : g(x) = 0 \right\rbrace. \] In words, we are maximising the function \(f\) subject to the constraint that \(g(x) = 0\).

We’ll assume that \(\nabla g \neq 0\) on this level set of \(g\).

As you may know, the Lagrange multiplier theorem states that at the point \(x_*\), one has \(\nabla f \left( x_* \right) = \lambda \nabla g \left( x_* \right)\) for some constant \(\lambda \in \mathbb{R}\). Thus, to solve the constrained optimisation problem, one looks for points in \(g ^{-1} (0)\) for which \(\nabla f\) and \(\nabla g\) are parallel in addition to the constraint \(g(x) = 0\). This usually results in only finitely many points to check.
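As a quick worked example that we’ll keep returning to, take \(n = 2\), \(f(x, y) = x + y\), and \(g(x, y) = x^2 + y^2 - 1\). The condition \(\nabla f = \lambda \nabla g\) reads \[(1, 1) = \lambda \left( 2x, 2y \right),\] so \(x = y = \frac{1}{2\lambda }\), and the constraint \(x^2 + y^2 = 1\) then forces \(x = y = \pm \frac{1}{\sqrt 2}\). Comparing the two candidates, the maximum is attained at \(x_* = \left( \frac{1}{\sqrt 2}, \frac{1}{\sqrt 2} \right)\), where \(f\left( x_* \right) = \sqrt 2\) and \(\lambda = \frac{1}{\sqrt 2}\).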

There is a geometric heuristic behind this principle: one has the approximation \[f\left( x_* + h \right) \approx f \left( x_* \right) + \nabla f\left( x_* \right)\cdot h,\] and likewise for \(g\). If \(h\) is perpendicular to \(\nabla g\), then one has \[g \left( x_* \pm h\right) \approx g \left( x_* \right) \pm \nabla g \left( x_* \right)\cdot h = 0.\] Thus, knowing that \[f \left( x_* \right) \geq f\left( x_* \pm h \right) \approx f \left( x_* \right) \pm \nabla f \left( x_* \right) \cdot h,\] one must have \(\nabla f \left( x_* \right)\cdot h = 0\), i.e. \(\nabla f\) is also perpendicular to \(h\). Hence \(\nabla f\) and \(\nabla g\) are perpendicular to the same vectors and they must be parallel.

However, this geometric heuristic is difficult to make precise. This \(\approx\) symbol must be handled delicately, and one in general cannot expect \(g \left( x_* + h \right) = 0\) exactly. Thus either \(x_*\pm h\) doesn’t stay within the constraint, or the perpendicularity against \(\nabla g\) isn’t exact, and neither of these is acceptable. Ultimately, we’re trying to apply a linear-algebraic argument in a \(C^1\) setting. Why should we expect this to work?

Here’s a more robust version of this argument: given any vector \(v\in \mathbb{R}^n\) such that \(\nabla g \left( x_* \right) \cdot v = 0\), there exists a \(C^1\) curve \(\gamma : (-1, 1)\to \mathbb{R}^n\) such that \(g \left( \gamma (t) \right) = 0\) for all \(t\), \(\gamma (0) = x_*\), and \(\gamma ^ \prime(0) = v\). Then, \(f\circ \gamma (t)\) has a local maximum at \(t = 0\), hence \[\left. \frac{d}{dt} f\circ \gamma (t)\right\rvert_{t=0} = \nabla f \left( x_* \right) \cdot \gamma ^\prime (0) = \nabla f \left( x_* \right) \cdot v = 0,\] and we conclude that \(\nabla f\) and \(\nabla g\) are parallel by the same argument as before.

This curve-based argument addresses the problem of precision: we have no more of those dreaded \(\approx\)’s floating around. In fact, it addresses exactly the remainder issue by leveraging the chain rule. However, there is one fatal flaw: given \(v\) perpendicular to \(\nabla g\), how do we know that such a \(\gamma \) exists? Constructing one seems somewhat difficult a priori. In fact, we haven’t used the assumptions that \(g\) is \(C^1\) and \(\nabla g\) is nonvanishing on this level set; in the absence of these assumptions, the claim is actually false.
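For the circle constraint from before, such a curve can be written down by hand: at \(x_* = \left( \frac{1}{\sqrt 2}, \frac{1}{\sqrt 2} \right)\) one has \(\nabla g \left( x_* \right) = \left( \sqrt 2, \sqrt 2 \right)\), the vector \(v = (1, -1)\) is perpendicular to it, and \[\gamma (t) = \left( \cos \left( \tfrac{\pi}{4} - \sqrt 2\, t \right), \sin \left( \tfrac{\pi}{4} - \sqrt 2\, t \right) \right)\] satisfies \(g \circ \gamma \equiv 0\), \(\gamma (0) = x_*\), and \(\gamma ^\prime (0) = (1, -1) = v\). But this relied on knowing \(g ^{-1}(0)\) explicitly as a circle; for a general constraint there is no such formula.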

Let’s now give an actual proof that uses the implicit function theorem to bolster our linear-algebraic heuristics.

Given \(x_*\), since \(\nabla g \left( x_* \right)\) is nonzero, it’s surjective as a linear map \(\mathbb{R}^n\to \mathbb{R}\). Thus, by the implicit function theorem, there exists \(U\) an open neighbourhood of \(x_*\), \(U^\prime\) an open neighbourhood of \(0\), and a \(C^1\) change of coordinates \(\Phi : U^\prime \to U\); \(V\) an open neighbourhood of \(0\), \(V^\prime\) another open neighbourhood of \(0\), and \(\Psi : V\to V^\prime\) a \(C^1\) change of coordinates such that \[g \circ \Phi \left( x_1,\ldots, x_n \right) = \Psi\left(x_1\right).\] We may assume without loss of generality that \(\Phi \left( 0 \right) = x_*\) and \(\Psi(0)=0\). By the chain rule, this implies that \[\nabla g \left( x_* \right) \cdot \left. D \Phi \right\rvert_{0} = \Psi^\prime(0)e_1.\]

Note that \(\nabla g\) is a row vector and \(\left. D \Phi \right\rvert_{0} \) is an \(n\times n\) matrix; this \(e_1\) is also a row vector.
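In the circle example, one valid choice is \(\Phi \left( x_1, x_2 \right) = \left( \sqrt{1 + x_1} \cos \left( \tfrac{\pi}{4} + x_2 \right), \sqrt{1 + x_1} \sin \left( \tfrac{\pi}{4} + x_2 \right) \right)\) together with \(\Psi \left( x_1 \right) = x_1\): then \(g \circ \Phi \left( x_1, x_2 \right) = \left( 1 + x_1 \right) - 1 = x_1\), \(\Phi (0) = x_*\), and a direct computation gives \(\nabla g \left( x_* \right) \left. D \Phi \right\rvert_{0} = \begin{bmatrix} 1 & 0 \end{bmatrix} = \Psi ^\prime (0) e_1\).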

Now let’s consider \(f\circ \Phi : U^\prime \to \mathbb{R}\). Rewriting our constrained optimisation problem in these new coordinates, we know that \[f\circ \Phi (0) = \max \left\lbrace f\circ \Phi \left( 0, x_2, \ldots, x_n \right) : \left( 0, x_2,\ldots, x_n \right) \in U^\prime \right\rbrace,\] since \(g \circ \Phi \left( x_1,\ldots, x_n \right) = 0\) if and only if \(x_1 = 0\) on \(U^\prime\). Thus, we have that \(\partial _{x_i} \left( f\circ \Phi \right)(0) = 0\) for all \(i = 2,\ldots, n\): the function has a local maximum at \(x_i = 0\) when all the other variables are held fixed at \(0\)! But we also know that \[\nabla \left( f \circ \Phi \right)(0) = \nabla f\left( x_* \right) \left. D \Phi \right\rvert_{0} = \begin{bmatrix} \partial _{x_1} \left( f\circ \Phi \right)(0) & \partial _{x_2} \left( f \circ \Phi \right)(0) & \cdots & \partial _{x_n} \left( f\circ \Phi \right)(0) \end{bmatrix}.\] Putting these together, we get that \[\nabla f \left( x_* \right) \left. D \Phi \right\rvert_{0} = c e_1 = \lambda \nabla g \left( x_* \right) \left. D \Phi \right\rvert_{0}\] for some \(c, \lambda \in \mathbb{R}\); here \(\lambda = c / \Psi ^\prime (0)\) makes sense because \(\Psi ^\prime (0) \neq 0\), as \(\Psi \) is a \(C^1\) change of coordinates. It follows by the invertibility of \(\left. D \Phi \right\rvert_{0}\) that \(\nabla f \left( x_* \right) = \lambda \nabla g \left( x_* \right)\). \(\square \)
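Carrying the circle example through this computation: with \(\Phi \) as above, \[f \circ \Phi \left( x_1, x_2 \right) = \sqrt{1 + x_1} \left( \cos \left( \tfrac{\pi}{4} + x_2 \right) + \sin \left( \tfrac{\pi}{4} + x_2 \right) \right) = \sqrt 2 \sqrt{1 + x_1} \cos x_2,\] so \(\partial _{x_2} \left( f \circ \Phi \right)(0) = 0\) and \(\nabla \left( f \circ \Phi \right)(0) = \begin{bmatrix} \frac{1}{\sqrt 2} & 0 \end{bmatrix}\). Since \(\nabla g \left( x_* \right) \left. D \Phi \right\rvert_{0} = e_1\), we recover \(\lambda = \frac{1}{\sqrt 2}\), matching the direct computation \(\nabla f \left( x_* \right) = (1, 1) = \frac{1}{\sqrt 2} \left( \sqrt 2, \sqrt 2 \right) = \frac{1}{\sqrt 2} \nabla g \left( x_* \right)\).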

To reiterate, the point of this example is to demonstrate that linear-algebraic methods and heuristics do indeed carry over to \(C^1\) functions (with mild assumptions on the derivatives). However, one needs to be careful about how one rephrases the linear-algebraic arguments, and most often the implicit function theorem or the inverse function theorem will be the way to present this link.