8 Quadratic forms and classification of stationary points

Let’s summarise what we did in Chapter 6. We saw there that the Taylor series around a stationary point (so the degree 1 terms vanish) looks like \[f(a+h,b+k)=f(a,b)+\frac{1}{2}f_{xx}(a,b)h^2+f_{xy}(a,b)hk+\frac{1}{2}f_{yy}(ab)k^2+\cdots.\] In order to see the behaviour near a stationary point, we must look at these degree 2 terms; if they always give a positive contribution for any \(h\) and \(k\), then we are at a minimum, since any small movement away from \((a,b)\) will then increase the value of the function. Similarly, if the degree 2 terms give a negative contribution for any \((h,k)\ne(0,0)\), then we are at a maximum, and otherwise we are at neither (a saddle point, for example). Near the stationary point, these terms describe how the function behaves, since higher order terms are (comparatively) negligible. The second order terms describe quadratic curves, and we’ve seen that they are basically ellipses or hyperbolas (the cases of circles and parabolas are special cases). When the level curves are ellipses, we expect to be at a maximum or minimum, and when the level curves are hyperbolas, we expect to be at a saddle point.

8.1 Quadratic forms

Suppose we have a function \(f(x,y)\) of two variables, which has a stationary point at \((x,y)=(a,b)\). That is, at \((a,b)\), \[\frac{\partial f}{\partial x}=\frac{\partial f}{\partial y}=0.\] In this case, the Taylor expansion of \(f\) about \((a,b)\) gives \[f(a+h,b+k)\approx f(a,b)+\frac{1}{2!}\frac{\partial^2f}{\partial x^2}h^2+\frac{1}{2!}\frac{\partial^2f}{\partial y^2}k^2+\frac{\partial^2f}{\partial x\,\partial y}hk.\] If \((a,b)\) is a maximum, we expect that any move away from \((a,b)\) will cause a decrease in the value of the function. Similarly, if \((a,b)\) is a mimimum, we expect that any move away from \((a,b)\) will cause an increase in the value of the function. Any situation where a move away from \((a,b)\) has different effects will be neither a maximum nor a minimum.

But we can see how the function changes as we move away from \((a,b)\) – if we move to a nearby point \((a+h,b+k)\), the function changes by \[\frac{1}{2!}\frac{\partial^2f}{\partial x^2}h^2+\frac{1}{2!}\frac{\partial^2f}{\partial y^2}k^2+\frac{\partial^2f}{\partial x\,\partial y}hk.\] Let’s put \[Q(h,k)=\frac{\partial^2f}{\partial x^2}h^2+\frac{\partial^2f}{\partial y^2}k^2+2\frac{\partial^2f}{\partial x\,\partial y}hk\] so that \(f(a+h,b+k)\approx f(a,b)+\frac{1}{2}Q(h,k)\). If \(Q(h,k)<0\) for all choices of \(h\) and \(k\) (except \((h,k)=(0,0)\)), then the function will have a maximum. If it is always positive, then the function will have a minimum. If it is sometimes positive and sometimes negative, the function will have a saddle point. (There are other situations, when the function is not always negative, but is never positive, because there are choices of \(h\) and \(k\) so that the change is \(0\), but we shall omit consideration of this case.)

Let’s write \[p=\frac{\partial^2f}{\partial x^2},\qquad q=\frac{\partial^2f}{\partial x\,\partial y},\qquad r=\frac{\partial^2f}{\partial y^2}.\] We then see that \[f(a+h,b+k)=f(a,b)+\frac{1}{2}\left(ph^2+2qhk+rk^2\right)\] and we need to think about the change \(ph^2+2qhk+rk^2\).

Definition 8.1 A quadratic form in two variables \(x\) and \(y\) is a homogeneous second degree expression of the form \[px^2+2qxy+ry^2.\]

Exercise 8.1 Write \(\mathbf{v}=\begin{pmatrix}x\\y\end{pmatrix}\), and let \(A\) denote the matrix \(\begin{pmatrix}p&q\\q&r\end{pmatrix}\). Show that \[px^2+2qxy+ry^2=\mathbf{v}^TA\mathbf{v}.\]

More generally, if we have \(n\) variables \(x_1,\ldots,x_n\), and a symmetric \(n\times n\) matrix \(A\), a quadratic form is an expression \(\mathbf{v}^TA\mathbf{v}\), where \(\mathbf{v}=\begin{pmatrix}x_1\\\vdots\\x_n\end{pmatrix}\). Conversely, given a homogeneous degree two expression \[\sum_{i\le j}a_{ij}x_ix_j,\] we can find a real symmetric matrix \(A\) so that this is equal to \(\mathbf{v}^TA\mathbf{v}\).

Example 8.1 Find the quadratic form associated to the symmetric matrix \(\begin{pmatrix}1&2&3\\2&5&6\\3&6&9\end{pmatrix}\).

For this, we put \(\mathbf{v}=\begin{pmatrix}x\\y\\z\end{pmatrix}\). We work out \[\begin{eqnarray*} \mathbf{v}^TA\mathbf{v}&=&\begin{pmatrix}x~y~z\end{pmatrix}\begin{pmatrix}1&2&3\\2&5&6\\3&6&9\end{pmatrix}\begin{pmatrix}x\\y\\z\end{pmatrix}\\ &=&\begin{pmatrix}x~y~z\end{pmatrix}\begin{pmatrix}x+2y+3z\\2x+5y+6z\\3x+6y+9z\end{pmatrix}\\ &=&x(x+2y+3z)+y(2x+5y+6z)+z(3x+6y+9z)\\ &=&x^2+4xy+6xz+5y^2+12z+9z^2. \end{eqnarray*}\]

Notice that the coefficients of \(x^2\), \(y^2\) and \(z^2\) are the diagonal entries of the matrix, and the coefficients of \(xy\), \(xz\) and \(yz\) are twice the off-diagonal entries of the matrix. Think about why this is!

More generally, given an \(n\times n\) symmetric matrix \(A\), and \(\mathbf{v}=\begin{pmatrix}x_1\\\vdots\\x_n\end{pmatrix}\), then \[\mathbf{v}^TA\mathbf{v}=\sum_{i=1}^n\sum_{j=1}^na_{ij}x_ix_j=\sum_{i=1}^na_{ii}x_i^2+\sum_{i<j}2a_{ij}x_ix_j.\]

Example 8.2 Find the symmetric matrix associated to the quadratic form \[x_1^2+4x_2^2-3x_3^2+2x_1x_2-4x_1x_3+3x_2x_3.\] Using the above, we have \[A=\begin{pmatrix}1&1&-2\\1&4&3/2\\-2&3/2&-3\end{pmatrix}.\]

From the discussion above, let’s make the following definitions:

Definition 8.2 We say that a quadratic form \(Q(x,y)\) is

positive definite if \(Q(x,y)>0\) for all \((x,y)\ne(0,0)\);
negative definite if \(Q(x,y)<0\) for all \((x,y)\ne(0,0)\);
indefinite if \(Q(x,y)\) takes at least one positive value and at least one negative value.

As simple examples, we have (1) \(Q(x,y)=x^2+y^2\), (2) \(Q(x,y)=-x^2-y^2\), and (3) \(Q(x,y)=x^2-y^2\).

For our application to classification of stationary points, we will replace \(x\) and \(y\) with \(h\) and \(k\) (indicating that these are small changes). For the quadratic form at a stationary point, note that we have \[Q(h,k)=\begin{pmatrix}h&k\end{pmatrix}\begin{pmatrix}f_{xx}&f_{xy}\\f_{xy}&f_{yy}\end{pmatrix}\begin{pmatrix}h\\k\end{pmatrix}.\] The matrix \(\begin{pmatrix}f_{xx}&f_{xy}\\f_{xy}&f_{yy}\end{pmatrix}\) is the Hessian matrix for \(f\). At any point \((a,b)\), we can compute the actual values of \(f_{xx}(a,b)\) etc., and form the expression \(Q(h,k)\). We need to see whether it always takes positive values for nonzero \((h,k)\) (for a minimum), always negative values (for a maximum) or something else.

Notice that the Hessian is a real symmetric matrix. Such matrices have rather beautiful properties which we will consider below.

8.2 Real symmetric matrices

Real symmetric matrices have some particularly nice properties. They also occur often (we’ll see examples arising from the stationary points of surfaces, but you will also see examples in statistics, since they occur naturally as multivariate variance matrices).

The following fact is rather useful, and we state it without proof:

If \(A\) is a real symmetric \(n\times n\) matrix, then there are always \(n\) linearly independent eigenvectors, even if some of the eigenvalues are the same.

In particular, we can apply Theorem 5.1 to see that real symmetric matrices are diagonalisable.

But there are other nice properties of real symmetric matrices.

Proposition 8.1 Suppose that \(A\) is a real symmetric matrix, i.e., \(A=A^T\). Then the eigenvalues of \(A\) are real.

Proof. Indeed, if \(Ax=\lambda x\), then \(\overline{x}^TAx=\lambda\overline{x}^Tx\); applying transposition, and taking complex conjugation implies that also \(\overline{x}^TAx=\overline{\lambda}\overline{x}^Tx\). As \(x\ne0\), we get \(\lambda=\overline{\lambda}\).

Recall that two vectors in \(\mathbb{R}^3\) are orthogonal if their dot product is \(0\). Notice that if \(x=\begin{pmatrix}x_1\\\vdots\\x_n\end{pmatrix}\) and \(y=\begin{pmatrix}y_1\\\vdots\\y_n\end{pmatrix}\), then this dot product is easily written in any number of dimensions as \[x^Ty=x_1y_1+\cdots+x_ny_n.\] We say that two vectors in \(\mathbb{R}^n\) are orthogonal if \(x^Ty=0\). The last main result we need about real symmetric matrices is the following:

Proposition 8.2 Suppose that \(A\) is a real symmetric matrix. If \(\lambda\) and \(\mu\) are distinct eigenvalues and \(x\) and \(y\) are corresponding eigenvectors, then \(x\) and \(y\) are orthogonal.

Proof. Indeed, we have \(Ax=\lambda x\) and \(Ay=\mu y\). So \(y^TAx=\lambda y^Tx\) and \(x^TAy=\mu x^Ty\). Taking transposes of the latter gives \(y^TAx=\mu y^Tx\). As \(\lambda\ne\mu\), we get \(y^Tx=0\).

Let’s summarise the main properties of real symmetric matrices:

If \(A\) denotes a real symmetric \(n\times n\) matrix, then \(A\) has \(n\) linearly independent eigenvectors (even when some eigenvalues are the same).
All the eigenvalues of \(A\) are real.
Eigenvectors corresponding to distinct eigenvalues are orthogonal.

If we have a repeated eigenvalue, we can choose linearly independent eigenvectors with that eigenvalue to be orthogonal. So in particular:

If \(A\) is a real symmetric matrix, then it is diagonalisable, and we can choose all the eigenvectors to be orthogonal to each other.

Example 8.3 If \(A=\begin{pmatrix}2&-1\\-1&2\end{pmatrix}\), the eigenvectors are \(\begin{pmatrix}1\\1\end{pmatrix}\), with eigenvalue \(1\), and \(\begin{pmatrix}1\\-1\end{pmatrix}\), with eigenvalue \(3\), and these two vectors are clearly orthogonal.

Example 8.4 The matrix \(A=\begin{pmatrix} 3&2&2\\ 2&2&0\\ 2&0&4 \end{pmatrix}\) has eigenvectors \(\mathbf{v}_1=\begin{pmatrix}2\\1\\2\end{pmatrix}\) (with \(\lambda_1=6\)), \(\mathbf{v}_2=\begin{pmatrix}1\\2\\-2\end{pmatrix}\) (with \(\lambda_2=3\)) and \(\mathbf{v}_3=\begin{pmatrix}-2\\2\\1\end{pmatrix}\) (with \(\lambda_3=0\)). One easily checks that \(\mathbf{v}_1^T\mathbf{v}_2=0\), and similarly for any other pair of eigenvectors.

8.3 Orthogonal matrices

Definition 8.3 If \(M\) is a real matrix such that \(M^TM=I\), then \(M\) said to be an orthogonal matrix.

Let’s think about why this is. If the columns of \(M\) are \(\mathbf{v}_1\),…, \(\mathbf{v}_n\), then the rows of \(M^T\) are just \(\mathbf{v}^T_1\),…, \(\mathbf{v}_n^T\). Then the \(ij\)th entry of the product \(M^TM\) is got by taking the \(i\)th row of \(M^T\) and the \(j\)th column, and taking their “dot” product, and this means that the \(ij\)th entry is just \(\mathbf{v}_i^T\mathbf{v}_j\). If \(M^TM=I\), then the off-diagonal elements are all \(0\), meaning that the columns \(\mathbf{v}_i\) and \(\mathbf{v}_j\) are all orthogonal if \(i\ne j\); also, it means that \(\mathbf{v}_i^T\mathbf{v}_i=1\) for all \(i\), which is equivalent to each \(\mathbf{v}_i\) having length \(1\).

Orthogonal matrices have another interesting property. Recall that earlier we thought of matrices as geometrical transformations from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). In the case of a square \(n\times n\)-matrix, we get a map \(\mathbb{R}^n\longrightarrow\mathbb{R}^n\).

Above, we generalised the dot product to \(\mathbb{R}^n\) by defining \(\mathbf{x}.\mathbf{y}=\mathbf{x}^T\mathbf{y}\). As in the 2- and 3-dimensional case, we can use this to calculate the angle between vectors.

If \(M\) is an orthogonal matrix, then it has the property of preserving the dot product, and therefore of preserving angles: \[(M\mathbf{x})^T(M\mathbf{y})=\mathbf{x}^TM^TM\mathbf{y}=\mathbf{x}^T\mathbf{y}.\] In particular, if \(\mathbf{y}=\mathbf{x}\), we see that \(\|M\mathbf{x}\|^2=\|\mathbf{x}\|^2\), so multiplication by \(M\) does not change the length of the vector; angles between vectors are conserved, etc. Examples are reflections and rotations, and combinations of these (and this is all!).

We mentioned above that orthogonal matrices have the property that their columns are orthogonal, and have length \(1\). We saw in the previous section that eigenvectors of real symmetric matrices can be chosen to have this property! In particular, if we choose \(n\) orthogonal eigenvectors of a real symmetric \(n\times n\) matrix \(A\), and scale them to have length \(1\), we can make them into the columns of an orthogonal matrix \(M\). Recall that this is exactly the matrix which diagonalises \(A\).

Example 8.5 Put \(M=\begin{pmatrix} \frac{1}{\sqrt{2}}&-\frac{1}{\sqrt{2}}\\ \frac{1}{\sqrt{2}}&\frac{1}{\sqrt{2}} \end{pmatrix}\). Then it is easy to calculate that \(M^TM=I\).

Note that orthogonal matrices have the property that \(M^{-1}=M^T\).

Let’s return to the case where \(A\) is a real symmetric \(n\times n\) matrix, and let \(M\) be the matrix whose columns are the normalised eigenvectors \(\mathbf{v}_1,\ldots,\mathbf{v}_n\). We have already explained that \(M\) is an orthogonal matrix. But we also know that \(M\) has the property that \(M^{-1}AM\) is the diagonal matrix of eigenvalues. So, in summary:

Theorem 8.1 Suppose that \(A\) is a square matrix which is real and symmetric. Then there are \(n\) linearly independent orthogonal eigenvectors of length \(1\); if \(M\) denotes the matrix with these vectors as columns, then \(M\) is an orthogonal matrix (i.e., \(M^{-1}=M^T\)) and \(M^{-1}AM=M^TAM\) is diagonal with the eigenvalues down the leading diagonal.

Thus real symmetric matrices are diagonalisable in a new sense:

If \(M\) is the matrix whose columns are the normalised eigenvectors of a real symmetric matrix \(A\), then \(M^TAM\) is also diagonal with eigenvalues down the diagonal.

Exercise 8.2 Again, let \(A=\begin{pmatrix}2&-1\\-1&2\end{pmatrix}\). We calculated the eigenvectors in Example 8.3, and so we put \(M=\begin{pmatrix}1/\sqrt{2}&1/\sqrt{2}\\1/\sqrt{2}&-1/\sqrt{2}\end{pmatrix}\), so that the columns of \(M\) are scaled versions of the eigenvectors.

Check that \(M^TAM=\begin{pmatrix}1&0\\0&3\end{pmatrix}\).

Exercise 8.3 Verify this for the matrix \(\begin{pmatrix}2&1\\1&2\end{pmatrix}\).

Exercise 8.4 Verify this for the matrix \(\begin{pmatrix}-3&-3&-3\\-3&1&-1\\-3&-1&1\end{pmatrix}\).

Exercise 8.5 The \(2\times2\) symmetric matrix \(A\) has eigenvalues \(2\) and \(-1\). The eigenvector corresponding to the eigenvalue \(2\) is \(\begin{pmatrix}1\\1\end{pmatrix}\). Find an eigenvector corresponding to the eigenvalue \(-1\).

Form the matrix \(M\) of normalised eigenvectors, and the diagonal matrix \(S\) of eigenvalues; given that \(M^{-1}AM=S\), compute \(A\).

8.4 Classification of stationary points

Going back to our function \(f(x,y)\) of two variables, we get an associated quadratic form \[Q=\frac{\partial^2f}{\partial x^2}h^2+2\frac{\partial^2f}{\partial x\partial y}hk+\frac{\partial^2f}{\partial y^2}k^2=\begin{pmatrix}h&k\end{pmatrix}\begin{pmatrix}\frac{\partial^2f}{\partial x^2}&\frac{\partial^2f}{\partial x\,\partial y}\\\frac{\partial^2f}{\partial x\,\partial y}&\frac{\partial^2f}{\partial y^2}\end{pmatrix}\begin{pmatrix}h\\k\end{pmatrix},\] which describes the nature of the stationary point. We see that if \(Q\) is positive definite, then \(f\) has a minimum; if \(Q\) is negative definite, then \(f\) has a maximum, and if \(Q\) is indefinite, then \(f\) has a saddle point.

Recall that we saw that a real symmetric matrix \(A\) always has real eigenvalues, and that we can find an orthogonal matrix \(M\) whose columns are eigenvectors of length \(1\) such that \[M^TAM=M^{-1}AM=D,\] a diagonal matrix with the eigenvalues down the diagonal. Now we have:

Given a quadratic form in two (or more) variables, we can write it as \(\mathbf{v}^TA\mathbf{v}\) for some real symmetric matrix \(A\).
Given a real symmetric matrix \(A\), there is a matrix \(M\) (whose columns are the normalised eigenvectors of \(A\)) such that \(M^TAM\) is diagonal.

Let’s see where these two ideas lead.

Now if \(D=M^TAM=M^{-1}AM\), we have \[A=MM^{-1}AMM^{-1}=MDM^{-1}=MDM^T.\] Writing \(\mathbf{v}=\begin{pmatrix}h\\k\end{pmatrix}\), we also know that: \[\begin{aligned} Q(h,k)&=\mathbf{v}^TA\mathbf{v}\\ &=\mathbf{v}^TMDM^T\mathbf{v}\\ &=(M^T\mathbf{v})^TD(M^T\mathbf{v})\\ &=\mathbf{v}'{}^TD\mathbf{v}'\\ &=\begin{pmatrix}h'&k'\end{pmatrix}\begin{pmatrix}\lambda_1&0\\0&\lambda_2\end{pmatrix}\begin{pmatrix}h'\\k'\end{pmatrix}\\ &=\lambda_1h'^2+\lambda_2k'^2,\end{aligned}\] where \(\mathbf{v}'=M^T\mathbf{v}=\begin{pmatrix}h'\\k'\end{pmatrix}\), and where \(\lambda_1\) and \(\lambda_2\) are the eigenvalues of \(A\). Clearly

\(Q(h,k)>0\) for all nonzero \((h,k)\) when \(\lambda_1,\lambda_2>0\).
\(Q(h,k)<0\) for all nonzero \((h,k)\) when \(\lambda_1,\lambda_2<0\).
\(Q(h,k)\) is sometimes positive and sometimes negative when either \(\lambda_1>0\) and \(\lambda_2<0\) or vice versa.

(If either eigenvalue is \(0\), more work is needed to check the nature of the stationary point.) Note that we don’t need the actual eigenvalues, just their signs. Since the determinant of a matrix is the product of the eigenvalues, the case where \(\lambda_1>0\) and \(\lambda_2<0\) or vice versa can be checked by working out the determinant of the Hessian matrix; this will be \(\lambda_1\lambda_2\), and if it is negative, then one eigenvalue is positive and the other is negative.

Now we can classify stationary points:

Find the stationary points \((a,b)\), by solving \(f_x(a,b)=f_y(a,b)=0\).
For each stationary point \((a,b)\), evaluate all the second partial derivatives at \((a,b)\) to get the values of \(f_{xx}(a,b)\), \(f_{xy}(a,b)\) and \(f_{yy}(a,b)\), and form the Hessian matrix.
Work out the determinant of the Hessian matrix.
- If the determinant of the Hessian is negative, then one eigenvalue is positive and the other is negative, so we must be at a saddle point.
- If the determinant of the Hessian is positive, then both eigenvalues have the same sign, and \(Q\) is always positive or always negative away from \((0,0)\). Since \(Q(1,0)=f_{xx}(a,b)\), we can say that if \(f_{xx}(a,b)\) is positive, then \(Q\) is always positive, and we are at a minimum, and if \(f_{xx}(a,b)\) is negative, then \(Q\) is always negative, so that we are at a maximum.

Example 8.6 Consider \(f(x,y)=xye^{-x^2-y^2}\). Then (exercise): \[\begin{aligned} f_x&=y(1-2x^2)e^{-x^2-y^2}\\ f_y&=x(1-2y^2)e^{-x^2-y^2}\\ f_{xx}&=(-6xy+4x^3y)e^{-x^2-y^2}\\ f_{yy}&=(-6xy+4xy^3)e^{-x^2-y^2}\\ f_{xy}&=(1-2x^2-2y^2+4x^2y^2)e^{-x^2-y^2}.\end{aligned}\] We see that both \(f_x\) and \(f_y\) simultaneously vanish at \((0,0)\) and at \((\pm\frac{1}{\sqrt{2}},\pm\frac{1}{\sqrt{2}})\).

At \((0,0)\), we see \(f_{xx}(0,0)=f_{yy}(0,0)=0\) and \(f_{xy}(0,0)=1\), so the Hessian matrix is \(\begin{pmatrix}0&1\\1&0\end{pmatrix}\). Since the determinant is \(-1\), we have a saddle point.

Similarly, at \((\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}})\) and at \((-\frac{1}{\sqrt{2}},-\frac{1}{\sqrt{2}})\), we see that the Hessian is \(\begin{pmatrix}-\frac{2}{e}&0\\0&-\frac{2}{e}\end{pmatrix}\). The determinant is positive, and the top left-hand entry is negative, so we are at a maximum.

Finally, at \((\frac{1}{\sqrt{2}},-\frac{1}{\sqrt{2}})\) and at \((-\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}})\), we see that the Hessian is \(\begin{pmatrix}\frac{2}{e}&0\\0&\frac{2}{e}\end{pmatrix}\). The determinant is positive, and the top left-hand entry is positive, so we are at a minimum.

In fact, we are diagonalising the quadratic form:

Example 8.7 Suppose \(f(x,y)=2x^2+2xy+2y^2\).

We know that if \(\mathbf{v}=\begin{pmatrix}x\\y\end{pmatrix}\), then \(f(x,y)=\mathbf{v}^TA\mathbf{v}\) where \(A=\begin{pmatrix}2&1\\1&2\end{pmatrix}\).

We can easily compute that the eigenvectors of \(A\) are \(\begin{pmatrix}1\\1\end{pmatrix}\), with eigenvalue 3, and \(\begin{pmatrix}1\\-1\end{pmatrix}\), with eigenvalue 1.

We form \(M=\begin{pmatrix}\frac{1}{\sqrt{2}}&\frac{1}{\sqrt{2}}\\\frac{1}{\sqrt{2}}&-\frac{1}{\sqrt{2}}\end{pmatrix}\), so that the columns of \(M\) are the normalised eigenvectors of \(A\). We know that \[M^TAM=M^{-1}AM=D,\] the diagonal matrix whose diagonal entries are the corresponding eigenvalues, so \(D=\begin{pmatrix}3&0\\0&1\end{pmatrix}\).

As \(M^{-1}AM=D\), we see that \(A=MDM^{-1}=MDM^T\).

Then we can rewrite our quadratic form as \[\mathbf{v}^TA\mathbf{v}=\mathbf{v}^TMDM^T\mathbf{v}=(M^T\mathbf{v})^TD(M^T\mathbf{v}).\] Note that \[M^T\mathbf{v}=\begin{pmatrix}\frac{1}{\sqrt{2}}&\frac{1}{\sqrt{2}}\\\frac{1}{\sqrt{2}}&-\frac{1}{\sqrt{2}}\end{pmatrix}\begin{pmatrix}x\\y\end{pmatrix}=\begin{pmatrix}(x+y)/\sqrt{2}\\(x-y)/\sqrt{2}\end{pmatrix}.\] So \[\begin{eqnarray*} \mathbf{v}^TA\mathbf{v}&=&\mathbf{v}^TMDM^T\mathbf{v}=(M^T\mathbf{v})^TD(M^T\mathbf{v})\\ &=&\begin{pmatrix}(x+y)/\sqrt{2}&(x-y)/\sqrt{2}\end{pmatrix}\begin{pmatrix}3&0\\0&1\end{pmatrix}\begin{pmatrix}(x+y)/\sqrt{2}\\(x-y)/\sqrt{2}\end{pmatrix}\\ &=&3\left(\frac{x+y}{\sqrt{2}}\right)^2+1\left(\frac{x-y}{\sqrt{2}}\right)^2\\ &=&\frac{3}{2}(x+y)^2+\frac{1}{2}(x-y)^2. \end{eqnarray*}\] We have diagonalised the quadratic form – we have taken our original form \(\mathbf{v}^TA\mathbf{v}\) and rewritten it as \(\mathbf{v}'{}^TD\mathbf{v}'\), where \(D\) is diagonal, and \(\mathbf{v}'=\begin{pmatrix}x'\\y'\end{pmatrix}\) with \(x'=(x+y)/\sqrt{2}\) and \(y'=(x-y)/\sqrt{2}\): \[f(x,y)=2x^2+2xy+2y^2=3x'{}^2+y'{}^2.\]

Clearly the same kind of thing happens for any quadratic form (with any number of variables). We can write any quadratic form as \(\mathbf{v}^TA\mathbf{v}\), and using \(A=MDM^T\) for a diagonal matrix \(D\), we find that \[\mathbf{v}^TA\mathbf{v}=\mathbf{v}^TMDM^T\mathbf{v}=(M^T\mathbf{v})^TD(M^T\mathbf{v})=\mathbf{v}'{}^TD\mathbf{v}'\] if we write \(\mathbf{v}'=M^T\mathbf{v}\).

Example 8.8 Diagonalise \(x^2+4xy-2y^2\).

We need to write the form as \(\mathbf{v}^TA\mathbf{v}\), where \(A=\begin{pmatrix}1&2\\2&-2\end{pmatrix}\). The eigenvalues of \(A\) are \(2\) and \(-3\); the corresponding eigenvectors are \(\begin{pmatrix}2\\1\end{pmatrix}\) and \(\begin{pmatrix}1\\-2\end{pmatrix}\). Each of these has length \(\sqrt{5}\), so we can normalise them by dividing both by \(\sqrt{5}\). We then write \(M=\begin{pmatrix}\frac{2}{\sqrt{5}}&\frac{1}{\sqrt{5}}\\\frac{1}{\sqrt{5}}&-\frac{2}{\sqrt{5}}\end{pmatrix}\). In this example, \(M=M^T\); we form \[\mathbf{v}'=M^T\mathbf{v}=\begin{pmatrix}\frac{2}{\sqrt{5}}&\frac{1}{\sqrt{5}}\\\frac{1}{\sqrt{5}}&-\frac{2}{\sqrt{5}}\end{pmatrix}\begin{pmatrix}x\\y\end{pmatrix}=\begin{pmatrix}(2x+y)/\sqrt{5}\\(x-2y)/\sqrt{5}\end{pmatrix},\] and \(D=\begin{pmatrix}2&0\\0&-3\end{pmatrix}\). Then \[x^2+4xy-2y^2=\mathbf{v}'{}^TD\mathbf{v}'=2\left(\frac{2x+y}{\sqrt{5}}\right)^2-3\left(\frac{x-2y}{\sqrt{5}}\right)^2.\] Of course, we can simplify this as \[\frac{2}{5}(2x+y)^2-\frac{3}{5}(x-2y)^2.\]

There are lots of other ways of diagonalising quadratic forms (e.g., completing the square gives \(x^2+4xy-2y^2=(x+2y)^2-6y^2\)). The one using eigenvectors is the most useful in practice. It is known (Sylvester’s Law of Inertia) that, given a quadratic form, every way to diagonalise it will have the same number of \(+\) signs, and the same number of \(-\) signs. The eigenvector method gives us the directions of the main (“principal”) axes for the function, in the same sort of sense that an ellipse has two axes, corresponding to the furthest and closest points from the origin.

Given the quadratic form \(Q=\mathbf{v}^TA\mathbf{v}\), put \(\mathbf{v}=M\mathbf{w}\), so that \(\mathbf{w}=M^{-1}\mathbf{v}=M^T\mathbf{v}\). Then the quadratic form becomes \[Q=\mathbf{v}^TA\mathbf{v}=(M\mathbf{w})^TA(M\mathbf{w})=\mathbf{w}^TM^TAM\mathbf{w}=\mathbf{w}^TD\mathbf{w}.\] If \(\mathbf{w}=\begin{pmatrix}w_1\\\vdots\\w_n\end{pmatrix}\) and \(D=\begin{pmatrix}\lambda_1&0&\ldots&0\\0&\lambda_2&\ldots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\ldots&\lambda_n\end{pmatrix}\), then \[\mathbf{w}^TD\mathbf{w}=\lambda_1w_1^2+\cdots+\lambda_nw_n^2.\] It is very easy to see when quadratic forms like this are positive definite; we just need all \(\lambda_i>0\). Similarly, \(Q\) is negative definite if every \(\lambda_i<0\), and \(Q\) is indefinite if some \(\lambda_i>0\) and other \(\lambda_j<0\).

We conclude that if \(Q\) is a quadratic form \(\mathbf{v}^TA\mathbf{v}\), then

\(Q\) is positive definite if \(Q>0\) for all vectors \(\mathbf{v}\ne0\) and this happens if all the eigenvalues of \(A\) are positive;
\(Q\) is negative definite if \(Q<0\) for all vectors \(\mathbf{v}\ne0\) and this happens if all the eigenvalues of \(A\) are negative;
\(Q\) is indefinite if \(Q\) takes at least one positive value and at least one negative value and this happens if some of the eigenvalues of \(A\) are positive and others are negative.

For example, the form \(x^2+4xy-2y^2\) corresponds to a saddle point; the expression \[x^2+4xy-2y^2=\frac{2}{5}(2x+y)^2-\frac{3}{5}(x-2y)^2\] means that the function increases most quickly when \(x-2y=0\) (since there is then no negative contribution to the value) and decreases most quickly when \(2x+y=0\).

As a summary of this section, the Taylor series of a function \(f(x,y)\) of 2 variables at a point \((a,b)\) can be written:

\[f(a+h,b+k)\approx f(a,b)+\begin{pmatrix}\frac{\partial f}{\partial x}&\frac{\partial f}{\partial y}\end{pmatrix}\begin{pmatrix}h\\k\end{pmatrix}+\frac{1}{2!}\begin{pmatrix}h&k\end{pmatrix}\begin{pmatrix}\frac{\partial^2f}{\partial x^2}&\frac{\partial^2f}{\partial x\,\partial y}\\\frac{\partial^2f}{\partial x\,\partial y}&\frac{\partial^2f}{\partial y^2}\end{pmatrix}\begin{pmatrix}h\\k\end{pmatrix};\]

(where we simply write \(\frac{\partial f}{\partial x}\) etc., to stand for its value at the point \((a,b)\) for simplicity). As for functions of 1 variable, the function has a stationary point when the linear terms vanish, and then the nature of the stationary point can (usually) be read off from eigenvalues of the matrix of second partial derivatives occurring in the quadratic term in the Taylor series (this matrix is known as the Hessian matrix).

8.5 Constrained optimisation

We have explained how to find the maximum of a function of 2 variables: we simply put all the first order partial derivatives equal to \(0\) to find the stationary points, and then see whether the eigenvalues of the Hessian matrix at one or more of these points are all negative.

But it is quite common in mathematics to need to find a maximum under some additional constraints. This is sometimes known as constrained optimisation. We will stick to the case of a function of 2 variables, although it generalises with no extra work. To begin with, let’s consider the problem of finding the maximum value of \(f(x,y)\) subject to a constraint that \(g(x,y)=0\). We require in this theory that both \(f\) and \(g\) are differentiable.

First, we’ll recall what it means for a point \((x_0,y_0)\) to be a maximum of a function \(f(x,y)\). The idea is that any move away from the point, to \((x_0+h,y_0+k)\) say, will produce a smaller value of \(f\). That is, \[f(x_0+h,y_0+k)\le f(x_0,y_0).\] Assuming the differentiability of \(f\), there is a Taylor series for \(f\): \[f(x_0+h,y_0+k)=f(x_0,y_0)+hf_x(x_0,y_0)+kf_y(x_0,y_0)+\cdots.\] Combining these two means that we need that the two partial derivatives \(f_x=\frac{\partial f}{\partial x}\) and \(f_y=\frac{\partial f}{\partial y}\) must both vanish at \((x_0,y_0)\), otherwise we could choose \(h\) and \(k\) to make \(hf_x(x_0,y_0)+kf_y(x_0,y_0)\) positive, which would increase the value of \(f\). So we need \[\begin{eqnarray*} f_x(x_0,y_0)&=&0,\\ f_y(x_0,y_0)&=&0, \end{eqnarray*}\] which is the criterion for \((x_0,y_0)\) to be a stationary point for \(f\). (There will also be a condition on the eigenvalues of the Hessian if we specify that we want to get a maximum or minimum.)

Now let’s add the constraint that we want to maximise \(f(x,y)\) subject to the constraint that \(g(x,y)=0\). We follow the argument as above, except that we are no longer allowed to pick any \(h\) and \(k\) – we can only move along the curve \(g(x,y)=0\). So we need both that \(g(x_0,y_0)=0\) and that \(g(x_0+h,y_0+k)=0\). Using the Taylor series for \(g\), we see that we must have \[\begin{equation} hg_x(x_0,y_0)+kg_y(x_0,y_0)=0. \tag{8.1} \end{equation}\] So we aren’t allowed to pick any \(h\) and \(k\) – they must be related by equation (8.1) in order that we keep to the constraint. We can now reformulate the problem: For all \(h\) and \(k\) satisfying (8.1), we need \(hf_x(x_0,y_0)+kf_y(x_0,y_0)=0\) to find the maximum of \(f\) subject to \(g=0\).

In other words, whenever \((h,k)\) is orthogonal to \((g_x(x_0,y_0),g_y(x_0,y_0))\), it must also be orthogonal to \((f_x(x_0,y_0),f_y(x_0,y_0))\). This happens when \((f_x(x_0,y_0),f_y(x_0,y_0))\) and \((g_x(x_0,y_0),g_y(x_0,y_0))\) are parallel, i.e., when \((f_x(x_0,y_0),f_y(x_0,y_0))=\lambda(g_x(x_0,y_0),g_y(x_0,y_0))\).

We can see this in the following picture. Here, the blue dotted lines indicate level curves for \(f\), and the red line is the set of \((x,y)\) where \(g(x,y)=0\). If the constraint \(g=0\) is not tangent to a level curve of \(f\) (like at either of the two marked points), we can move along \(g=0\) in a direction which increases or decreases the value of \(f\), and so we cannot be at a maximum or minimum. So we need the graph \(g=0\) to be tangent to the level curve of \(f\):

Figure 8.1: Constrained optimisation

The condition that \((f_x(x_0,y_0),f_y(x_0,y_0))=\lambda(g_x(x_0,y_0),g_y(x_0,y_0))\) implies that: \[\begin{eqnarray*} f_x(x_0,y_0)&=&\lambda g_x(x_0,y_0),\\ f_y(x_0,y_0)&=&\lambda g_y(x_0,y_0), \end{eqnarray*}\] for some \(\lambda\), or alternatively, \[\begin{eqnarray*} f_x(x_0,y_0)-\lambda g_x(x_0,y_0)&=&0,\\ f_y(x_0,y_0)-\lambda g_y(x_0,y_0)&=&0. \end{eqnarray*}\] So we consider \(\Omega=f-\lambda g\), and look for points \((x_0,y_0)\) where \(\frac{\partial\Omega}{\partial x}=0\) and \(\frac{\partial\Omega}{\partial y}=0\). The constant \(\lambda\) is known as the Lagrange multiplier. Exactly the same happens for minima or any other stationary point.

Example 8.9 If we are asked to minimise \(x_1^2+x_2^2\), subject to \(x_1+x_2=1\), we write \(\Omega=x_1^2+x_2^2-\lambda(x_1+x_2-1)\). Then \(\frac{\partial\Omega}{\partial x_1}=2x_1-\lambda\), \(\frac{\partial\Omega}{\partial x_2}=2x_2-\lambda\), and \(\frac{\partial\Omega}{\partial\lambda}=x_1+x_2-1\). It is easy to see that the solution to all these equations is at \(x_1=x_2=\frac{1}{2}\).

Similar methods work for 2 or more constraints. In fact, we can also do similar things if our constraints are inequalities, when we get “Karush-Kuhn-Tucker multipliers”.