How to read this lecture...

Code should execute sequentially if run in a Jupyter notebook

Orthogonal Projections and Their Applications

Overview

Orthogonal projection is a cornerstone of vector space methods, with many diverse applications

These include, but are not limited to,

  • Least squares projection, also known as linear regression
  • Conditional expectations for multivariate normal (Gaussian) distributions
  • Gram–Schmidt orthogonalization
  • QR decomposition
  • Orthogonal polynomials
  • etc

In this lecture we focus on

  • key ideas
  • least squares regression

Further Reading

For background and foundational concepts, see our lecture on linear algebra

For more proofs and greater theoretical detail, see A Primer in Econometric Theory

For a complete set of proofs in a general setting, see, for example, [Rom05]

For an advanced treatment of projection in the context of least squares prediction, see this book chapter

Key Definitions

Assume \(x, z \in \RR^n\)

Define \(\langle x, z\rangle = \sum_i x_i z_i\)

Recall \(\|x \|^2 = \langle x, x \rangle\)

The law of cosines states that \(\langle x, z \rangle = \| x \| \| z \| \cos(\theta)\) where \(\theta\) is the angle between the vectors \(x\) and \(z\)

When \(\langle x, z\rangle = 0\), then \(\cos(\theta) = 0\) and \(x\) and \(z\) are said to be orthogonal and we write \(x \perp z\)
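These definitions are easy to check numerically. Here is a minimal sketch in Python with NumPy (the lecture's own code below uses Julia; the vectors here are arbitrary illustrative choices):

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([-2.0, 1.0])   # chosen so that <x, z> = 0

inner = x @ z                     # <x, z> = sum_i x_i z_i
norm_x = np.sqrt(x @ x)           # ||x|| = sqrt(<x, x>)

# law of cosines: cos(theta) = <x, z> / (||x|| ||z||)
cos_theta = inner / (np.linalg.norm(x) * np.linalg.norm(z))
```

Since the inner product is zero, so is \(\cos(\theta)\), and hence \(x \perp z\).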

../_images/orth_proj_def1.png

For a linear subspace \(S \subset \RR^n\), we call \(x \in \RR^n\) orthogonal to \(S\) if \(x \perp z\) for all \(z \in S\), and write \(x \perp S\)

../_images/orth_proj_def2.png

The orthogonal complement of linear subspace \(S \subset \RR^n\) is the set \(S^{\perp} := \{x \in \RR^n \,:\, x \perp S\}\)

../_images/orth_proj_def3.png

\(S^\perp\) is a linear subspace of \(\RR^n\)

  • To see this, fix \(x, y \in S^{\perp}\) and \(\alpha, \beta \in \RR\)
  • Observe that if \(z \in S\), then
\[\langle \alpha x + \beta y, z \rangle = \alpha \langle x, z \rangle + \beta \langle y, z \rangle = \alpha \times 0 + \beta \times 0 = 0\]
  • Hence \(\alpha x + \beta y \in S^{\perp}\), as was to be shown

A set of vectors \(\{x_1, \ldots, x_k\} \subset \RR^n\) is called an orthogonal set if \(x_i \perp x_j\) whenever \(i \not= j\)

If \(\{x_1, \ldots, x_k\}\) is an orthogonal set, then the Pythagorean Law states that

\[\| x_1 + \cdots + x_k \|^2 = \| x_1 \|^2 + \cdots + \| x_k \|^2\]

For example, when \(k=2\), \(x_1 \perp x_2\) implies

\[\| x_1 + x_2 \|^2 = \langle x_1 + x_2, x_1 + x_2 \rangle = \langle x_1, x_1 \rangle + 2 \langle x_2, x_1 \rangle + \langle x_2, x_2 \rangle = \| x_1 \|^2 + \| x_2 \|^2\]
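The Pythagorean Law can be confirmed numerically; a quick sketch in Python with NumPy, using two vectors that are orthogonal by construction:

```python
import numpy as np

x1 = np.array([3.0, 0.0, 0.0])
x2 = np.array([0.0, 4.0, 0.0])   # x1 ⊥ x2 by construction

lhs = np.linalg.norm(x1 + x2) ** 2
rhs = np.linalg.norm(x1) ** 2 + np.linalg.norm(x2) ** 2
# lhs and rhs both equal 25
```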

Linear Independence vs Orthogonality

If \(X \subset \RR^n\) is an orthogonal set and \(0 \notin X\), then \(X\) is linearly independent

Proving this is a nice exercise

While the converse is not true, a kind of partial converse holds, as we’ll see below

The Orthogonal Projection Theorem

What vector within a linear subspace of \(\RR^n\) best approximates a given vector in \(\RR^n\)?

The next theorem answers this question

Theorem (OPT) Given \(y \in \RR^n\) and linear subspace \(S \subset \RR^n\), there exists a unique solution to the minimization problem

\[\hat y := \argmin_{z \in S} \|y - z\|\]

The minimizer \(\hat y\) is the unique vector in \(\RR^n\) that satisfies

  • \(\hat y \in S\)
  • \(y - \hat y \perp S\)

The vector \(\hat y\) is called the orthogonal projection of \(y\) onto \(S\)

The next figure provides some intuition

../_images/orth_proj_thm1.png

Proof of sufficiency

We’ll omit the full proof.

But we will prove sufficiency of the asserted conditions

To this end, let \(y \in \RR^n\) and let \(S\) be a linear subspace of \(\RR^n\)

Let \(\hat y\) be a vector in \(\RR^n\) such that \(\hat y \in S\) and \(y - \hat y \perp S\)

Let \(z\) be any other point in \(S\) and use the fact that \(S\) is a linear subspace (so that \(\hat y - z \in S\), and hence \(y - \hat y \perp \hat y - z\)) to deduce via the Pythagorean Law

\[\| y - z \|^2 = \| (y - \hat y) + (\hat y - z) \|^2 = \| y - \hat y \|^2 + \| \hat y - z \|^2\]

Hence \(\| y - z \| \geq \| y - \hat y \|\), which completes the proof

Orthogonal Projection as a Mapping

For a linear space \(Y\) and a fixed linear subspace \(S\), we have a functional relationship

\[y \in Y\; \mapsto \text{ its orthogonal projection } \hat y \in S\]

By the OPT, this is a well-defined mapping or operator from \(\RR^n\) to \(\RR^n\)

In what follows we denote this operator by a matrix \(P\)

  • \(P y\) represents the projection \(\hat y\)
  • This is sometimes expressed as \(\hat E_S y = P y\), where \(\hat E\) denotes a wide-sense expectations operator and the subscript \(S\) indicates that we are projecting \(y\) onto the linear subspace \(S\)

The operator \(P\) is called the orthogonal projection mapping onto \(S\)

../_images/orth_proj_thm2.png

It is immediate from the OPT that for any \(y \in \RR^n\)

  1. \(P y \in S\) and
  2. \(y - P y \perp S\)

From this we can deduce additional useful properties, such as

  1. \(\| y \|^2 = \| P y \|^2 + \| y - P y \|^2\) and
  2. \(\| P y \| \leq \| y \|\)

For example, to prove 1, observe that \(y = P y + y - P y\) and apply the Pythagorean law

Orthogonal Complement

Let \(S \subset \RR^n\).

The orthogonal complement of \(S\) is the linear subspace \(S^{\perp}\) that satisfies \(x_1 \perp x_2\) for every \(x_1 \in S\) and \(x_2 \in S^{\perp}\)

Let \(Y\) be a linear space with linear subspace \(S\) and its orthogonal complement \(S^{\perp}\)

We write

\[Y = S \oplus S^{\perp}\]

to indicate that for every \(y \in Y\) there is unique \(x_1 \in S\) and a unique \(x_2 \in S^{\perp}\) such that \(y = x_1 + x_2\).

Moreover, \(x_1 = \hat E_S y\) and \(x_2 = y - \hat E_S y\)

This amounts to another version of the OPT:

Theorem. If \(S\) is a linear subspace of \(\RR^n\), \(\hat E_S y = P y\) and \(\hat E_{S^{\perp}} y = M y\), then

\[ P y \perp M y \quad \text{and} \quad y = P y + M y \quad \text{for all } \, y \in \RR^n\]

The next figure illustrates

../_images/orth_proj_thm3.png

Orthonormal Basis

An orthogonal set of vectors \(O \subset \RR^n\) is called an orthonormal set if \(\| u \| = 1\) for all \(u \in O\)

Let \(S\) be a linear subspace of \(\RR^n\) and let \(O \subset S\)

If \(O\) is orthonormal and \(\Span O = S\), then \(O\) is called an orthonormal basis of \(S\)

\(O\) is necessarily a basis of \(S\) (being independent by orthogonality and the fact that no element is the zero vector)

One example of an orthonormal set is the canonical basis \(\{e_1, \ldots, e_n\}\), which forms an orthonormal basis of \(\RR^n\), where \(e_i\) is the \(i\)-th unit vector

If \(\{u_1, \ldots, u_k\}\) is an orthonormal basis of linear subspace \(S\), then

\[x = \sum_{i=1}^k \langle x, u_i \rangle u_i \quad \text{for all} \quad x \in S\]

To see this, observe that since \(x \in \Span\{u_1, \ldots, u_k\}\), we can find scalars \(\alpha_1, \ldots, \alpha_k\) that verify

(1)\[x = \sum_{j=1}^k \alpha_j u_j\]

Taking the inner product with respect to \(u_i\) gives

\[\langle x, u_i \rangle = \sum_{j=1}^k \alpha_j \langle u_j, u_i \rangle = \alpha_i\]

Combining this result with (1) verifies the claim
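A quick numerical illustration of this expansion, in Python with NumPy; the orthonormal pair spanning a plane in \(\RR^3\) is an illustrative choice:

```python
import numpy as np

# An orthonormal pair spanning a two-dimensional subspace S of R^3
u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)

# Any x in S is recovered from its coordinates <x, u_i>
x = 2.0 * u1 - 3.0 * u2
reconstruction = (x @ u1) * u1 + (x @ u2) * u2
```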

Projection onto an Orthonormal Basis

When we have an orthonormal basis for the subspace onto which we are projecting, computing the projection simplifies:

Theorem If \(\{u_1, \ldots, u_k\}\) is an orthonormal basis for \(S\), then

(2)\[P y = \sum_{i=1}^k \langle y, u_i \rangle u_i, \quad \forall \; y \in \RR^n\]

Proof: Fix \(y \in \RR^n\) and let \(P y\) be defined as in (2)

Clearly, \(P y \in S\)

We claim that \(y - P y \perp S\) also holds

It suffices to show that \(y - P y\) is orthogonal to each basis vector \(u_i\) (why?)

This is true because

\[\left\langle y - \sum_{i=1}^k \langle y, u_i \rangle u_i, u_j \right\rangle = \langle y, u_j \rangle - \sum_{i=1}^k \langle y, u_i \rangle \langle u_i, u_j \rangle = 0\]
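Here is a numerical sketch of the theorem in Python with NumPy; the basis and the vector \(y\) are arbitrary illustrative choices:

```python
import numpy as np

u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)   # orthonormal basis of S

y = np.array([1.0, 2.0, -1.0])                # arbitrary y, not in S
Py = (y @ u1) * u1 + (y @ u2) * u2

# OPT conditions: Py ∈ S by construction, and y - Py ⊥ each u_i
r1 = (y - Py) @ u1
r2 = (y - Py) @ u2
```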

Projection Using Matrix Algebra

Let \(S\) be a linear subspace of \(\RR^n\) and let \(y \in \RR^n\).

We want to compute the matrix \(P\) that verifies

\[\hat E_S y = P y\]

Evidently the mapping \(y \mapsto P y\) is linear, and hence can be represented by a matrix acting on \(y \in \RR^n\)

This reference on the matrix representation of linear maps is useful: https://en.wikipedia.org/wiki/Linear_map#Matrices

Theorem. Let the columns of \(n \times k\) matrix \(X\) form a basis of \(S\). Then

\[P = X (X'X)^{-1} X'\]

Proof: Given arbitrary \(y \in \RR^n\) and \(P = X (X'X)^{-1} X'\), our claim is that

  1. \(P y \in S\), and
  2. \(y - P y \perp S\)

Claim 1 is true because

\[P y = X (X' X)^{-1} X' y = X a \quad \text{where} \quad a := (X' X)^{-1} X' y\]

An expression of the form \(X a\) is precisely a linear combination of the columns of \(X\), and hence an element of \(S\)

Claim 2 is equivalent to the statement

\[y - X (X' X)^{-1} X' y \, \perp\, X b \quad \text{for all} \quad b \in \RR^k\]

This is true: if \(b \in \RR^k\), then

\[(X b)' [y - X (X' X)^{-1} X' y] = b' [X' y - X' y] = 0\]

The proof is now complete
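Both claims are easy to verify numerically. A sketch in Python with NumPy, for an illustrative \(X\) and \(y\):

```python
import numpy as np

X = np.array([[1.0,  0.0],
              [0.0, -6.0],
              [2.0,  2.0]])     # columns form a basis of S
y = np.array([1.0, 3.0, -3.0])

P = X @ np.linalg.inv(X.T @ X) @ X.T
Py = P @ y

# Claim 1: Py = X a, a linear combination of the columns of X
a = np.linalg.inv(X.T @ X) @ X.T @ y
# Claim 2: the residual is orthogonal to every column of X
residual_inner_products = X.T @ (y - Py)
```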

Starting with \(X\)

It is common in applications to start with \(n \times k\) matrix \(X\) with linearly independent columns and let

\[S := \Span X := \Span \{\col_1 X, \ldots, \col_k X \}\]

Then the columns of \(X\) form a basis of \(S\)

From the preceding theorem, \(P = X (X' X)^{-1} X'\) projects \(y\) onto \(S\)

In this context, \(P\) is often called the projection matrix

  • The matrix \(M = I - P\) satisfies \(M y = \hat E_{S^{\perp}} y\) and is sometimes called the annihilator matrix
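A quick check in Python with NumPy that \(P\) and the annihilator \(M\) split \(y\) into orthogonal components (the matrices are illustrative):

```python
import numpy as np

X = np.array([[1.0,  0.0],
              [0.0, -6.0],
              [2.0,  2.0]])
P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(3) - P               # the annihilator

y = np.array([1.0, 3.0, -3.0])
# y splits into orthogonal pieces: y = Py + My with Py ⊥ My
```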

The Orthonormal Case

Suppose that \(U\) is \(n \times k\) with orthonormal columns

Let \(u_i := \col_i U\) for each \(i\), let \(S := \Span U\) and let \(y \in \RR^n\)

We know that the projection of \(y\) onto \(S\) is

\[P y = U (U' U)^{-1} U' y\]

Since \(U\) has orthonormal columns, we have \(U' U = I\)

Hence

\[P y = U U' y = \sum_{i=1}^k \langle u_i, y \rangle u_i\]

We have recovered our earlier result about projecting onto the span of an orthonormal basis
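A numerical confirmation in Python with NumPy; here we obtain orthonormal columns via NumPy's QR routine rather than Gram-Schmidt:

```python
import numpy as np

X = np.array([[1.0,  0.0],
              [0.0, -6.0],
              [2.0,  2.0]])
y = np.array([1.0, 3.0, -3.0])

U = np.linalg.qr(X)[0]          # n x k matrix with orthonormal columns
P = X @ np.linalg.inv(X.T @ X) @ X.T

# U'U = I, so the projection reduces to U U' y
Py_direct = P @ y
Py_orthonormal = U @ U.T @ y
```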

Application: Overdetermined Systems of Equations

Let \(y \in \RR^n\) and let \(X\) be \(n \times k\) with linearly independent columns

Given \(X\) and \(y\), we seek \(b \in \RR^k\) satisfying the system of linear equations \(X b = y\)

If \(n > k\) (more equations than unknowns), then the system is said to be overdetermined

Intuitively, we may not be able to find a \(b\) that satisfies all \(n\) equations

The best approach here is to

  • Accept that an exact solution may not exist
  • Look instead for an approximate solution

By approximate solution, we mean a \(b \in \RR^k\) such that \(X b\) is as close to \(y\) as possible

The next theorem shows that the solution is well defined and unique

The proof uses the OPT

Theorem The unique minimizer of \(\| y - X b \|\) over \(b \in \RR^k\) is

\[\hat \beta := (X' X)^{-1} X' y\]

Proof: Note that

\[X \hat \beta = X (X' X)^{-1} X' y = P y\]

Since \(P y\) is the orthogonal projection onto \(\Span(X)\) we have

\[\| y - P y \| \leq \| y - z \| \text{ for any } z \in \Span(X)\]

Because \(Xb \in \Span(X)\)

\[\| y - X \hat \beta \| \leq \| y - X b \| \text{ for any } b \in \RR^k\]

This is what we aimed to show
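As a sanity check, we can compare the formula against a library least squares solver; a sketch in Python with NumPy, for an illustrative overdetermined system:

```python
import numpy as np

X = np.array([[1.0,  0.0],
              [0.0, -6.0],
              [2.0,  2.0]])     # n = 3 > k = 2: overdetermined
y = np.array([1.0, 3.0, -3.0])

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y        # (X'X)^{-1} X'y
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # library least squares
```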

Least Squares Regression

Let’s apply the theory of orthogonal projection to least squares regression

This approach provides insights about many geometric properties of linear regression

We treat only some examples

Squared risk measures

Given pairs \((x, y) \in \RR^K \times \RR\), consider choosing \(f \colon \RR^K \to \RR\) to minimize the risk

\[R(f) := \EE [(y - f(x))^2]\]

If probabilities and hence \(\EE\) are unknown, we cannot solve this problem directly

However, if a sample is available, we can estimate the risk with the empirical risk:

\[\min_{f \in \fF} \frac{1}{N} \sum_{n=1}^N (y_n - f(x_n))^2\]

Minimizing this expression is called empirical risk minimization

The set \(\fF\) is sometimes called the hypothesis space

The theory of statistical learning tells us that to prevent overfitting we should take the set \(\fF\) to be relatively simple

If we let \(\fF\) be the class of linear functions and drop the constant \(1/N\) (which does not affect minimizers), the problem becomes

\[\min_{b \in \RR^K} \; \sum_{n=1}^N (y_n - b' x_n)^2\]

This is the sample linear least squares problem

Solution

Define the matrices

\[\begin{split}y := \left( \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right), \quad x_n := \left( \begin{array}{c} x_{n1} \\ x_{n2} \\ \vdots \\ x_{nK} \end{array} \right) = \text{ $n$-th obs on all regressors}\end{split}\]

and

\[\begin{split}X := \left( \begin{array}{c} x_1' \\ x_2' \\ \vdots \\ x_N' \end{array} \right) :=: \left( \begin{array}{cccc} x_{11} & x_{12} & \cdots & x_{1K} \\ x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NK} \end{array} \right)\end{split}\]

We assume throughout that \(N > K\) and \(X\) is full column rank

If you work through the algebra, you will be able to verify that \(\| y - X b \|^2 = \sum_{n=1}^N (y_n - b' x_n)^2\)

Since monotone transforms don’t affect minimizers, we have

\[\argmin_{b \in \RR^K} \sum_{n=1}^N (y_n - b' x_n)^2 = \argmin_{b \in \RR^K} \| y - X b \|\]

By our results about overdetermined linear systems of equations, the solution is

\[\hat \beta := (X' X)^{-1} X' y\]

Let \(P\) and \(M\) be the projection and annihilator associated with \(X\):

\[P := X (X' X)^{-1} X' \quad \text{and} \quad M := I - P\]

The vector of fitted values is

\[\hat y := X \hat \beta = P y\]

The vector of residuals is

\[\hat u := y - \hat y = y - P y = M y\]

Here are some more standard definitions:

  • The total sum of squares is \(\text{TSS} := \| y \|^2\)
  • The sum of squared residuals is \(\text{SSR} := \| \hat u \|^2\)
  • The explained sum of squares is \(\text{ESS} := \| \hat y \|^2\)

These quantities always satisfy

\[\text{TSS} = \text{ESS} + \text{SSR}\]

We can prove this easily using the OPT

From the OPT we have \(y = \hat y + \hat u\) and \(\hat u \perp \hat y\)

Applying the Pythagorean law completes the proof
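A numerical version of this argument in Python with NumPy, using an illustrative \(X\) and \(y\):

```python
import numpy as np

X = np.array([[1.0,  0.0],
              [0.0, -6.0],
              [2.0,  2.0]])
y = np.array([1.0, 3.0, -3.0])

P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = P @ y            # fitted values
u_hat = y - y_hat        # residuals, orthogonal to y_hat

TSS = y @ y
ESS = y_hat @ y_hat
SSR = u_hat @ u_hat
```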

Orthogonalization and Decomposition

Let’s return to the connection between linear independence and orthogonality touched on above

A result of much interest is a famous algorithm for constructing orthonormal sets from linearly independent sets

The next section gives details

Gram-Schmidt Orthogonalization

Theorem For each linearly independent set \(\{x_1, \ldots, x_k\} \subset \RR^n\), there exists an orthonormal set \(\{u_1, \ldots, u_k\}\) with

\[\Span \{x_1, \ldots, x_i\} = \Span \{u_1, \ldots, u_i\} \quad \text{for} \quad i = 1, \ldots, k\]

The Gram-Schmidt orthogonalization procedure constructs an orthonormal set \(\{ u_1, u_2, \ldots, u_k\}\) with these properties

One description of this procedure is as follows:

  • For \(i = 1, \ldots, k\), form \(S_i := \Span\{x_1, \ldots, x_i\}\) and \(S_i^{\perp}\)
  • Set \(v_1 = x_1\)
  • For \(i \geq 2\) set \(v_i := \hat E_{S_{i-1}^{\perp}} x_i\) and \(u_i := v_i / \| v_i \|\)

The sequence \(u_1, \ldots, u_k\) has the stated properties

A Gram-Schmidt orthogonalization construction is a key idea behind the Kalman filter described in A First Look at the Kalman Filter

In some exercises below you are asked to implement this algorithm and test it using projection

QR Decomposition

The following result uses the preceding algorithm to produce a useful decomposition

Theorem If \(X\) is \(n \times k\) with linearly independent columns, then there exists a factorization \(X = Q R\) where

  • \(R\) is \(k \times k\), upper triangular, and nonsingular
  • \(Q\) is \(n \times k\) with orthonormal columns

Proof sketch: Let

  • \(x_j := \col_j (X)\)
  • \(\{u_1, \ldots, u_k\}\) be orthonormal with same span as \(\{x_1, \ldots, x_k\}\) (to be constructed using Gram–Schmidt)
  • \(Q\) be the matrix formed from the columns \(u_i\)

Since \(x_j \in \Span\{u_1, \ldots, u_j\}\), we have

\[x_j = \sum_{i=1}^j \langle u_i, x_j \rangle u_i \quad \text{for } j = 1, \ldots, k\]

Some rearranging gives \(X = Q R\)
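We can inspect the claimed properties of \(Q\) and \(R\) numerically; a sketch in Python with NumPy, for an illustrative \(X\):

```python
import numpy as np

X = np.array([[1.0,  0.0],
              [0.0, -6.0],
              [2.0,  2.0]])
Q, R = np.linalg.qr(X)   # reduced QR: Q is n x k, R is k x k

# Q has orthonormal columns, R is upper triangular, and X = QR
```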

Linear Regression via QR Decomposition

For matrices \(X\) and \(y\) that overdetermine \(\beta\) in the linear equation system \(y = X \beta\), we found the least squares approximator \(\hat \beta = (X' X)^{-1} X' y\)

Using the QR decomposition \(X = Q R\) gives

\[\begin{split}\begin{aligned} \hat \beta & = (R'Q' Q R)^{-1} R' Q' y \\ & = (R' R)^{-1} R' Q' y \\ & = R^{-1} (R')^{-1} R' Q' y = R^{-1} Q' y \end{aligned}\end{split}\]

Numerical routines would in this case use the alternative form \(R \hat \beta = Q' y\) and back substitution
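A sketch of this computation in Python with NumPy (using a generic triangular solve in place of specialized back substitution):

```python
import numpy as np

X = np.array([[1.0,  0.0],
              [0.0, -6.0],
              [2.0,  2.0]])
y = np.array([1.0, 3.0, -3.0])

Q, R = np.linalg.qr(X)
# Solve R beta = Q'y (production code would use back substitution,
# since R is triangular; np.linalg.solve is a generic stand-in)
beta_qr = np.linalg.solve(R, Q.T @ y)
beta_direct = np.linalg.inv(X.T @ X) @ X.T @ y
```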

Exercises

Exercise 1

Show that, for any linear subspace \(S \subset \RR^n\), \(S \cap S^{\perp} = \{0\}\)

Exercise 2

Let \(P = X (X' X)^{-1} X'\) and let \(M = I - P\). Show that \(P\) and \(M\) are both idempotent and symmetric. Can you give any intuition as to why they should be idempotent?

Exercise 3

Using Gram-Schmidt orthogonalization, write a function that, given an \(n \times k\) matrix \(X\) with linearly independent columns, returns an \(n \times k\) matrix \(U\) with orthonormal columns and the same column span. Test it by projecting a vector onto \(\Span X\) directly and via the orthonormal basis, and check that the two projections coincide.

Solutions

Exercise 1

If \(x \in S\) and \(x \in S^\perp\), then we have in particular that \(\langle x, x \rangle = 0\). But then \(x = 0\).

Exercise 2

Symmetry and idempotence of \(M\) and \(P\) can be established using standard rules for matrix algebra. The intuition behind idempotence of \(M\) and \(P\) is that both are orthogonal projections. After a point is projected into a given subspace, applying the projection again makes no difference. (A point inside the subspace is not shifted by orthogonal projection onto that space because it is already the closest point in the subspace to itself.)

Exercise 3

Here’s a function that computes the orthonormal vectors using the GS algorithm given in the lecture.

"""
Implements Gram-Schmidt orthogonalization.

Parameters
----------
X : an n x k array with linearly independent columns

Returns
-------
U : an n x k array with orthonormal columns

"""
function gram_schmidt(X)

    n, k = size(X)
    U = Array{Float64}(n, k)
    I = eye(n)

    # The first col of U is just the normalized first col of X
    v1 = X[:,1]
    U[:,1] = v1 / norm(v1)

    for i in 2:k
        # Set up
        b = X[:,i]        # The vector we're going to project
        Z = X[:, 1:i-1]   # first i-1 columns of X

        # Project onto the orthogonal complement of the col span of Z
        M = I - Z * inv(Z' * Z) * Z'
        u = M * b

        # Normalize
        U[:,i] = u / norm(u)
    end

    return U
end
gram_schmidt

Here are the arrays we’ll work with

y = [1 3 -3]'
X = [1 0; 0 -6; 2 2];

First let’s do ordinary projection of \(y\) onto the basis spanned by the columns of \(X\).

Py1 = X * inv(X' * X) * X' * y
3×1 Matrix{Float64}:
 -0.565217
  3.26087
 -2.21739

Now let’s orthogonalize first, using Gram–Schmidt:

U = gram_schmidt(X)
3×2 Matrix{Float64}:
 0.447214  -0.131876
 0.0       -0.989071
 0.894427   0.065938

Now we can project using the orthonormal basis and see if we get the same thing:

Py2 = U * U' * y
3×1 Matrix{Float64}:
 -0.565217
  3.26087
 -2.21739

The result is the same. To complete the exercise, we get an orthonormal basis by QR decomposition and project once more.

F = qr(X)
Q = Matrix(F.Q)   # "thin" Q, whose orthonormal columns span the col space of X
Py3 = Q * Q' * y
3×1 Matrix{Float64}:
 -0.565217
  3.26087
 -2.21739

Again, the result is the same