# Orthogonal Projection and its Applications¶

Contents

## Overview¶

Orthogonal projection is a cornerstone of vector space methods, with many diverse applications

This include, but are not limited to,

- Least squares and linear regression
- Conditional expectation
- Gram–Schmidt orthogonalization
- QR decomposition
- Orthogonal polynomials
- etc

In this lecture we focus on

- key results
- standard applications such as least squares regression

### Further Reading¶

For background and foundational concepts, see our lecture on linear algebra

For more proofs and greater theoretical detail, see A Primer in Econometric Theory

For a complete set of proofs in a general setting, see, for example, [Rom05]

For an advanced treatment of projection in the context of least squares prediction, see this book chapter

## Key Definitions¶

If \(x, z \in \RR^n\) and \(\langle x, z\rangle = 0\) then \(x\) and \(z\) are said to be **orthogonal**, and we write \(x \perp z\)

Given \(S \subset \RR^n\), we call \(x \in \RR^n\) **orthogonal to** \(S\) if \(x \perp z\) for all \(z \in S\), and write \(x \perp S\)

The **orthogonal complement** of linear subspace \(S\) is the set \(S^{\perp} := \{x \in \RR^n \,:\, x \perp S\}\)

Note that \(S^\perp\) is always a linear subspace of \(\RR^n\)

To see this, fix \(x, y \in S^{\perp}\) and \(\alpha, \beta \in \RR\)

Observe that if \(z \in S\), then

Hence \(\alpha x + \beta y \in S^{\perp}\), as was to be shown

A set of vectors \(\{x_1, \ldots, x_k\} \subset \RR^n\) is called an **orthogonal set** if \(x_i \perp x_j\) whenever \(i \not= j\)

If \(\{x_1, \ldots, x_k\}\) is an orthogonal set, then the **Pythagorean Law** states that

In the case of \(k=2\) this is easy to see, since orthogonality gives

## The Orthogonal Projection Theorem¶

The problem considered by the orthogonal projection theorem (OPT) is to find the closest approximation to an arbitrary vector from within a given linear subspace

The theorem, stated below, tells us that this problem always has a unique solution, and provides a very useful characterization

**Theorem** (OPT) Given \(y \in \RR^n\) and linear subspace \(S \subset \RR^n\),
there exists a unique solution to the minimization problem

Moreover, the solution \(\hat y\) is the unique vector in \(\RR^n\) such that

- \(\hat y \in S\)
- \(y - \hat y \perp S\)

The vector \(\hat y\) is called the **orthogonal projection** of \(y\) onto \(S\)

The next figure provides some intuition

### Proof of sufficiency¶

We’ll omit the full proof but let’s at least cover sufficiency of the conditions

To this end, let \(y \in \RR^n\) and let \(S\) be a linear subspace of \(\RR^n\)

Let \(\hat y\) be a vector in \(\RR^n\) such that \(\hat y \in S\) and \(y - \hat y \perp S\)

Letting \(z\) be any other point in \(S\) and using the fact that \(S\) is a linear subspace, we have

Hence \(\| y - z \| \geq \| y - \hat y \|\), which completes the proof

### Orthogonal Projection as a Mapping¶

Holding \(S\) fixed, we have a functional relationship

By the OPT, this is a well-defined mapping from \(\RR^n\) to \(\RR^n\)

In what follows it is denoted by \(P\)

- \(P y\) represents the projection \(\hat y\)
- We write \(P = \proj S\)

The operator \(P\) is called the **orthogonal projection mapping onto** \(S\)

It is immediate from the OPT that, for any \(y \in \RR^n\),

- \(P y \in S\) and
- \(y - P y \perp S\)

From this we can deduce additional properties, such as

- \(\| y \|^2 = \| P y \|^2 + \| y - P y \|^2\) and
- \(\| P y \| \leq \| y \|\)

For example, to prove 1, observe that \(y = P y + y - P y\) and apply the Pythagorean law

#### The Residual Projection¶

Here’s another version of the OPT

**Theorem**. If \(S\) is a linear subspace of \(\RR^n\), \(P = \proj S\) and \(M = \proj S^{\perp}\), then

The next figure illustrates

## Orthonormal Bases¶

An orthogonal set \(O \subset \RR^n\) is called an **orthonormal set** if \(\| u \| = 1\) for all \(u \in O\)

Let \(S\) be a linear subspace of \(\RR^n\) and let \(O \subset S\)

If \(O\) is orthonormal and \(\Span O = S\), then \(O\) is called an **orthonormal basis** of \(S\)

It is, necessarily, a basis of \(S\) (being independent by orthogonality and the fact that no element is the zero vector)

One example of an orthonormal set is the canonical basis \(\{e_1, \ldots, e_n\}\), which forms an orthonormal basis of \(\RR^n\)

If \(\{u_1, \ldots, u_k\}\) is an orthonormal basis of linear subspace \(S\), then we have

To see this, observe that since \(x \in \Span\{u_1, \ldots, u_k\}\), we can find scalars \(\alpha_1, \ldots, \alpha_k\) s.t.

Taking the inner product with respect to \(u_i\) gives

Combining this result with (1) verifies the claim

### Projection onto an Orthonormal Basis¶

When we have an orthonormal basis for the subspace we are projecting onto, computing the projection is straightforward:

**Theorem** If \(\{u_1, \ldots, u_k\}\) is an orthonormal basis for \(S\), then

Proof: Fix \(y \in \RR^n\) and let \(P y\) be as defined in (2)

Clearly, \(P y \in S\)

We claim that \(y - P y \perp S\) also holds

It sufficies to show that \(y - P y \perp\) any basis element (why?)

This is true because

## Projection Using Matrix Algebra¶

It is not too difficult to show that if \(S\) is any linear subspace of \(\RR^n\) and \(P = \proj S\), then \(P\) is a linear function from \(\RR^n\) to \(\RR^n\)

It follows that \(P = \proj S\) can be represented as a matrix

Below we use \(P\) for both the orthogonal projection mapping and the matrix that represents it

But what does the matrix look like?

**Theorem.** If \(P = \proj S\) and the columns of \(n \times k\) matrix \(X\) form a basis of \(S\), then

Proof: Given arbitrary \(y \in \RR^n\) and \(P = X (X'X)^{-1} X'\), our claim is that

- \(P y \in S\), and
- \(y - P y \perp S\)

Here 1 is true because

An expression of the form \(X a\) is precisely a linear combination of the columns of \(X\), and hence an element of \(S\)

On the other hand, 2 is equivalent to the statement

This is true: If \(b \in \RR^K\), then

The proof is now complete

### Starting with \(X\)¶

It is common in applications to start with \(n \times k\) matrix \(X\) with linearly independent columns and let

Then the columns of \(X\) form a basis of \(S\)

From the preceding theorem, \(P = X (X' X)^{-1} X'\) projects onto \(S\)

In this context, \(P = \proj S\) is often called the **projection matrix**

- The matrix \(M = I - P\) satisfies \(M = \proj(S^{\perp})\) and is sometimes called the
**annihilator**

### The Orthonormal Case¶

As a further illustration of the last result, suppose that \(U\) is \(n \times k\) with orthonormal columns

Let \(u_i := \col U_i\) for each \(i\), let \(S := \Span U\) and let \(y \in \RR^n\)

We know that the projection of \(y\) onto \(S\) is

Since \(U\) has orthonormal columns, we have \(U' U = I\)

Hence

We have recovered our earlier result about projecting onto the span of an orthonormal basis

### Application: Overdetermined Systems of Equations¶

Consider linear system \(X b = y\) where \(y \in \RR^n\) and \(X\) is \(n \times k\) with linearly independent columns

Given \(X\) and \(y\), we seek \(b \in \RR^k\) satisfying this equation

If \(n > k\) (more equations than unknowns), then the system is said to be **overdetermined**

Intuitively, we may not be able find a \(b\) that satisfies all \(n\) equations

The best approach here is to

- Accept that an exact solution may not exist
- Look instead for an approximate solution

By approximate solution, we mean a \(b \in \RR^k\) such that \(X b\) is as close to \(y\) as possible

The next theorem shows that the solution is well defined and unique

The proof is based around the OPT

**Theorem** The unique minimizer of \(\| y - X b \|\) over \(b \in \RR^K\) is

Proof: Note that

Since \(P y\) is the orthogonal projection onto \(\Span(X)\) we have

In other words,

This is what we aimed to show

## Least Squares Regression¶

Let’s apply the theory of orthogonal projection to least squares regression

This approach provides insight on many fundamental geometric and theoretical properties of linear regression

We treat only some of the main ideas

### The Setting¶

Here’s one way to introduce linear regression

Given pairs \((x, y) \in \RR^K \times \RR\), consider choosing \(f \colon \RR^K \to \RR\) to minimize
the **risk**

If probabilities and hence \(\EE\) are unknown, we cannot solve this problem directly

However, if a sample is available, we can estimate the risk with the **empirical risk**:

Minimizing this expression is called **empirical risk minimization**

The set \(\fF\) is sometimes called the hypothesis space

The theory of statistical learning tells us we should take it to be relatively simple to prevent overfitting

If we let \(\fF\) be the class of linear functions and drop the constant \(1/N\), the problem is

This is the **linear least squares problem**

### Solution¶

To switch to matrix notation, define

and

We assume throughout that \(N > K\) and \(X\) is full column rank

If you work through the algebra, you will be able to verify that \(\| y - X b \|^2 = \sum_{n=1}^N (y_n - b' x_n)^2\)

Since increasing transforms don’t affect minimizers we have

By our results above on overdetermined systems, the solution is

Let \(P\) and \(M\) be the projection and annihilator associated with \(X\):

The **vector of fitted values** is

The **vector of residuals** is

Here are some more standard definitions:

- The
**total sum of squares**is \(:= \| y \|^2\) - The
**sum of squared residuals**is \(:= \| \hat u \|^2\) - The
**explained sum of squares**is \(:= \| \hat y \|^2\)

It’s well known that TSS = ESS + SSR always holds

We can prove this easily using the OPT

From the OPT we have \(y = \hat y + \hat u\) and \(\hat u \perp \hat y\)

Applying the Pythagorean law completes the proof

Many other standards results about least squares regression follow easily from the OPT

## Orthogonalization and Decomposition¶

Let’s return to the connection between linear independence and orthogonality touched on above

The main result of interest is a famous algorithm for generating orthonormal sets from linearly independent sets

The next section gives details

### Gram-Schmidt Orthogonalization¶

**Theorem** For each linearly independent set \(\{x_1, \ldots, x_k\} \subset \RR^n\), there exists an
orthonormal set \(\{u_1, \ldots, u_k\}\) with

Construction uses the **Gram-Schmidt orthogonalization** procedure

One version of this procedure is as follows: For \(i = 1, \ldots, k\), set

- \(S_i := \Span\{x_1, \ldots, x_i\}\) and \(M_i := \proj S_i^{\perp}\)
- \(v_i := M_{i-1} x_i\) where \(M_0\) is the identity mapping
- \(u_i := v_i / \| v_i \|\)

The sequence \(u_1, \ldots, u_k\) has the stated properties

In the exercises below you are asked to implement this algorithm and test it using projection

### QR Decomposition¶

Here’s a well known result that uses the preceding algorithm to produce a useful decomposition

**Theorem** If \(X\) is \(n \times k\) with linearly independent columns, then there exists a factorization \(X = Q R\) where

- \(R\) is \(k \times k\), upper triangular and nonsingular
- \(Q\) is \(n \times k\), with orthonormal columns

Proof sketch: Let

- \(x_j := \col_j (X)\)
- \(\{u_1, \ldots, u_k\}\) be orthonormal with same span as \(\{x_1, \ldots, x_k\}\) (to be constructed using Gram–Schmidt)
- \(Q\) be formed from cols \(u_i\)

Since \(x_j \in \Span\{u_1, \ldots, u_j\}\), we have

Some rearranging gives \(X = Q R\)

### Linear Regression via QR Decomposition¶

For \(X\) and \(y\) as above we have \(\hat \beta = (X' X)^{-1} X' y\)

Using the QR decomposition \(X = Q R\) gives

Numerical routines would in this case use the alternative form \(R \hat \beta = Q' y\) and back substitution

## Exercises¶

### Exercise 1¶

Show that, for any linear subspace \(S \subset \RR^n\), we have \(S \cap S^{\perp} = \{0\}\)

### Exercise 2¶

Let \(P = X (X' X)^{-1} X'\) and let \(M = I - P\). Show that \(P\) and \(M\) are both idempotent and symmetric. Can you give any intuition as to why they should be idempotent?