Tuesday, December 11, 2012

Conditional Expectation as Projection

Introduction

In these pages, I have tried to distill and illustrate the key concepts needed in statistical signal processing, and in this section we cover the most fundamental statistical result that underpins the field. The last sections on conditional expectation and projection are prerequisites for what follows. Please review those before continuing.

Inner Product for Random Variables

From our previous work on projection for vectors in $\mathbb{R}^n$, we have a good geometric grasp on how projection is related to minimum mean squared error (MMSE). It turns out by one abstract step, we can carry all of our geometric interpretations to the space of random variables.

For example, we previously noted that at the point of projection, we had

$$ ( \mathbf{y} - \mathbf{v}_{opt} )^T \mathbf{v} = 0$$

which by noting the inner product slightly more abstractly as $\langle\mathbf{x},\mathbf{y} \rangle = \mathbf{x}^T \mathbf{y}$, we can express as

$$\langle \mathbf{y} - \mathbf{v}_{opt},\mathbf{v} \rangle = 0 $$

and, in fact, by defining the inner product for the random variables $X$ and $Y$ as

$$ \langle X,Y \rangle = \mathbb{E}(X Y)$$

we remarkably have the same relationship:

$$\langle X-h_{opt}(Y),Y \rangle = 0 $$

which holds not for vectors in $\mathbb{R}^n$, but for random variables $X$ and $Y$ and functions of those random variables. Exactly why this is true is technical, but it turns out that one can build up the entire theory of probability this way, using the expectation as an inner product (see Nelson, 1987).

Furthermore, by abstracting out the inner product concept, we have drawn a clean line between MMSE optimization problems, geometry, and random variables. That's a lot of mileage to get out of one abstraction, and it is key to everything we pursue in statistical signal processing, because now we can shift between these interpretations to address real problems. Soon, we'll see how to do this with some examples, but first we will collect one staggering result that flows naturally from this abstraction.

Conditional Expectation as Projection

Previously, we noted that the conditional expectation is the minimum mean squared error (MMSE) solution to the following problem:

$$ \min_h \int_{\mathbb{R}^2} (x - h(y) )^2 f_{X,Y}(x,y) \, dx \, dy $$

with the minimizing $h_{opt}(Y) $ as

$$ h_{opt}(Y) = \mathbb{E}(X|Y) $$

which is another way of saying that among all possible functions $h(Y)$, the one that minimizes the MSE is $ \mathbb{E}(X|Y)$ (see appendix for a quick proof). From our discussion on projection, we noted that these MMSE solutions can be thought of as projections onto a subspace that characterizes $Y$. For example, we previously noted that at the point of projection, we have

$$\langle X-h_{opt}(Y),Y \rangle = 0 $$

but since we know that the MMSE solution

$$ h_{opt}(Y) = \mathbb{E}(X|Y) $$

we have by direct substitution,

$$ \mathbb{E}\big( (X-\mathbb{E}(X|Y))\, Y \big) = 0$$

That last step seems pretty innocuous, but it is the step that ties MMSE to conditional expectation to the inner product abstraction, and in so doing, reveals the conditional expectation to be a projection operator for random variables. Before we develop this further, let's grab some quick dividends.

From the previous equation, by linearity of the expectation, we may obtain,

$$ \mathbb{E}(X Y) = \mathbb{E}(Y \mathbb{E}(X|Y))$$

which we could have found by using the formal definition of conditional expectation,

\begin{equation} \mathbb{E}(X|Y) = \int_{\mathbb{R}} x \frac{f_{X,Y}(x,y)}{f_Y(y)} \, dx \end{equation}

and direct integration,

$$ \mathbb{E}(Y \mathbb{E}(X|Y))= \int_{\mathbb{R}} y \int_{\mathbb{R}} x \frac{f_{X,Y}(x,y)}{f_Y(y)} f_Y(y) dx dy =\int_{\mathbb{R}^2} x y f_{X,Y}(x,y) dx dy =\mathbb{E}( X Y) $$

which is good to know, but not very geometrically intuitive. And this lack of geometric intuition makes it hard to apply these concepts and keep track of these relationships.

We can keep pursuing this analogy and obtain the squared length of the error term as

$$ \langle X-h_{opt}(Y),X-h_{opt}(Y) \rangle = \langle X,X \rangle - \langle h_{opt}(Y),h_{opt}(Y) \rangle $$

and then by substituting all the notation we obtain

\begin{equation} \mathbb{E}\big( (X- \mathbb{E}(X|Y))^2 \big) = \mathbb{E}(X^2) - \mathbb{E}\big( \mathbb{E}(X|Y)^2 \big)
\end{equation}

which would be tough to compute by direct integration.

We recognize that $\mathbb{E}(X|Y)$ is in fact a projection operator. Recall that we previously noted that a projection operator is idempotent, which means that once we project something onto a subspace, further projections do nothing. In the space of random variables, $\mathbb{E}(\cdot|Y)$ is the idempotent projection, as we can show by noting that

$$ h_{opt} = \mathbb{E}(X|Y)$$

is purely a function of $Y$, so that

$$ \mathbb{E}(h_{opt}(Y)|Y) = h_{opt}(Y) $$

since, given $Y$, the function $h_{opt}(Y)$ is already determined, which is the statement of idempotency. Thus, conditional expectation is the corresponding projection operator in this space of random variables. With this happy fact, we can continue to carry over our geometric interpretations of projections for vectors ($\mathbf{v}$) to random variables ($X$).

Now that we have just stuffed our toolbox, let's consider some example conditional expectations obtained by using brute force to find the optimal MMSE function $h_{opt}$ as well as by using the definition of the conditional expectation.

Example

Suppose we have a random variable $X$. What constant is closest to $X$ in the mean-squared-error (MSE) sense? In other words, which $c$ minimizes the following:

$$ J = \mathbb{E}\big( (X - c)^2 \big) $$

We can work this out as

$$ \mathbb{E}\big( (X - c)^2 \big) = \mathbb{E}(c^2 - 2 c X + X^2) = c^2-2 c \mathbb{E}(X) + \mathbb{E}(X^2) $$

and then take the first derivative with respect to $c$ and set it to zero,

$$ \frac{dJ}{dc} = 2 c - 2 \mathbb{E}(X) = 0 $$

which gives

$$ c_{opt}=\mathbb{E}(X) $$

Remember that $X$ can take on all kinds of values, but this says that the closest number to $X$ in the MSE sense is $\mathbb{E}(X)$.

Coming at this same problem using our inner product, we know that at the point of projection

$$ \mathbb{E}((X-c_{opt}) 1) = 0$$

where the $1$ represents the space of constants (i.e. multiples $c \cdot 1$) onto which we are projecting. Then, by linearity of the expectation, this gives

$$ c_{opt}=\mathbb{E}(X) $$

Because $\mathbb{E}(X|Y)$ is the projection operator, with $Y=\Omega$ (the entire underlying probability space), we have, using the definition of conditional expectation:

$$ \mathbb{E}(X|Y=\Omega) = \mathbb{E}(X) $$

Thus, we just worked the same problem three ways (optimization, inner product, projection).

Example

Let's consider the following example with joint probability density $f_{X,Y}(x,y) = x + y$ where $(x,y) \in [0,1]^2$ and compute the conditional expectation straight from the definition:

$$ \mathbb{E}(X|Y) = \int_0^1 x \frac{f_{X,Y}(x,y)}{f_Y(y)} dx= \int_0^1 x \frac{x+y}{y+1/2} dx =\frac{3 y + 2}{6 y + 3} $$

That was pretty easy because the density function was so simple. Now, let's do it the hard way by going directly for the MMSE solution $h(Y)$. Then,

$$ \min_h \int_0^1\int_0^1 (x - h(y) )^2 f_{X,Y}(x,y)\,dx\, dy = \min_h \int_0^1 \left( y h^{2}(y) - y h(y) + \frac{1}{3} y + \frac{1}{2} h^{2}(y) - \frac{2}{3} h(y) + \frac{1}{4} \right) dy $$

Now we have to find a function $h$ that is going to minimize this. Solving for a function, as opposed to solving for a number, is generally very, very hard, but because we are integrating over a finite interval, we can use the Euler-Lagrange method from variational calculus to take the derivative of the integrand with respect to the function $h(y)$ and set it to zero. Euler-Lagrange methods will be the topic of a later section, but for now we just want the result, namely,

$$ 2 y h(y) - y + h(y) - \frac{2}{3} =0 $$

Solving this gives

$$ h_{opt}(y)= \frac{3 y + 2}{6 y + 3} $$

Finally, we can try solving this using our inner product as

$$ \mathbb{E}( (X - h(Y)) Y ) = 0$$

Writing this out gives,

$$ \int_0^1 \int_0^1 (x-h(y))\, y\, (x+y)\, dx\, dy = \int_0^1 \frac{y\,\left(2 + 3\,y - 3\,\left( 1 + 2\,y \right) \,h(y)\right) }{6}\, dy = 0$$

and because the same orthogonality condition holds with any function $g(Y)$ in place of $Y$, the integrand must be zero for every $y$,

$$ 2 y + 3 y^2 - 3 y h(y) - 6 y^2 h(y)=0 $$

and solving this for $h(y)$ gives the same solution:

$$ h_{opt}(y)= \frac{3 y + 2}{6 y + 3} $$

Thus, doing it by the definition, optimization, or inner product gives us the same answer; but, in general, no one method is necessarily easier, because each involves potentially difficult or impossible integration, optimization, or functional-equation solving. The point is that now that we have a deep toolbox, we can pick and choose which tools we want to apply for different problems.

Before we leave this example, let's use sympy to verify the Pythagorean relationship for the squared length of the error term that we found earlier:

\begin{equation} \mathbb{E}\big( (X- \mathbb{E}(X|Y))^2 \big) = \mathbb{E}(X^2) - \mathbb{E}\big( \mathbb{E}(X|Y)^2 \big)
\end{equation}

In [1]:
from sympy.abc import y,x
from sympy import integrate, simplify

fxy = x + y                 # joint density
fy = integrate(fxy,(x,0,1)) # marginal density
fx = integrate(fxy,(y,0,1)) # marginal density

h = (3*y+2)/(6*y+3) # conditional expectation
LHS=integrate((x - h)**2 *fxy, (x,0,1),(y,0,1)) # from the definition
RHS=integrate( (x)**2 *fx, (x,0,1)) - integrate( (h)**2 *fy, (y,0,1)) # using Pythagorean theorem
print(simplify(LHS-RHS)==0)
True

Summary

In this section, we have pulled together all the projection and least-squares optimization ideas from the previous posts to draw a clean line between our geometric notions of projection from vectors in $\mathbb{R}^n$ to general random variables. This resulted in the remarkable realization that the conditional expectation is in fact a projection operator for random variables. The key idea is that because we have these relationships, we can approach difficult problems in multiple ways, depending on which way is more intuitive or tractable in a particular situation. In these pages, we will again and again come back to these intuitions because they form the backbone of statistical signal processing.

In the next section, we will have a lot of fun with these ideas working out some examples that are usually solved using more general tools from measure theory.

Note that the book by Mikosch (1998) has some excellent sections covering much of this material in more detail. Mikosch has a very geometric view of the material as well.

Appendix

Proof of MMSE of Conditional expectation by Optimization

$$ \min_h \int_{\mathbb{R}^2} (x - h(y))^2 f_{X,Y}(x,y)\, dx\, dy $$

$$ \min_h \int_{\mathbb{R}^2} x^2 f_{X,Y}(x,y)\, dx\, dy + \int_{\mathbb{R}^2} h(y)^2 f_{X,Y}(x,y)\, dx\, dy - 2\int_{\mathbb{R}^2} x\, h(y) f_{X,Y}(x,y)\, dx\, dy $$

The first term does not depend on $h$, so minimizing over $h$ amounts to making the cross term as large as possible relative to the quadratic term in $h$. Consider that cross term:

$$ \max_h \int_{\mathbb{R}^2} x\, h(y) f_{X,Y}(x,y)\, dx\, dy $$

Breaking up the integral using the factorization $f_{X,Y}(x,y) = f_{X|Y}(x|y) f_Y(y)$ gives

$$ \max_h \int_{\mathbb{R}} \left(\int_{\mathbb{R}} x f_{X|Y}(x|y)\, dx \right) h(y) f_Y(y)\, dy $$

$$ \max_h \int_{\mathbb{R}} \mathbb{E}(X|Y=y)\, h(y) f_Y(y)\, dy $$

From the properties of the Cauchy-Schwarz inequality, we know that this maximum happens when $h_{opt}(Y) = \mathbb{E}(X|Y)$, so we have found the optimal $h(Y)$ function as

$$ h_{opt}(Y) = \mathbb{E}(X|Y)$$

which shows that the optimal function is the conditional expectation.

References

This post was created using the nbconvert utility from the source IPython Notebook which is available for download from the main github site for this blog.

Bibliography

  • Nelson, Edward. Radically Elementary Probability Theory (AM-117). Vol. 117. Princeton University Press, 1987.

  • Mikosch, Thomas. Elementary Stochastic Calculus with Finance in View. Vol. 6. World Scientific Publishing Company, 1998.

Friday, November 30, 2012

Projection in Multiple Dimensions

In this section, we extend from the one-dimensional subspace to the more general two-dimensional subspace. This means that there are two vectors, $\mathbf{v}_1$ and $\mathbf{v}_2$, that are not collinear and that span the subspace. In the previous case, we had only one vector ($\mathbf{v}$), so we had a one-dimensional subspace, but now that we have two vectors, we have a two-dimensional subspace (i.e. a plane). The extension from the two-dimensional subspace to the n-dimensional subspace follows the same argument but introduces more notation than we need, so we'll stick with the two-dimensional case for a while. For the two-dimensional case, the optimal MMSE solution has the form

$$ \hat{\mathbf{y}} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 \in \mathbb{R}^m $$

where $\mathbf{y} \in \mathbb{R}^m$. We want to project this vector onto the subspace spanned by the two $m$-dimensional vectors $\mathbf{v}_i$. Here, the orthogonality requirement extends as

$$ \langle \mathbf{y} - \alpha_1 \mathbf{v}_1 -\alpha_2 \mathbf{v}_2 , \mathbf{v}_1\rangle= 0 $$

and

$$ \langle \mathbf{y} - \alpha_1 \mathbf{v}_1 -\alpha_2 \mathbf{v}_2 , \mathbf{v}_2\rangle= 0 $$

Recall that for vectors, we have

$$ \langle \mathbf{x} , \mathbf{y}\rangle = \mathbf{x}^T \mathbf{y} \in \mathbb{R}$$

This leads to the linear system of equations:

$$ \begin{eqnarray} \langle \mathbf{y}, \mathbf{v}_1\rangle = & \alpha_1 \langle \mathbf{v}_1, \mathbf{v}_1\rangle & +\alpha_2 \langle \mathbf{v}_1, \mathbf{v}_2\rangle \\ \langle \mathbf{y}, \mathbf{v}_2\rangle = & \alpha_1 \langle \mathbf{v}_1, \mathbf{v}_2\rangle & +\alpha_2 \langle \mathbf{v}_2, \mathbf{v}_2\rangle \end{eqnarray} $$

which can be written in matrix form as

$$ \left[ \begin{array}{c} \langle \mathbf{y}, \mathbf{v}_1\rangle \\ \langle \mathbf{y}, \mathbf{v}_2\rangle \\ \end{array} \right] = \left[ \begin{array}{cc} \langle \mathbf{v}_1, \mathbf{v}_1\rangle & \langle \mathbf{v}_1, \mathbf{v}_2\rangle \\ \langle \mathbf{v}_1, \mathbf{v}_2\rangle & \langle \mathbf{v}_2, \mathbf{v}_2\rangle \\ \end{array} \right] \left[ \begin{array}{c} \alpha_1 \\ \alpha_2 \\ \end{array} \right]$$

which can be further reduced by stacking the columns into

$$ \mathbf{V} = \left[ \mathbf{v}_1, \mathbf{v}_2 \right] \in \mathbb{R}^{m \times 2} $$

and

$$ \boldsymbol{\alpha}= \left[ \alpha_1, \alpha_2\right]^T \in \mathbb{R}^{2}$$

which gives

$$ \mathbf{V}^T \mathbf{y} = (\mathbf{V}^T \mathbf{V}) \boldsymbol{\alpha} $$

Note that by writing this using vector notation, we have implicitly generalized beyond two dimensions, since there is nothing to stop us from stacking $\mathbf{V}$ with more column vectors to create a larger subspace. Solving for the coefficients, we obtain

$$ \boldsymbol{\alpha} = (\mathbf{V}^T \mathbf{V})^{-1} \mathbf{V}^T \mathbf{y} $$

and so the optimal solution is then,

$$ \hat{\mathbf{y}} = \mathbf{V} \boldsymbol{\alpha} \in \mathbb{R}^m $$

Note that the existence of the inverse is guaranteed by the linear independence (non-collinearity) of the $\mathbf{v}_i$ vectors. Whether or not that inverse is numerically stable is another issue.

Then, we can combine these to obtain

$$ \hat{\mathbf{y}} = \mathbf{V} (\mathbf{V}^T \mathbf{V})^{-1} \mathbf{V}^T \mathbf{y} $$

which then makes the projection operator for this case:

$$ \mathbf{P}_{V}= \mathbf{V} (\mathbf{V}^T \mathbf{V})^{-1} \mathbf{V}^T \in \mathbb{R}^{m \times m} $$

As a quick check, we can see that this reduces to the one-dimensional case by setting

$$ \mathbf{V}= \mathbf{v} \in \mathbb{R}^m$$

so then,

$$ \mathbf{P}_{v}= \mathbf{v} \frac{1}{\mathbf{v}^T \mathbf{v}} \mathbf{v}^T $$

which matches our previous result. The point of all these manipulations is that we can construct another projection operator with all the MMSE properties we had before, but now in a bigger subspace. We can further verify the idempotent property of projection matrices by checking that

$$ \mathbf{P}_V \mathbf{P}_V = \mathbf{P}_V$$

The following graphic shows that when we project the three-dimensional vector $\mathbf{y}$ onto the plane spanned by the two $\mathbf{v}_i$ vectors, the MMSE solution is the point where the sphere centered at $\mathbf{y}$ is tangent to the plane. That point of tangency is $\hat{\mathbf{y}}$.

In [2]:
#http://stackoverflow.com/questions/10374930/matplotlib-annotating-a-3d-scatter-plot

from mpl_toolkits.mplot3d import proj3d
import matplotlib.pyplot as plt
import numpy as np
from numpy import matrix        # used for V and y below
from numpy.linalg import inv    # used to form the projection operator

fig = plt.figure()
fig.set_size_inches([8,8])

ax = fig.add_subplot(111, projection='3d')

ax.set_aspect(1)
ax.set_xlim([0,2])
ax.set_ylim([0,2])
ax.set_zlim([0,2])
ax.set_xlabel('x-axis',fontsize=16)
ax.set_ylabel('y-axis',fontsize=16)
ax.set_zlabel('z-axis',fontsize=16)

y = matrix([1,1,1]).T 
V = matrix([[1,0.25], # columns are v_1, v_2
            [0,0.50],
            [0,0.00]])

alpha=inv(V.T*V)*V.T*y # optimal coefficients
P = V*inv(V.T*V)*V.T
yhat = P*y         # approximant


u = np.linspace(0, 2*np.pi, 100)
v = np.linspace(0, np.pi, 100)

xx = np.outer(np.cos(u), np.sin(v))
yy = np.outer(np.sin(u), np.sin(v))
zz = np.outer(np.ones(np.size(u)), np.cos(v))

sphere=ax.plot_surface(xx+y[0,0], yy+y[1,0], zz+y[2,0],  
                       rstride=4, cstride=4, color='gray',alpha=0.3,lw=0.25)

ax.plot3D([y[0,0],0],[y[1,0],0],[y[2,0],0],'r-',lw=3)
ax.plot3D([y[0,0]],[y[1,0]],[y[2,0]],'ro')

ax.plot3D([V[0,0],0],[V[1,0],0],[V[2,0],0],'b-',lw=3)
ax.plot3D([V[0,0]],[V[1,0]],[V[2,0]],'bo')
ax.plot3D([V[0,1],0],[V[1,1],0],[V[2,1],0],'b-',lw=3)
ax.plot3D([V[0,1]],[V[1,1]],[V[2,1]],'bo')

ax.plot3D([yhat[0,0],0],[yhat[1,0],0],[yhat[2,0],0],'g--',lw=3)
ax.plot3D([yhat[0,0]],[yhat[1,0]],[yhat[2,0]],'go')


x2, y2, _ = proj3d.proj_transform(y[0,0],y[1,0],y[2,0], ax.get_proj())
ax.annotate(
    "$\mathbf{y}$", 
    xy = (x2, y2), xytext = (-20, 20), fontsize=24,
    textcoords = 'offset points', ha = 'right', va = 'bottom',
    bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

x2, y2, _ = proj3d.proj_transform(yhat[0,0],yhat[1,0],yhat[2,0], ax.get_proj())
ax.annotate(
    "$\hat{\mathbf{y}}$", 
    xy = (x2, y2), xytext = (-40, 10), fontsize=24,
    textcoords = 'offset points', ha = 'right', va = 'bottom',
    bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

x2, y2, _ = proj3d.proj_transform(V[0,0],V[1,0],V[2,0], ax.get_proj())
ax.annotate(
    "$\mathbf{v}_1$", 
    xy = (x2, y2), xytext = (120, 10), fontsize=24,
    textcoords = 'offset points', ha = 'right', va = 'bottom',
    bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

x2, y2, _ = proj3d.proj_transform(V[0,1],V[1,1],V[2,1], ax.get_proj())
ax.annotate(
    "$\mathbf{v}_2$", 
    xy = (x2, y2), xytext = (-30, 30), fontsize=24,
    textcoords = 'offset points', ha = 'right', va = 'bottom',
    bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show()

Weighted Distances

As before, we can easily extend this projection operator to cases where the measure of distance between $\mathbf{y}$ and the subspace spanned by the $\mathbf{v}_i$ is weighted (i.e. non-uniform). We can accommodate these weighted distances by re-writing the projection operator as

$$ \mathbf{P}_{V} = \mathbf{V} ( \mathbf{V}^T \mathbf{Q} \mathbf{V})^{-1} \mathbf{V}^T \mathbf{Q} $$

where $\mathbf{Q}$ is a positive definite matrix. Earlier, we started with a point $\mathbf{y}$ and inflated a sphere centered at $\mathbf{y}$ until it just touched the plane defined by the $\mathbf{v}_i$, and that point of tangency was the closest point on the subspace to $\mathbf{y}$. The general case with a weighted distance works the same way, except now we inflate an ellipsoid, not a sphere, until the ellipsoid touches the plane.

The code and figure below illustrate what happens using the weighted $\mathbf{P}_V$. It is basically the same code we used above. You can download the IPython notebook corresponding to this post and try different values on the diagonal of $\mathbf{Q}$.

In [9]:
from mpl_toolkits.mplot3d import proj3d
import matplotlib.pyplot as plt
import numpy as np
from numpy import matrix, squeeze, split, tensordot, dstack
from numpy.linalg import inv

fig = plt.figure()
fig.set_size_inches([8,8])

ax = fig.add_subplot(111, projection='3d')
ax.set_xlim([0,2])
ax.set_ylim([0,2])
ax.set_zlim([0,2])
ax.set_aspect(1)
ax.set_xlabel('x-axis',fontsize=16)
ax.set_ylabel('y-axis',fontsize=16)
ax.set_zlabel('z-axis',fontsize=16)

y = matrix([1,1,1]).T 
V = matrix([[1,0.25], # columns are v_1, v_2
            [0,0.50],
            [0,0.00]])

Q = matrix([[1,0,0],
            [0,2,0],
            [0,0,3]])

P = V*inv(V.T*Q*V)*V.T*Q
yhat = P*y         # approximant


u = np.linspace(0, 2*np.pi, 100)
v = np.linspace(0, np.pi, 100)

xx = np.outer(np.cos(u), np.sin(v))
yy = np.outer(np.sin(u), np.sin(v))
zz = np.outer(np.ones(np.size(u)), np.cos(v))

xx,yy,zz=map(squeeze,split(tensordot(dstack([xx,yy,zz]),Q,axes=1),3,axis=2)) # map the sphere through Q to get the ellipsoid

ellipsoid=ax.plot_surface(xx+y[0,0], yy+y[1,0], zz+y[2,0],  
                           rstride=4, cstride=4, color='gray',alpha=0.3,lw=0.25)

ax.plot3D([y[0,0],0],[y[1,0],0],[y[2,0],0],'r-',lw=3)
ax.plot3D([y[0,0]],[y[1,0]],[y[2,0]],'ro')

ax.plot3D([V[0,0],0],[V[1,0],0],[V[2,0],0],'b-',lw=3)
ax.plot3D([V[0,0]],[V[1,0]],[V[2,0]],'bo')
ax.plot3D([V[0,1],0],[V[1,1],0],[V[2,1],0],'b-',lw=3)
ax.plot3D([V[0,1]],[V[1,1]],[V[2,1]],'bo')

ax.plot3D([yhat[0,0],0],[yhat[1,0],0],[yhat[2,0],0],'g--',lw=3)
ax.plot3D([yhat[0,0]],[yhat[1,0]],[yhat[2,0]],'go')

x2, y2, _ = proj3d.proj_transform(y[0,0],y[1,0],y[2,0], ax.get_proj())
ax.annotate(
    "$\mathbf{y}$", 
    xy = (x2, y2), xytext = (-20, 20), fontsize=24,
    textcoords = 'offset points', ha = 'right', va = 'bottom',
    bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

x2, y2, _ = proj3d.proj_transform(yhat[0,0],yhat[1,0],yhat[2,0], ax.get_proj())
ax.annotate(
    "$\hat{\mathbf{y}}$", 
    xy = (x2, y2), xytext = (40, 30), fontsize=24,
    textcoords = 'offset points', ha = 'right', va = 'bottom',
    bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

x2, y2, _ = proj3d.proj_transform(V[0,0],V[1,0],V[2,0], ax.get_proj())
ax.annotate(
    "$\mathbf{v}_1$", 
    xy = (x2, y2), xytext = (120, 10), fontsize=24,
    textcoords = 'offset points', ha = 'right', va = 'bottom',
    bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

x2, y2, _ = proj3d.proj_transform(V[0,1],V[1,1],V[2,1], ax.get_proj())
ax.annotate(
    "$\mathbf{v}_2$", 
    xy = (x2, y2), xytext = (-30, 30), fontsize=24,
    textcoords = 'offset points', ha = 'right', va = 'bottom',
    bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show()

Summary

In this section, we extended the concept of a projection operator beyond one dimension and showed the corresponding geometric concepts that tie the projection operator to MMSE problems in more than one dimension.

References

This post was created using the nbconvert utility from the source IPython Notebook which is available for download from the main github site for this blog. The projection concept is masterfully discussed in the classic Strang, G. (2003). Introduction to Linear Algebra. Wellesley-Cambridge Press. Also, some of Dr. Strang's excellent lectures are available on MIT Courseware. I highly recommend these as well as the book.

Appendix

Below is some extra code to handle the more general case where there is a rotation as well as a weighting of the axes. Note the projection operator is constructed exactly the same way.

In [15]:
from mpl_toolkits.mplot3d import proj3d
import matplotlib.pyplot as plt
import numpy as np
from numpy import matrix, pi, cos, sin, squeeze, split, tensordot, dstack
from numpy.linalg import inv

fig = plt.figure()
fig.set_size_inches([8,8])

ax = fig.add_subplot(111, projection='3d')
ax.set_xlim([0,2])
ax.set_ylim([0,2])
ax.set_zlim([0,2])
ax.set_aspect(1)
ax.set_xlabel('x-axis',fontsize=16)
ax.set_ylabel('y-axis',fontsize=16)
ax.set_zlabel('z-axis',fontsize=16)

y = matrix([1,1,1]).T 
V = matrix([[1,0.25], # columns are v_1, v_2
            [0,0.50],
            [0,0.00]])

def rotation_matrix(angle,axis='z'):
    # rotation by `angle` degrees about the given coordinate axis
    angle = angle/180.*pi
    if axis=='z':
        return matrix([[cos(angle),sin(angle),0],
                       [-sin(angle),cos(angle),0],
                       [0,0,1]])
    elif axis=='y':
        return matrix([[cos(angle),0,-sin(angle)],
                       [0,1,0],
                       [sin(angle),0,cos(angle)]])
    elif axis=='x':
        return matrix([[1,0,0],
                       [0,cos(angle),sin(angle)],
                       [0,-sin(angle),cos(angle)]])
          
S = matrix([[3,0,0],
            [0,2,0],
            [0,0,1]])

R = rotation_matrix(30)*rotation_matrix(30,'x')*rotation_matrix(40,'y')

Q = R.T*S*R # apply 3-D rotations

P = V*inv(V.T*Q*V)*V.T*Q # build projection matrix
yhat = P*y         # approximant


u = np.linspace(0, 2*np.pi, 100)
v = np.linspace(0, np.pi, 100)

xx = np.outer(np.cos(u), np.sin(v))
yy = np.outer(np.sin(u), np.sin(v))
zz = np.outer(np.ones(np.size(u)), np.cos(v))

xx,yy,zz=map(squeeze,split(tensordot(dstack([xx,yy,zz]),Q,axes=1),3,axis=2)) # map the sphere through Q to get the ellipsoid

ellipsoid=ax.plot_surface(xx+y[0,0], yy+y[1,0], zz+y[2,0],  
                           rstride=4, cstride=4, color='gray',alpha=0.3,lw=0.25)

ax.plot3D([y[0,0],0],[y[1,0],0],[y[2,0],0],'r-',lw=3)
ax.plot3D([y[0,0]],[y[1,0]],[y[2,0]],'ro')

ax.plot3D([V[0,0],0],[V[1,0],0],[V[2,0],0],'b-',lw=3)
ax.plot3D([V[0,0]],[V[1,0]],[V[2,0]],'bo')
ax.plot3D([V[0,1],0],[V[1,1],0],[V[2,1],0],'b-',lw=3)
ax.plot3D([V[0,1]],[V[1,1]],[V[2,1]],'bo')

ax.plot3D([yhat[0,0],0],[yhat[1,0],0],[yhat[2,0],0],'g--',lw=3)
ax.plot3D([yhat[0,0]],[yhat[1,0]],[yhat[2,0]],'go')

x2, y2, _ = proj3d.proj_transform(y[0,0],y[1,0],y[2,0], ax.get_proj())
ax.annotate(
    "$\mathbf{y}$", 
    xy = (x2, y2), xytext = (-20, 20), fontsize=24,
    textcoords = 'offset points', ha = 'right', va = 'bottom',
    bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

x2, y2, _ = proj3d.proj_transform(yhat[0,0],yhat[1,0],yhat[2,0], ax.get_proj())
ax.annotate(
    "$\hat{\mathbf{y}}$", 
    xy = (x2, y2), xytext = (40, 30), fontsize=24,
    textcoords = 'offset points', ha = 'right', va = 'bottom',
    bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

x2, y2, _ = proj3d.proj_transform(V[0,0],V[1,0],V[2,0], ax.get_proj())
ax.annotate(
    "$\mathbf{v}_1$", 
    xy = (x2, y2), xytext = (120, 10), fontsize=24,
    textcoords = 'offset points', ha = 'right', va = 'bottom',
    bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

x2, y2, _ = proj3d.proj_transform(V[0,1],V[1,1],V[2,1], ax.get_proj())
ax.annotate(
    "$\mathbf{v}_2$", 
    xy = (x2, y2), xytext = (-30, 30), fontsize=24,
    textcoords = 'offset points', ha = 'right', va = 'bottom',
    bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
    arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show()