Monday, January 14, 2013

Conditional Expectation for Gaussian Variables

Introduction

By this point, we have developed many tools to deal with computing the conditional expectation. In this section, we discuss a bizarre and amazing coincidence regarding Gaussian random variables and linear projection, a coincidence that is the basis for most of statistical signal processing.

Conditional Expectation by Optimization

Now, let's consider the important case of the zero-mean bivariate Gaussian and try to find a function $h$ that minimizes the mean squared error (MSE). Again, trying to solve for the conditional expectation by minimizing the error over all possible functions $h$ is generally very, very hard. One alternative is to use parameters for the $h$ function and then just optimize over those. For example, we could assume that $h(Y)= \alpha Y$ and then use calculus to find the $\alpha$ parameter.

Let's try this with the zero-mean bivariate Gaussian density,

$$\mathbb{E}((X-\alpha Y )^2) = \mathbb{E}(\alpha^2 Y^2 - 2 \alpha X Y + X^2 )$$

and then differentiate this with respect to $\alpha$ to obtain

$$\mathbb{E}(2 \alpha Y^2 - 2 X Y ) = 2 \alpha \sigma_y^2 - 2 \mathbb{E}(XY) = 0$$

Then, solving for $\alpha$ gives

$$ \alpha = \frac{ \mathbb{E}(X Y)}{ \sigma_y^2 } $$

which means that

\begin{equation} \mathbb{ E}(X|Y) \approx \alpha Y = \frac{ \mathbb{E}(X Y )}{ \sigma_Y^2 } Y =\frac{\sigma_{X Y}}{ \sigma_Y^2 } Y \end{equation}

where that last equality is just notation. Remember here we assumed a special linear form for $h=\alpha Y$, but we did that for convenience. We still don't know whether or not this is the one true $h_{opt}$ that minimizes the MSE for all such functions.
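Before moving on, we can let sympy redo this little calculus exercise as a check. This is just a sketch that treats the second moments $\sigma_{xy}=\mathbb{E}(XY)$ and $\sigma_Y^2=\mathbb{E}(Y^2)$ as given symbols; the $\mathbb{E}(X^2)$ term is constant in $\alpha$ and drops out after differentiation.

In [ ]:
import sympy as S

alpha = S.Symbol('alpha')
sigma_xy = S.Symbol('sigma_xy', real=True)
sigma_y = S.Symbol('sigma_y', positive=True)

# E((X - alpha*Y)^2) expanded by linearity of expectation; the E(X^2) term
# is omitted because it does not depend on alpha
mse = alpha**2*sigma_y**2 - 2*alpha*sigma_xy

print S.solve(S.diff(mse, alpha), alpha)  # -> [sigma_xy/sigma_y**2]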

Conditional Expectation Using Direct Integration

Now, let's try this again by computing $ \mathbb{E}(X|Y)$ in the case of the bivariate Gaussian distribution straight from the definition.

\begin{equation} \mathbb{E}(X|Y) = \int_{\mathbb{ R}} x \frac{f_{X,Y}(x,y)}{f_Y(y)} dx \end{equation}

where

$$ f_{X,Y}(x,y) = \frac{1}{2\pi |\mathbf{R}|^{\frac{1}{2}}} e^{-\frac{1}{2} \mathbf{v}^T \mathbf{R}^{-1} \mathbf{v} } $$

and where

$$ \mathbf{v}= \left[ x,y \right]^T$$

$$ \mathbf{R} = \left[ \begin{array}{cc} \sigma_{x}^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_{y}^2 \\ \end{array} \right] $$

and with

\begin{eqnarray} \sigma_{xy} &=& \mathbb{E}(XY) \nonumber \\ \sigma_{x}^2 &=& \mathbb{E}(X^2) \nonumber \\ \sigma_{y}^2 &=& \mathbb{E}(Y^2) \nonumber
\end{eqnarray}
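For reference, the determinant and inverse that appear in the exponent work out to

$$ |\mathbf{R}| = \sigma_x^2 \sigma_y^2 - \sigma_{xy}^2 , \hspace{1em} \mathbf{R}^{-1} = \frac{1}{\sigma_x^2 \sigma_y^2 - \sigma_{xy}^2} \left[ \begin{array}{cc} \sigma_{y}^2 & -\sigma_{xy} \\ -\sigma_{xy} & \sigma_{x}^2 \\ \end{array} \right] $$

which is why the combination $\sigma_x^2 \sigma_y^2 - \sigma_{xy}^2$ shows up below as the substitution hint ($u$) that we give sympy.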

This conditional expectation is a tough integral to evaluate by hand, so we'll do it with sympy.

In [3]:
from sympy import Matrix, Symbol, exp, pi, sqrt, simplify, integrate, oo, use
from sympy import stats

x = Symbol('x', real=True)
y = Symbol('y', real=True)
sigma_x = Symbol('sigma_x', positive=True)
sigma_y = Symbol('sigma_y', positive=True)
sigma_xy = Symbol('sigma_xy', real=True)

fyy = stats.density(stats.Normal('y', 0, sigma_y))(y)  # marginal density of Y

R = Matrix([[sigma_x**2, sigma_xy],
            [sigma_xy, sigma_y**2]])  # covariance matrix
# joint bivariate Gaussian density
fxy = 1/(2*pi)/sqrt(R.det())*exp((-Matrix([[x, y]])*R.inv()*Matrix([[x], [y]]))[0, 0]/2)

fcond = simplify(fxy/fyy)  # conditional density f(x|y)

Unfortunately, sympy cannot immediately integrate this without some hints. So, we need to define a positive variable ($u$) and substitute it into the integration

In [7]:
u=Symbol('u',positive=True) # define positive variable

fcond2=fcond.subs(sigma_x**2*sigma_y**2-sigma_xy**2,u) # substitute as hint to integrate
g=simplify(integrate(fcond2*x,(x,-oo,oo))) # evaluate integral
gg=g.subs( u,sigma_x**2 *sigma_y**2 - sigma_xy**2 ) # substitute back in
use( gg, simplify,level=2) # simplify exponent term
Out[7]:
$$\frac{\sigma_{xy} y}{\sigma_{y}^{2}}$$

Thus, by direct integration using sympy, we found

$$ \mathbb{ E}(X|Y) = Y \frac{\sigma_{xy}}{\sigma_{y}^{2}} $$

and this matches the prior result we obtained by direct minimization by assuming that $\mathbb{E}(X|Y) = \alpha Y$ and then solving for the optimal $\alpha$!

The importance of this result cannot be overstated: the one true and optimal $h_{opt}$ is a linear function of $Y$.

In other words, the linear form for $h(Y)$ that we assumed merely for convenience turns out to be the optimal result! This is a general fact that extends to all Gaussian problems. The link between linear functions and optimal estimation of Gaussian random variables is the most fundamental result in statistical signal processing, and it is exploited in everything from optimal filter design to adaptive signal processing.
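To see this numerically, here is a minimal Monte Carlo sketch (the covariance values below are arbitrary choices for illustration): a least-squares fit that is allowed an extra cubic term in $Y$ does essentially no better than the purely linear $\sigma_{xy} Y / \sigma_Y^2$.

In [ ]:
import numpy as np

np.random.seed(0)
sigma_xy, sigma_y2 = 0.8, 1.0                  # assumed covariance values
R = np.array([[2.0, sigma_xy],
              [sigma_xy, sigma_y2]])
X, Y = np.random.multivariate_normal([0, 0], R, 200000).T

lin = sigma_xy/sigma_y2*Y                      # the claimed optimal estimator

A = np.vstack([Y, Y**3]).T                     # allow a cubic term as well
coef = np.linalg.lstsq(A, X)[0]
print coef                                     # cubic coefficient comes out ~ 0
print np.mean((X - lin)**2), np.mean((X - A.dot(coef))**2)  # nearly identical MSE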

We can easily extend this result to non-zero mean problems by inserting the means in the right places as follows:

$$ \mathbb{ E}(X|Y) = \bar{X} + (Y-\bar{Y}) \frac{\sigma_{xy}}{\sigma_{y}^{2}} $$

where $\bar{X}$ is the mean of $X$ (same for $Y$).
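Here is a quick numerical sketch of this non-zero mean version (the means and covariances below are arbitrary choices): we condition on $Y$ falling in a thin slab around a chosen value and compare the sample conditional mean of $X$ against the formula.

In [ ]:
import numpy as np

np.random.seed(1)
Xbar, Ybar = 3.0, -1.0                         # assumed means
sigma_xy, sigma_y2 = 0.8, 1.0                  # assumed covariances
R = np.array([[2.0, sigma_xy],
              [sigma_xy, sigma_y2]])
X, Y = np.random.multivariate_normal([Xbar, Ybar], R, 500000).T

y0 = -0.5                                      # condition on Y near this value
idx = abs(Y - y0) < 0.05
print X[idx].mean(), Xbar + (y0 - Ybar)*sigma_xy/sigma_y2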

Summary

In this section, we showed that the conditional expectation for jointly Gaussian random variables is a linear function, which, by a bizarre coincidence, is also the easiest one to work with. This result is fundamental to all optimal linear filtering problems (e.g. the Kalman filter) and is the basis of most of the theory of stochastic processes used in signal processing. Up to this point, we have worked hard to develop the concepts we will need to unify our understanding of this field, and we have seen multiple approaches to these kinds of problems, most of which are far more difficult to compute. Thus, it is indeed just plain lucky that the most important distribution has a conditional expectation that is the easiest to compute, because it is a linear function. We will come back to this same result again and again as we work our way through these greater concepts.

References

This post was created using the nbconvert utility from the source IPython Notebook which is available for download from the main github site for this blog.

Monday, December 31, 2012

Worked Examples for Conditional Expectation

Introduction

Brzezniak (2000) is a great book because it approaches conditional expectation through a sequence of exercises, which is what we are trying to do here. The main difference is that Brzezniak takes a more abstract measure-theoretic approach to the same problems. Note that you do need to grasp the measure-theoretic approach to move into more advanced areas in stochastic processes, but for what we have covered so far, working the same problems in his text using our methods is illuminating. It always helps to have more than one way to solve any problem. I urge you to get a copy of his book or at least look at some pages on Google Books. I have numbered the examples to correspond with the book.

Examples

This is Example 2.1 from Brzezniak:

Three coins, 10p, 20p and 50p are tossed. The values of those coins that land heads up are added to work out the total amount. What is the expected total amount given that two coins have landed heads up?

In this case, we want to compute $\mathbb{E}(\xi|\eta)$ where

$$ \xi = 10 X_{10} + 20 X_{20} +50 X_{50} $$

where $X_i \in \{ 0,1\} $. Here, $\xi$ represents the total value of the heads-up coins, and $\eta$ represents the fact that only two of the three coins are heads up. Note

$$\eta = X_{10} X_{20} (1-X_{50})+ (1-X_{10}) X_{20} X_{50}+ X_{10} (1-X_{20}) X_{50} $$

is a function that is non-zero only when exactly two of the three coins are heads up. Each triple term catches one of the three possibilities; for example, the first term is non-zero only when the 10p and 20p are heads up and the 50p is tails up.

To compute the conditional expectation, we want to find a function $h$ of $\eta$ that minimizes the MSE

$$ \sum_{X\in\{0,1\}^3} \frac{1}{8} (\xi - h( \eta ))^2 $$

where the sum is taken over all possible triples of outcomes for $\{X_{10} , X_{20} ,X_{50}\}$ and the weight is $\frac{1}{8} = \frac{1}{2^3}$ since each coin has a $\frac{1}{2}$ chance of coming up heads.

Now, the question boils down to what function $h(\eta)$ should we try? Note that $\eta \in \{0,1\}$, so $h$ takes on only two values. Thus, we only have to try $h(\eta)=\alpha \eta$ and find $\alpha$. Writing this out gives,

$$ \sum_{X\in\{0,1\}^3} \frac{1}{8} (\xi - \alpha \eta )^2 $$

Differentiating with respect to $\alpha$ and setting the result to zero boils down to solving

$$\langle \xi , \eta \rangle = \alpha \langle \eta,\eta \rangle$$

where

$$ \langle \xi , \eta \rangle =\sum_{X\in\{0,1\}^3} \frac{1}{8} \xi \eta $$

This is tedious and a perfect job for sympy.

In [11]:
import sympy as S

X10 = S.Symbol('X10')
X20 = S.Symbol('X20')
X50 = S.Symbol('X50')

# eta is non-zero only when exactly two of the coins are heads up
eta = X10*X20*(1 - X50) + X10*(1 - X20)*X50 + (1 - X10)*X20*X50
# xi is the total value shown by the heads-up coins
xi = 10*X10 + 20*X20 + 50*X50

num=S.summation(xi*eta,(X10,0,1),(X20,0,1),(X50,0,1))
den=S.summation(eta*eta,(X10,0,1),(X20,0,1),(X50,0,1))
alpha=num/den
print alpha
160/3

This means that

$$ \mathbb{E}(\xi|\eta) = \frac{160}{3} \eta $$

which we can check with a quick simulation

In [12]:
import numpy as np
from numpy import array

x=np.random.randint(0,2,(3,5000))
print (160/3.,np.dot(x[:,x.sum(axis=0)==2].T,array([10,20,50])).mean())
(53.333333333333336, 53.177920685959272)
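This also agrees with a direct enumeration: given that exactly two coins are heads up, the total shown is 30, 60, or 70 with equal probability, so

$$ \mathbb{E}(\xi|\eta=1) = \frac{30+60+70}{3} = \frac{160}{3} \approx 53.33 $$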

Example

This is example 2.2:

Three coins, 10p, 20p and 50p are tossed as before. What is the conditional expectation of the total amount shown by the three coins given the total amount shown by the 10p and 20p coins only?

For this problem,

$$\eta = 30 X_{10} X_{20} + 20 (1-X_{10}) X_{20} + 10 X_{10} (1-X_{20}) $$

which takes on the four values $\{0,10,20,30\}$ and only considers the 10p and 20p coins. Here, we'll look for affine functions, $h(\eta) = a \eta + b $.

In [13]:
from sympy.abc import a,b

# eta is the total amount shown by the 10p and 20p coins only
eta = X10*X20*30 + X10*(1 - X20)*10 + (1 - X10)*X20*20
h = a*eta + b  # affine function of eta

J=S.summation((xi - h)**2 * S.Rational(1,8),(X10,0,1),(X20,0,1),(X50,0,1))

sol=S.solve( [S.diff(J,a), S.diff(J,b)],(a,b) )
print sol
{b: 25, a: 1}

This means that

$$ \mathbb{E}(\xi|\eta) = 25+ \eta $$

Since $\eta$ takes on only four values, $\{0,10,20,30\}$, we can write this out explicitly as

$$ \mathbb{E}(\xi|\eta=0) = 25 $$ $$ \mathbb{E}(\xi|\eta=10) = 35 $$ $$ \mathbb{E}(\xi|\eta=20) = 45 $$ $$ \mathbb{E}(\xi|\eta=30) = 55 $$

The following is a quick simulation to demonstrate this.

In [14]:
x=np.random.randint(0,2,(3,5000))  # random samples for 3 coins tossed
eta=np.dot(x[:2,:].T,array([10,20])) # sum of 10p and 20p

print np.dot(x[:,eta==0].T,array([10,20,50])).mean() # E(xi|eta=0)
print np.dot(x[:,eta==10].T,array([10,20,50])).mean()# E(xi|eta=10)
print np.dot(x[:,eta==20].T,array([10,20,50])).mean()# E(xi|eta=20)
print np.dot(x[:,eta==30].T,array([10,20,50])).mean()# E(xi|eta=30)
26.4120922832
34.5614035088
45.8196721311
55.3743104807
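These values also follow from a direct argument: the 50p coin is independent of $\eta$ (which only involves the 10p and 20p coins), so

$$ \mathbb{E}(\xi|\eta) = \eta + \mathbb{E}(50 X_{50}) = \eta + 25 $$

which is exactly the affine function found above.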

Example

This is Example 2.3

Note that "Lebesgue measure" on $[0,1]$ just means uniformly distributed on that interval. Also, note the the Piecewise object in sympy is not complete at this point in its development, so we'll have to work around that in the following.

In [15]:
x=S.Symbol('x')
c=S.Symbol('c')

xi = 2*x**2

# eta takes the value 1, 2, or 0 on successive thirds of [0,1]
eta = S.Piecewise((1, (x >= 0) & (x < S.Rational(1,3))),
                  (2, (x >= S.Rational(1,3)) & (x < S.Rational(2,3))),
                  (0, (x >= S.Rational(2,3)) & (x <= 1)))

h = a + b*eta + c*eta**2 
J=S.integrate((xi - h)**2 ,(x,0,1))

sol=S.solve( [S.diff(J,a), 
              S.diff(J,b),
              S.diff(J,c),
              ],
              (a,b,c) )
print sol

print S.piecewise_fold(h.subs(sol))
{c: 8/9, b: -20/9, a: 38/27}
Piecewise((2/27, x < 1/3), (14/27, x < 2/3), (38/27, x < 1))

Thus, collecting this result gives:

$$ \mathbb{E}(\xi|\eta) = \frac{38}{27} - \frac{20}{9}\eta + \frac{8}{9} \eta^2$$

which can be re-written as a piecewise function as

$$\mathbb{E}(\xi|\eta) =\begin{cases} \frac{2}{27} & \text{for}\: 0 < x < \frac{1}{3} \\ \frac{14}{27} & \text{for}\: \frac{1}{3} < x < \frac{2}{3} \\ \frac{38}{27} & \text{for}\: \frac{2}{3}<x < 1 \end{cases} $$

The following is a quick simulation to demonstrate this.

In [16]:
from matplotlib.pyplot import subplots  # needed if not already pulled in via pylab

x = np.random.rand(1000)
f, ax = subplots()
ax.hist(2*x**2,bins=array([0,1/3.,2/3.,1])**2*2,normed=True,alpha=.5)
ax.vlines([2/27.,14/27.,38/27.],0,ax.get_ylim()[1],linestyles='--')
ax.set_xlabel(r'$2 x^2$',fontsize=18);

This plot shows the intervals that correspond to the respective domains of $\eta$ with the vertical dotted lines showing the $\mathbb{E}(\xi|\eta) $ for that piece.
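As a cross-check, we can get the same three values by direct integration, since conditioning on each piece of $\eta$ just averages $2x^2$ over that third of the interval (assuming, as stated, that $x$ is uniform on $[0,1]$).

In [ ]:
import sympy as S

x = S.Symbol('x')
third = S.Rational(1, 3)
# conditional mean of 2x^2 on each third of [0,1]: integral over the piece
# divided by the probability (length) of that piece
for lo, hi in [(0, third), (third, 2*third), (2*third, 1)]:
    print S.integrate(2*x**2, (x, lo, hi))/(hi - lo)   # -> 2/27, 14/27, 38/27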

Example

This is Example 2.4

In [17]:
x,a=S.symbols('x,a')

xi = 2*x**2

half = S.Rational(1,2)

eta_0 = S.Piecewise((2, (x >= 0) & (x < half)),
                    (0, (x >= half) & (x <= 1)))

eta_1 = S.Piecewise((0, x < half),
                    (x, (x >= half) & (x <= 1)))


v=S.var('b:3') # coefficients for quadratic function of eta

h = a*eta_0 + (eta_1**np.arange(len(v))*v).sum()

J=S.integrate((xi - h)**2 ,(x,0,1))
sol=S.solve([J.diff(i) for i in v+(a,)],v+(a,))
hsol = h.subs(sol)

f=S.lambdify(x,hsol,'numpy')
print S.piecewise_fold(h.subs(sol))

t = np.linspace(0,1,51,endpoint=False)

fig,ax = subplots()

ax.plot(t, 2*t**2,label=r'$\xi=2 x^2$')
ax.plot(t,[f(i) for i in t],'-x',label=r'$\mathbb{E}(\xi|\eta)$')
ax.plot(t,map(S.lambdify(x,eta_0+eta_1),t),label=r'$\eta(x)$')
ax.set_ylim(ymax = 2.3)
ax.grid()
ax.legend(loc=0);
#ax.plot(t,map(S.lambdify(x,eta),t))
Piecewise((1/6, x < 1/2), (2*x**2, x <= 1))
Out[17]:
<matplotlib.legend.Legend at 0x4b2b910>

The figure shows $\mathbb{E}(\xi|\eta)$ against $\xi$ and $\eta$. Note that $\xi= \mathbb{E}(\xi|\eta)= 2 x^2$ when $x\in[\frac{1}{2},1]$. Assembling the solution gives,

$$\mathbb{E}(\xi|\eta) =\begin{cases} \frac{1}{6} & \text{for}\: 0 \le x < \frac{1}{2} \\ 2 x^2 & \text{for}\: \frac{1}{2} \le x \le 1 \end{cases}$$

This example warrants a more detailed explanation since $\eta$ is more complicated. The first question is why did we choose $h(\eta)$ as a quadratic function? Since $\xi$ is a squared function of $x$ and since $x$ is part of $\eta$, we chose a quadratic function so that $h(\eta)$ would contain an $x^2$ term in the domain where $\eta=x$. The motivation is that we are asking for a function of $\eta$ that most closely approximates $2x^2$. Obviously, the exact function is $h(x)=2 x^2$! Thus, we want $h(\eta)=2 x^2$ over the domain where $\eta=x$, which is $x\in[\frac{1}{2},1]$, and that is exactly what we have.

We could have used our inner product by considering two separate functions,

$$\eta_1 (x) = 2$$

where $x\in [0,\frac{1}{2}]$ and

$$\eta_2 (x) = x$$

where $x\in [\frac{1}{2},1]$. Thus, at the point of projection, we have

$$ \mathbb{E}((2 x^2 - 2 c) \cdot 2) = 0$$

which leads to

$$\int_0^{\frac{1}{2}} 2 x^2 \cdot 2 dx = \int_0^{\frac{1}{2}} c 2 \cdot 2 dx $$

and a solution for $c$,

$$ c = \frac{1}{12} $$

Assembling the solution for $x\in[0,\frac{1}{2}]$ gives

$$ \mathbb{E}(\xi|\eta) = 2 c = \frac{2}{12} = \frac{1}{6}$$

which matches the constant piece that sympy found.

We can do the same thing for the other piece, $\eta_2$,

$$ \mathbb{E}((2 x^2 - c x^2) \cdot x) = 0$$

which, by inspection, gives $c=2$. Thus, for $x\in[\frac{1}{2},1]$ , we have

$$ \mathbb{E}(\xi|\eta)= 2 x^2$$

which is what we had before.
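The same kind of direct check works for the constant piece here (again assuming $x$ uniform on $[0,1]$): it is just the average of $2x^2$ over $[0,\frac{1}{2}]$.

In [ ]:
import sympy as S

x = S.Symbol('x')
half = S.Rational(1, 2)
# average of 2x^2 over [0,1/2]; should reproduce the constant piece 1/6
print S.integrate(2*x**2, (x, 0, half))/half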

Example

This is Exercise 2.6

In [18]:
x, a = S.symbols('x,a')

xi = 2*x**2

half = S.Rational(1,2)

# eta = 1 - |2x - 1|, written out piecewise
eta = S.Piecewise((1 + (2*x - 1), (x >= 0) & (x < half)),
                  (1 - (2*x - 1), (x >= half) & (x <= 1)))

v=S.var('b:3') # assume h is quadratic in eta

h = (eta**np.arange(len(v))*v).sum()

J=S.integrate((xi - h)**2 ,(x,0,1))
sol=S.solve([J.diff(i) for i in v],v)
hsol = h.subs(sol)

print S.piecewise_fold(h.subs(sol))

t = np.linspace(0,1,51,endpoint=False)

fig,ax = subplots()
fig.set_size_inches(5,5)

ax.plot(t, 2*t**2,label=r'$\xi=2 x^2$')
ax.plot(t,[hsol.subs(x,i) for i in t],'-x',label=r'$\mathbb{E}(\xi|\eta)$')
ax.plot(t,map(S.lambdify(x,eta),t),label=r'$\eta(x)$')
ax.legend(loc=0)
ax.grid()
Piecewise((2*x**2 - 2*x + 1, x < 1/2), (2*x + (-2*x + 2)**2/2 - 1, x < 1))

The figure shows that the $\mathbb{E}(\xi|\eta)$ is continuous over the entire domain. The code above solves for the conditional expectation using optimization assuming that $h$ is a quadratic function of $\eta$, but we can also do it by using the inner product. Thus,

$$ \mathbb{E}\left((2 x^2 - h(\eta_1(x)) )\eta_1(x)\right)= \int_0^{\frac{1}{2}} (2 x^2 - h(\eta_1(x)) )\eta_1(x) dx = 0$$

where $\eta_1 = 2x $ for $x\in [0,\frac{1}{2}]$. We can re-write this in terms of $\eta_1$ as

$$ \int_0^1 \left(\frac{\eta_1^2}{2}-h(\eta_1)\right)\eta_1 d\eta_1 = 0$$

and the solution jumps right out as $h(\eta_1)=\frac{\eta_1^2}{2}$. Note that $\eta_1\in[0,1]$. Doing the same thing for the other piece,

$$ \eta_2 = 2 - 2 x, \hspace{1em} \forall x\in[\frac{1}{2},1]$$

gives,

$$ \int_0^1 \left(\frac{(2-\eta_2)^2}{2}-h(\eta_2)\right)\eta_2 d\eta_2 = 0$$

and again, the optimal $h(\eta_2)$ jumps right out as

$$ h(\eta_2) = \frac{(2-\eta_2)^2}{2} , \hspace{1em} \forall \eta_2\in[0,1]$$

and since $\eta_1$ and $\eta_2$ represent the same variable over the same domain, and each piece carries half of the probability mass, we average the two pieces to get the full solution:

$$ h(\eta) = \frac{1}{2} \left( \frac{\eta^2}{2} + \frac{(2-\eta)^2}{2}\right) = \frac{1}{2} \left( 2 - 2 \eta + \eta^2\right) $$

and then back-substituting each piece for $x$ produces the same solution as sympy.
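As a sanity check on the averaging step, note that for a given value of $\eta$ the two pre-images are $x=\eta/2$ and $x=1-\eta/2$, each carrying equal probability, so the conditional mean can be computed directly (a short sympy sketch):

In [ ]:
import sympy as S

eta = S.Symbol('eta')
# average of 2x^2 over the two equally likely pre-images x=eta/2 and x=1-eta/2
print S.expand((2*(eta/2)**2 + 2*(1 - eta/2)**2)/2)   # -> eta**2/2 - eta + 1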

Example

This is Exercise 2.14

In [19]:
x,a=S.symbols('x,a')
half = S.Rational(1,2)

xi = 2*x**2
eta = S.Piecewise((2*x, (x >= 0) & (x < half)),
                  (2*x - 1, x >= half))

v=S.var('b:3')

h = (eta**np.arange(len(v))*v).sum()

J=S.integrate((xi - h)**2 ,(x,0,1))
sol=S.solve([J.diff(i) for i in v],v)
hsol = h.subs(sol)

print S.piecewise_fold(h.subs(sol))

t = np.linspace(0,1,51,endpoint=False)

fig,ax = subplots()
fig.set_size_inches(5,5)

ax.plot(t, 2*t**2,label=r'$\xi=2 x^2$')
ax.plot(t,[hsol.subs(x,i) for i in t],'-x',label=r'$\mathbb{E}(\xi|\eta)$')
ax.plot(t,map(S.lambdify(x,eta),t),label=r'$\eta(x)$')
ax.legend(loc=0)
ax.grid()
Piecewise((2*x**2 + x + 1/4, x < 1/2), (x + (2*x - 1)**2/2 - 1/4, 1/2 <= x))

As before, using the inner product for this problem, gives:

$$ \int_0^1 \left(\frac{\eta_1^2}{2}-h(\eta_1)\right)\eta_1 d\eta_1=0$$

and the solution jumps right out as $$h(\eta_1)=\frac{\eta_1^2}{2} , \hspace{1em} \forall \eta_1\in[0,1]$$

where $\eta_1(x)=2x$. Doing the same thing for $\eta_2=2x-1$ gives,

$$ \int_0^1 \left(\frac{(1+\eta_2)^2}{2}-h(\eta_2)\right)\eta_2 d\eta_2=0$$

with

$$h(\eta_2)=\frac{(1+\eta_2)^2}{2} , \hspace{1em} \forall \eta_2\in[0,1]$$

and then averaging these as before (each piece again carries half of the probability mass) gives the full solution:

$$ h(\eta)= \frac{1}{2}\left( \frac{\eta^2}{2} + \frac{(1+\eta)^2}{2}\right) = \frac{1}{4} + \frac{\eta}{2} + \frac{\eta^2}{2}$$

Back-substituting each piece for $x$ produces the same solution as sympy.
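The same direct check applies here: the pre-images of $\eta$ are $x=\eta/2$ and $x=(\eta+1)/2$, each with equal probability, so the averaged conditional mean can be computed in one line (a short sympy sketch):

In [ ]:
import sympy as S

eta = S.Symbol('eta')
# average of 2x^2 over the two equally likely pre-images x=eta/2 and x=(eta+1)/2
print S.expand((2*(eta/2)**2 + 2*((eta + 1)/2)**2)/2)  # -> eta**2/2 + eta/2 + 1/4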

In [20]:
xs = np.random.rand(100)
print np.mean([(2*i**2-hsol.subs(x,i))**2 for i in xs])
print S.integrate((2*x**2-hsol)**2,(x,0,1)).evalf()
0.266422848567040
0.270833333333333

Summary

We worked out some of the great examples in Brzezniak's book using our methods as a way to show multiple ways to solve the same problem. In particular, comparing Brzezniak's more measure-theoretic methods to our less abstract techniques is a great way to get a handle on the concepts you will need for more advanced study in stochastic processes.

As usual, the corresponding IPython Notebook for this post is available for download here

Comments and corrections welcome!

References

  • Brzezniak, Zdzislaw, and Tomasz Zastawniak. Basic stochastic processes: a course through exercises. Springer, 2000.