An Intuitive Explanation of the OLS Estimator in Both Traditional and Matrix Algebra

Posted by Mischa Fisher in Econometrics   
Mon 19 October 2015

The Ordinary Least Squares estimator, $\hat{\beta}$, is the first thing one learns in econometrics. It has two forms, one in standard algebra and one in matrix algebra, but it's important to remember that the two are equivalent:

$$\hat{\beta} = \frac{\widehat{\mathrm{cov}}(x,y)}{\mathrm{var}(x)} = \mathbf{(X'X)^{-1}X'Y}$$

I think most students find it extremely easy to get lost in the notation and miss the link to real-world data. The following exercise is a helpful way I've found to keep the connection between the traditional 'simple' notation, the matrix algebra notation, and the underlying data and arithmetic that go into the ordinary linear regression estimator.
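As a quick preview of where we're headed, here's a minimal sketch in R (with made-up toy numbers, not the car data used below) showing that the two forms produce the same slope:

# Toy data (hypothetical numbers, purely to illustrate the equivalence)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# 'Simple' form: sample covariance over sample variance
cov(x, y) / var(x)                 # 0.6

# Matrix form: (X'X)^-1 X'Y, with a column of ones for the intercept
X <- cbind(1, x)
solve(t(X) %*% X) %*% t(X) %*% y   # second element is the same 0.6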

Deriving the Algebraic Notation for the Simple Bivariate Model

The familiar simple bivariate model expresses each observation of the dependent variable as a function of an intercept, a regression coefficient on the independent variable, and an error term (respectively):

$$y_{i} = b_{0} + b_{1}x_{i} + e_{i}$$

Where we wish to minimize the sum of squared errors (SSE):

$$\text{minimize:} \quad SSE = \sum_{i=1}^{N} e_{i}^{2}$$

To do so we isolate the error of the regression to make it a function of the other terms:

$$e_{i} = y_{i} - b_{0} - b_{1}x_{i}$$

Then substitute:

$$\text{minimize:} \quad \sum_{i=1}^{N} (y_{i} - b_{0} - b_{1}x_{i})^{2}$$

For our purposes, we'll ignore the derivation of the intercept and take it as a given that it is $\bar{y} - \hat{\beta}_{1}\bar{x}$, and just solve for the slope coefficient $\hat{\beta}_{1}$. To minimize the errors, we need to take the partial derivative with respect to $b_{1}$:
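(For completeness, the intercept result we're taking as a given follows from the analogous first-order condition with respect to $b_{0}$:)

$$\frac{\partial SSE}{\partial b_{0}} = -2\sum_{i=1}^{N}(y_{i} - b_{0} - b_{1}x_{i}) = 0 \quad \Rightarrow \quad b_{0} = \bar{y} - b_{1}\bar{x}$$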

$$\frac{\partial SSE}{\partial b_{1}} = \frac{\partial}{\partial b_{1}} \left[ \sum_{i=1}^{N} (y_{i} - b_{0} - b_{1}x_{i})^{2} \right]$$

Move the summation operator through, since the derivative of a sum is equal to the sum of the derivatives:

$$\frac{\partial SSE}{\partial b_{1}} = \sum_{i=1}^{N} \left[ \frac{\partial}{\partial b_{1}} (y_{i} - b_{0} - b_{1}x_{i})^{2} \right]$$

Take the derivative (using the chain rule), then set it equal to 0 for the first-order condition to find the min/max:

$$\frac{\partial SSE}{\partial b_{1}} = -2 \sum_{i=1}^{N} x_{i}(y_{i} - b_{0} - b_{1}x_{i}) = 0$$

Then multiply by $-\frac{1}{2}$ to simplify:

$$0 = \sum_{i=1}^{N} x_{i}(y_{i} - b_{0} - b_{1}x_{i})$$

Substitute the solution for the intercept, $b_{0}$, that we took as a given above:

$$0 = \sum_{i=1}^{N} x_{i}\left(y_{i} - (\bar{y} - \hat{\beta}_{1}\bar{x}) - \hat{\beta}_{1}x_{i}\right)$$

Then rearrange and distribute the summation operator to solve for $\hat{\beta}_{1}$:

$$\hat{\beta}_{1} = \frac{\sum_{i=1}^{N} (y_{i} - \bar{y})x_{i}}{\sum_{i=1}^{N} (x_{i} - \bar{x})x_{i}}$$

Which is algebraically equivalent to:

$$\frac{\widehat{\mathrm{cov}}(x,y)}{\mathrm{var}(x)} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}} = \hat{\beta}_{1}$$
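This form maps directly onto R's built-ins, as in the one-line helper below (the name slope_hat is my own, not a built-in). One caveat worth knowing: cov() and var() both use the $\frac{1}{n-1}$ sample versions rather than $\frac{1}{n}$, but the factor cancels in the ratio, so the slope is unaffected.

# beta_1-hat as sample covariance over sample variance;
# the 1/(n-1) factors in cov() and var() cancel in the ratio
slope_hat <- function(x, y) cov(x, y) / var(x)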

Deriving the Matrix Algebra Notation

Despite typically not being taught until the senior undergraduate or graduate level, the derivation in matrix notation is actually a little more straightforward, as long as one remembers the rules of matrix algebra (which I typically do not).

First, visualize the linear model again, but this time in matrix notation, where $\mathbf{Y}$ and $\mathbf{e}$ are vectors of the observations and errors, and $\mathbf{X}$ is the matrix of the independent variables and their observations:

$$\mathbf{Y = XB + e}$$

Just as before, we want to minimize the sum of squared errors:

$$\text{minimize:} \quad SSE = \mathbf{e'e}$$

Rearranging and substituting yields:

$$SSE = \mathbf{(Y - XB)'(Y - XB)}$$

Push the transpose operator through:

$$SSE = \mathbf{(Y' - B'X')(Y - XB)}$$

Multiply the whole equation out:

$$SSE = \mathbf{Y'Y - Y'XB - B'X'Y + B'X'XB}$$

Simplify the two equivalent terms in the middle (since $\mathbf{Y'XB}$ is a scalar, it equals its own transpose $\mathbf{B'X'Y}$):

$$SSE = \mathbf{Y'Y - 2Y'XB + B'X'XB}$$

Then, again as before, we take the partial derivative for the first-order condition:

$$\frac{\partial SSE}{\partial \mathbf{B}} = \frac{\partial}{\partial \mathbf{B}} \left( \mathbf{Y'Y - 2Y'XB + B'X'XB} \right)$$

And set to 0 to find the minimum:

$$\frac{\partial SSE}{\partial \mathbf{B}} = \mathbf{-2X'Y + 2X'XB} = 0$$

Then isolate $\mathbf{B}$ and simplify:

$$\mathbf{B} = \mathbf{(X'X)^{-1}X'Y}$$
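A small helper of my own making this formula concrete in R (ols_coef is not a built-in): passing two arguments to solve() solves the normal equations $\mathbf{(X'X)B = X'Y}$ directly, which avoids forming the explicit inverse.

# OLS coefficients via the normal equations (X'X)B = X'Y;
# solve(A, b) solves the linear system rather than inverting A
ols_coef <- function(X, y) solve(t(X) %*% X, t(X) %*% y)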

Getting Real World Data

Pulling data from autotrader.com for Honda CR-Vs, I came up with 9 observations across three simple variables: Price, Mileage, and Year.

PRICE     YEAR    MILEAGE
$19998    2012     16568
$16995    2011     68399
$22491    2013     23813
$27571    2014     15156
$25998    2014     17201
$24000    2012     28946
$15495    2010     87440
$13290    2007     83060
 $8449    2006    153549

Using car price as a function of car year in a simple bivariate model, one can find the OLS slope coefficient in R using the simple call:

# Honda CR-V observations entered as vectors
options("scipen" = 100, "digits" = 4)   # suppress scientific notation in output
price   <- c(19998, 16995, 22491, 27571, 25998, 24000, 15495, 13290, 8449)
mileage <- c(16568, 68399, 23813, 15156, 17201, 28946, 87440, 83060, 153549)
year    <- c(2012, 2011, 2013, 2014, 2014, 2012, 2010, 2007, 2006)
crv <- data.frame(mileage, price, year)
lm(price ~ year, data = crv)            # regress price on year

Which yields the basic result that the asking price of the car goes down by $2103 for each year it gets older.
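The same slope falls out of the covariance-over-variance form from the first derivation, using the data frame we just built:

# Same slope via the 'simple' form: cov(x, y) / var(x)
with(crv, cov(year, price) / var(year))   # 2102.83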

The Bivariate Example Using Simple Algebra and Arithmetic

This is the important part. It's tedious, but straightforward, and writing it out all by hand will really remind you that computers are remarkable tools.

The derivation results from the first section tell us:

$$\frac{\widehat{\mathrm{cov}}(x,y)}{\mathrm{var}(x)} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}} = \hat{\beta}_{1}$$

In this case $\bar{x}$ is the sample mean of the years, which is 2011, and $\bar{y}$ is the sample mean of the prices, which is approximately $19365.

Expanding the estimator out to its completely tangible form yields the ridiculously cumbersome equation:

$$\frac{(2012-2011)(19998-19365) + (2011-2011)(16995-19365) + (2013-2011)(22491-19365) + (2014-2011)(27571-19365) + (2014-2011)(25998-19365) + (2012-2011)(24000-19365) + (2010-2011)(15495-19365) + (2007-2011)(13290-19365) + (2006-2011)(8449-19365)}{(2012-2011)^{2} + (2011-2011)^{2} + (2013-2011)^{2} + (2014-2011)^{2} + (2014-2011)^{2} + (2012-2011)^{2} + (2010-2011)^{2} + (2007-2011)^{2} + (2006-2011)^{2}}$$

Which simplifies to:

$$\frac{138787}{66} = \$2102.83$$

Almost exactly what R told us the coefficient would be!
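And the sums themselves can be checked in R rather than by hand, using the vectors defined earlier:

# Numerator and denominator of the expanded estimator
sum((year - mean(year)) * (price - mean(price)))   # 138787
sum((year - mean(year))^2)                         # 66
138787 / 66                                        # 2102.83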

The Bivariate Example Using Matrix Algebra

Things get a little trickier here, but the process is the same. The bivariate estimator is:

$$\mathbf{\hat{B}} = \mathbf{(X'X)^{-1}X'Y}$$

Substituting our real-world data for the general matrices leaves the form:

$$\left( {\begin{bmatrix} 1 & 2012 \\ 1 & 2011 \\ 1 & 2013 \\ 1 & 2014 \\ 1 & 2014 \\ 1 & 2012 \\ 1 & 2010 \\ 1 & 2007 \\ 1 & 2006 \end{bmatrix}}' \begin{bmatrix} 1 & 2012 \\ 1 & 2011 \\ 1 & 2013 \\ 1 & 2014 \\ 1 & 2014 \\ 1 & 2012 \\ 1 & 2010 \\ 1 & 2007 \\ 1 & 2006 \end{bmatrix} \right)^{-1} {\begin{bmatrix} 1 & 2012 \\ 1 & 2011 \\ 1 & 2013 \\ 1 & 2014 \\ 1 & 2014 \\ 1 & 2012 \\ 1 & 2010 \\ 1 & 2007 \\ 1 & 2006 \end{bmatrix}}' \begin{bmatrix} 19998 \\ 16995 \\ 22491 \\ 27571 \\ 25998 \\ 24000 \\ 15495 \\ 13290 \\ 8449 \end{bmatrix} = \mathbf{\hat{B}}$$

Which simplifies from four matrices down to two:

$$\begin{bmatrix} 61274.67171717 & -30.46969697 \\ -30.46969697 & 0.01515152 \end{bmatrix} \begin{bmatrix} 174287 \\ 350629944 \end{bmatrix}$$

Which multiplies through to:

$$\begin{bmatrix} -4209432.61 \\ 2102.83 \end{bmatrix}$$

The first row is the intercept of our regression model; the second is our estimated slope coefficient, matching what we calculated both in R and by hand using the expanded summation operator!
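The whole matrix computation also takes just two lines in R, using the vectors defined earlier; here solve() with a single argument inverts $\mathbf{X'X}$, mirroring the formula literally:

# B-hat = (X'X)^-1 X'Y with the real CR-V data
X <- cbind(1, year)                    # design matrix: intercept column + year
solve(t(X) %*% X) %*% t(X) %*% price   # intercept and slope, as above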

The Takeaway

Derivations, proofs, and advanced topics in econometrics can be really tricky to wrap your head around at first. By going through the motions of making real world data stand in for the notation, it can a) help you really understand the proofs and derivations, and b) remind you just how many trillions of human hours computers save in answering complicated technical questions!
