# An Intuitive Explanation of the OLS Estimator for both Traditional and Matrix Algebra

The Ordinary Least Squares (OLS) estimator, $\hat{\beta}$, is the first thing one learns in econometrics. It has two forms, one in standard algebra and one in matrix algebra, but it's important to remember that the two are equivalent:

### $\hat{\beta} = \frac{\hat{cov}(x,y)}{var(x)} = \mathbf{({X}'X)^{-1}{X}'Y}$

I think most students find it extremely easy to get lost in the notation and miss the link to real world data. The following exercise is one I found helpful for keeping the link between traditional 'simple' notation, matrix algebra notation, and the underlying data and arithmetic that go into the ordinary least squares estimator.

### Deriving the Algebraic Notation for the Simple Bivariate Model

The familiar simple bivariate model expresses each observation of the dependent variable as a function of an intercept, a regression coefficient multiplying the independent variable, and an error term:

$y_{i} = b_{0} + b_{1}x_{i} + e_{i}$

Where we wish to minimize the sum of squared errors (SSE):

$minimize: SSE = \sum_{i=1}^{N} e_{i}^{2}$

To do so we isolate the error of the regression to make it a function of the other terms:

$e_{i} = y_{i} - b_{0} - b_{1}x_{i}$

Then substitute:

$minimize: \sum_{i=1}^{N} (y_{i} - b_{0} - b_{1}x_{i})^{2}$
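To make the objective concrete, here is a minimal R sketch of the SSE as a function of candidate coefficients. The helper name `sse` and the two candidate lines are my own illustrative choices; the `year` and `price` values are taken from the CR-V table further down in this post.

```r
# Year and price from the CR-V table further down in this post
year  <- c(2012, 2011, 2013, 2014, 2014, 2012, 2010, 2007, 2006)
price <- c(19998, 16995, 22491, 27571, 25998, 24000, 15495, 13290, 8449)

# Sum of squared errors for a candidate intercept b0 and slope b1
# (`sse` is an illustrative helper, not a base R function)
sse <- function(b0, b1, x, y) {
  e <- y - b0 - b1 * x   # residual for each observation
  sum(e^2)
}

# Candidate 1: a flat line through the mean price (slope 0)
sse_flat <- sse(mean(price), 0, year, price)

# Candidate 2: an arbitrary slope of 2000 through the point of means
sse_slope <- sse(mean(price) - 2000 * mean(year), 2000, year, price)

sse_slope < sse_flat   # TRUE: the sloped line leaves smaller squared errors
```

OLS simply picks the pair of coefficients for which this function is as small as possible.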

For our purposes, we'll ignore the derivation of the intercept, take it as given that it is $\bar{y} - \hat{\beta_{1}}\bar{x}$, and just solve for the slope coefficient $\hat{\beta_{1}}$. To minimize the sum of squared errors, we need to take the partial derivative with respect to $b_{1}$:

$\frac{\partial SSE }{\partial b_{1}} = \frac{\partial }{\partial b_{1}} \left [ \sum_{i=1}^{N} (y_{i} - b_{0} - b_{1}x_{i})^{2} \right ]$

Move the derivative inside the summation operator, since the derivative of a sum is equal to the sum of the derivatives:

$\frac{\partial SSE }{\partial b_{1}} = \sum_{i=1}^{N} \left [ \frac{\partial }{\partial b_{1}} (y_{i} - b_{0} - b_{1}x_{i})^{2} \right ]$

Take the derivative (using the chain rule), then set it equal to 0 for the first order condition to find the min/max:

$\frac{\partial SSE }{\partial b_{1}} = -2 \sum_{i=1}^{N} x_{i}(y_{i} - b_{0} - b_{1}x_{i}) = 0$
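To spell out the chain-rule step for a single term, let $u_{i}$ (a shorthand introduced here, not part of the notation above) stand for the residual $y_{i} - b_{0} - b_{1}x_{i}$. Then:

$\frac{\partial }{\partial b_{1}} u_{i}^{2} = 2u_{i} \cdot \frac{\partial u_{i}}{\partial b_{1}} = 2(y_{i} - b_{0} - b_{1}x_{i})(-x_{i})$

Summing over $i$ and pulling the constant $-2$ out front gives the first order condition above.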

Then multiply by $- \frac{1}{2}$ to simplify:

$0 = \sum_{i=1}^{N} x_{i}(y_{i} - b_{0} - b_{1}x_{i})$

Substitute the solution for the intercept, $b_{0}$ , that we took as a given above:

$0 = \sum_{i=1}^{N} x_{i}(y_{i} - (\bar{y} - \hat{\beta_{1}}\bar{x} ) - \hat{\beta_{1}}x_{i})$

Then rearrange and distribute the summation operator to solve for $\hat{\beta_{1}}$ :

$\hat{\beta_{1}} = \frac{\sum_{i=1}^{N} (y_{i} - \bar{y} )x_{i}}{ \sum_{i=1}^{N} (x_{i} - \bar{x})x_{i} }$
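For readers who want the intermediate steps, one way to fill in the rearrangement is to group the terms inside the sum as deviations from the means:

$0 = \sum_{i=1}^{N} x_{i} \left [ (y_{i} - \bar{y}) - \hat{\beta_{1}}(x_{i} - \bar{x}) \right ]$

$\hat{\beta_{1}} \sum_{i=1}^{N} (x_{i} - \bar{x})x_{i} = \sum_{i=1}^{N} (y_{i} - \bar{y})x_{i}$

Dividing both sides by the sum on the left gives the ratio above.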

Which is algebraically equivalent to:

$\frac{\hat{cov}(x,y)}{var(x)} = \frac{ \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y} )}{ \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^{2} } = \hat{\beta_{1}}$
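The equivalence rests on the fact that deviations from a mean sum to zero, so $\sum (y_{i} - \bar{y}) = 0$ and $\sum (x_{i} - \bar{x}) = 0$. A short sketch of the two steps:

$\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y}) = \sum_{i=1}^{n} x_{i}(y_{i} - \bar{y}) - \bar{x} \sum_{i=1}^{n} (y_{i} - \bar{y}) = \sum_{i=1}^{n} (y_{i} - \bar{y})x_{i}$

$\sum_{i=1}^{n} (x_{i} - \bar{x})^{2} = \sum_{i=1}^{n} (x_{i} - \bar{x})x_{i} - \bar{x} \sum_{i=1}^{n} (x_{i} - \bar{x}) = \sum_{i=1}^{n} (x_{i} - \bar{x})x_{i}$

The $\frac{1}{n}$ factors in the numerator and denominator cancel, so including them changes nothing about the ratio.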

### Deriving the Matrix Algebra Notation

Although it typically isn't taught until the senior undergraduate or graduate level, the derivation in matrix notation is actually a little more straightforward, as long as one remembers the rules of matrix algebra (which I typically do not).

First, visualize the linear model again, but this time in matrix notation, where **Y** and **e** are vectors of the observations and errors, **X** is the matrix of the independent variables and their observations, and **B** is the vector of coefficients:

$\mathbf{Y = XB + e}$
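Written out for the bivariate case, the model looks like the following, assuming the first column of **X** is a column of ones that carries the intercept (which is how the worked example later in this post builds it):

$\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{bmatrix} = \begin{bmatrix} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{N} \end{bmatrix} \begin{bmatrix} b_{0} \\ b_{1} \end{bmatrix} + \begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{N} \end{bmatrix}$

Multiplying out any single row recovers the familiar $y_{i} = b_{0} + b_{1}x_{i} + e_{i}$.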

Just as before, we want to minimize the sum of squared errors:

$minimize: SSE = \mathbf{e'e}$
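If it isn't obvious why $\mathbf{e'e}$ is the same sum of squared errors as before, writing the product out may help:

$\mathbf{e'e} = \begin{bmatrix} e_{1} & e_{2} & \cdots & e_{N} \end{bmatrix} \begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{N} \end{bmatrix} = \sum_{i=1}^{N} e_{i}^{2}$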

Rearranging and substituting yields:

$SSE = \mathbf{(Y - XB)'(Y - XB)}$

Push the transpose operator through:

$SSE = \mathbf{(Y' - B'X')(Y - XB)}$

Multiply the whole equation out:

$SSE = \mathbf{Y'Y - Y'XB - B'X'Y + B'X'XB}$

Simplify the two equivalent terms in the middle:

$SSE = \mathbf{Y'Y - 2Y'XB + B'X'XB}$
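The two middle terms can be combined because each is a $1 \times 1$ scalar, and a scalar equals its own transpose. Applying the transpose rule for products:

$\mathbf{(Y'XB)' = B'X'Y}$

So $\mathbf{Y'XB}$ and $\mathbf{B'X'Y}$ are the same number, and together they contribute $\mathbf{-2Y'XB}$.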

Then, again as before, we'll take the partial derivative for the first order condition:

$\frac{\partial SSE }{\mathbf{\partial B}} = \frac{\partial }{\mathbf{\partial B}} (\mathbf{Y'Y - 2Y'XB + B'X'XB)}$

And set to 0 to find the minimum:

$\frac{\partial SSE }{\mathbf{\partial B}} = \mathbf{-2X'Y + 2X'XB} = 0$
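This step leans on two standard matrix calculus rules, stated here in the convention where the derivative with respect to a column vector is itself a column vector. The symbols $\mathbf{a}$ (a constant vector) and $\mathbf{A}$ (a constant symmetric matrix) are generic placeholders, not part of the model:

$\frac{\partial (\mathbf{a'B})}{\mathbf{\partial B}} = \mathbf{a} \qquad \frac{\partial (\mathbf{B'AB})}{\mathbf{\partial B}} = \mathbf{2AB}$

Setting $\mathbf{a = X'Y}$ (so that $\mathbf{a'B = Y'XB}$) and $\mathbf{A = X'X}$, which is symmetric, produces the two terms above. The $\mathbf{Y'Y}$ term does not involve $\mathbf{B}$, so its derivative is zero.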

Then isolate $\mathbf{B}$ and simplify:

$\mathbf{B} = \mathbf{(X'X)^{-1}X'Y}$
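The isolation step can be sketched in two lines. Divide the first order condition by 2 and move $\mathbf{X'Y}$ to the other side, then premultiply both sides by $\mathbf{(X'X)^{-1}}$:

$\mathbf{X'XB = X'Y}$

$\mathbf{(X'X)^{-1}X'XB = (X'X)^{-1}X'Y}$

Since $\mathbf{(X'X)^{-1}X'X}$ is the identity matrix, the left side reduces to $\mathbf{B}$. This assumes $\mathbf{X'X}$ is invertible, which holds when the columns of $\mathbf{X}$ are linearly independent.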

### Getting Real World Data

Pulling data from autotrader.com for Honda CR-Vs, I came up with 9 observations across three simple variables: Price, Mileage, and Year.

| PRICE (USD) | YEAR | MILEAGE |
|---|---|---|
| 19998 | 2012 | 16568 |
| 16995 | 2011 | 68399 |
| 22491 | 2013 | 23813 |
| 27571 | 2014 | 15156 |
| 25998 | 2014 | 17201 |
| 24000 | 2012 | 28946 |
| 15495 | 2010 | 87440 |
| 13290 | 2007 | 83060 |
| 8449 | 2006 | 153549 |

Using car price as a function of car year in a simple bivariate model, one can find the OLS slope coefficient in R using the simple call:

```
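options("scipen" = 100, "digits" = 4)  # plain (non-scientific) output, 4 significant digits
# Asking price (USD), odometer mileage, and model year for the nine CR-Vs
price <- c(19998, 16995, 22491, 27571, 25998, 24000, 15495, 13290, 8449)
mileage <- c(16568, 68399, 23813, 15156, 17201, 28946, 87440, 83060, 153549)
year <- c(2012, 2011, 2013, 2014, 2014, 2012, 2010, 2007, 2006)
crv <- data.frame(mileage, price, year)
# Regress price on year; the coefficient on `year` is the OLS slope
lm(price ~ year, data = crv)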
```

This yields a slope of about 2103: the asking price of the car goes down by roughly \$2,103 for each year older it is.

### The Bivariate Example Using Simple Algebra and Arithmetic

This is the important part. It's tedious but straightforward, and writing it all out by hand will really remind you that computers are remarkable tools.

The result of the first derivation above tells us:

$\frac{\hat{cov}(x,y)}{var(x)} = \frac{ \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y} )}{ \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^{2} } = \hat{\beta_{1}}$

In this case $\bar{x}$ is the sample mean for years, which is 2011, and $\bar{y}$ is the sample mean for price, which is about \$19,365.

Expanding the estimator out to its completely tangible form yields the ridiculously cumbersome equation:

###### $\frac{((2012-2011)(19998-19365))+ ((2011-2011)(16995-19365))+ ((2013-2011)(22491-19365))+ ((2014-2011)(27571-19365))+ ((2014-2011)(25998-19365))+ ((2012-2011)(24000-19365))+ ((2010-2011)(15495-19365))+ ((2007-2011)(13290-19365))+ ((2006-2011)(8449-19365))}{((2012-2011)^{2})+((2011-2011)^{2})+((2013-2011)^{2})+((2014-2011)^{2})+((2014-2011)^{2})+((2012-2011)^{2})+((2010-2011)^{2})+((2007-2011)^{2})+((2006-2011)^{2})}$

Which simplifies to

$\frac{138787}{66} = \$2102.83$

Almost exactly what R told us the coefficient would be!
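The same arithmetic can be checked in a few lines of R. This is a minimal sketch rather than anything `lm()` does internally; the variable names (`dx`, `dy`, `beta1_hat`, and so on) are my own, and the data are the year and price columns from the table above.

```r
year  <- c(2012, 2011, 2013, 2014, 2014, 2012, 2010, 2007, 2006)
price <- c(19998, 16995, 22491, 27571, 25998, 24000, 15495, 13290, 8449)

dx <- year - mean(year)     # deviations from the mean year
dy <- price - mean(price)   # deviations from the mean price

numerator   <- sum(dx * dy)   # sum of cross products of deviations
denominator <- sum(dx^2)      # sum of squared deviations in year
beta1_hat   <- numerator / denominator

# cov() and var() both divide by n - 1, so their ratio is the same slope
beta1_cov <- cov(year, price) / var(year)

# Intercept from the formula taken as given in the derivation
beta0_hat <- mean(price) - beta1_hat * mean(year)

c(intercept = beta0_hat, slope = beta1_hat)
```

Note that it does not matter whether the covariance and variance divide by $n$ or $n - 1$, since the factor cancels in the ratio.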

### The Bivariate Example Using Matrix Algebra

Things get a little trickier here, but the process is the same. The bivariate estimator is:

$\mathbf{\hat{B}} = \mathbf{(X'X)^{-1}X'Y}$

Substituting our real world data for the general matrices leaves the form:

$\left({{\begin{bmatrix} 1 & 2012 \\ 1 & 2011 \\ 1 & 2013 \\ 1 & 2014 \\ 1 & 2014 \\ 1 & 2012 \\ 1 & 2010 \\ 1 & 2007 \\ 1 & 2006 \end{bmatrix}}}'{\begin{bmatrix} 1 & 2012 \\ 1 & 2011 \\ 1 & 2013 \\ 1 & 2014 \\ 1 & 2014 \\ 1 & 2012 \\ 1 & 2010 \\ 1 & 2007 \\ 1 & 2006 \end{bmatrix}}\right)^{-1} {{\begin{bmatrix} 1 & 2012 \\ 1 & 2011 \\ 1 & 2013 \\ 1 & 2014 \\ 1 & 2014 \\ 1 & 2012 \\ 1 & 2010 \\ 1 & 2007 \\ 1 & 2006 \end{bmatrix}}}' \begin{bmatrix} 19998\\ 16995\\ 22491\\ 27571\\ 25998\\ 24000\\ 15495\\ 13290\\ 8449 \end{bmatrix} = \mathbf{\hat{B}}$

Which simplifies from four matrices down to two, $\mathbf{(X'X)^{-1}}$ and $\mathbf{X'Y}$:

$\begin{bmatrix} 61274.67171717 & -30.469696969696 \\ -30.469696969696 & 0.015151515151515 \end{bmatrix} \begin{bmatrix} 174287 \\ 350629944 \end{bmatrix}$
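These two intermediate matrices can be reproduced in R with the base functions `t()`, `%*%`, and `solve()`. The sketch below assumes the design matrix is built as a column of ones followed by `year`, as in the expanded form above; the names `XtX`, `XtX_inv`, and `XtY` are my own.

```r
year  <- c(2012, 2011, 2013, 2014, 2014, 2012, 2010, 2007, 2006)
price <- c(19998, 16995, 22491, 27571, 25998, 24000, 15495, 13290, 8449)

# Design matrix: a column of ones for the intercept, then year
X <- cbind(1, year)

XtX     <- t(X) %*% X       # 2 x 2 matrix X'X
XtX_inv <- solve(XtX)       # its inverse, (X'X)^{-1}
XtY     <- t(X) %*% price   # 2 x 1 matrix X'Y

XtX_inv
XtY
```

Printing `XtX_inv` and `XtY` should show the two matrices above, up to however many digits R is set to display.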

Which multiplies through to:

$\begin{bmatrix} -4209432.61\\ 2102.83 \end{bmatrix}$

The first row is the intercept in our regression model; the second is our estimated slope coefficient, matching the value calculated both in R and by hand using the expanded summation operator!
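As a final check, the whole matrix estimator can be written in one line of R and compared against `lm()`. This is a sketch under the same assumptions as before (a column of ones plus `year` as the design matrix); `B_hat`, `B_alt`, and `fit` are names I chose for illustration.

```r
year  <- c(2012, 2011, 2013, 2014, 2014, 2012, 2010, 2007, 2006)
price <- c(19998, 16995, 22491, 27571, 25998, 24000, 15495, 13290, 8449)

# Design matrix with named columns so the output is labeled
X <- cbind(intercept = 1, year = year)

# The textbook formula, (X'X)^{-1} X'Y, written literally
B_hat <- solve(t(X) %*% X) %*% t(X) %*% price

# The same system X'X B = X'Y solved without forming the inverse
B_alt <- solve(crossprod(X), crossprod(X, price))

# R's built-in fit for comparison
fit <- lm(price ~ year)

B_hat
coef(fit)
```

The literal formula is fine for a small example like this one. Because the year values are large relative to their spread, $\mathbf{X'X}$ is close to singular here, so for real work it is probably safer to solve the system directly (as in `B_alt`) or to just let `lm()` handle it.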

### The Takeaway

Derivations, proofs, and advanced topics in econometrics can be really tricky to wrap your head around at first. Going through the motions of making real world data stand in for the notation can a) help you really understand the proofs and derivations, and b) remind you just how many trillions of human hours computers save in answering complicated technical questions!
