The Ordinary Least Squares estimator, $\hat{\beta}$, is the first thing one learns in econometrics. It has two forms, one in standard algebra and one in matrix algebra, but it's important to remember that the two are equivalent:

$$\hat{\beta} = \frac{\widehat{\text{cov}}(x,y)}{\text{var}(x)} = (X'X)^{-1}X'Y$$
I think most students will find it extremely easy to get lost in notation and miss the link to real world data. The following exercise is a helpful way I've found to keep the connection between traditional 'simple' notation, matrix algebra notation, and the underlying data and arithmetic that go into the ordinary linear regression estimator.
Deriving the Algebraic Notation for the Simple Bivariate Model
The familiar simple bivariate model expresses each observation as a function of an intercept, a slope coefficient times the regressor, and an error term (respectively):

$$y_i = b_0 + b_1 x_i + e_i$$
Where we wish to minimize the sum of squared errors (SSE):
$$\text{minimize: } SSE = \sum_{i=1}^{N} e_i^2$$
To do so we isolate the error of the regression to make it a function of the other terms:
$$e_i = y_i - b_0 - b_1 x_i$$
Then substitute:
$$\text{minimize: } \sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2$$
For our purposes, we'll ignore the derivation of the intercept and take it as given that $b_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, and just solve for the slope coefficient $\hat{\beta}_1$. To minimize the errors, we take the partial derivative with respect to $b_1$:
$$\frac{\partial SSE}{\partial b_1} = \frac{\partial}{\partial b_1}\left[\sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2\right]$$
Move the summation operator through, since the derivative of a sum is equal to the sum of the derivatives:
$$\frac{\partial SSE}{\partial b_1} = \sum_{i=1}^{N}\left[\frac{\partial}{\partial b_1} (y_i - b_0 - b_1 x_i)^2\right]$$
Take the derivative (using the chain rule), then set it equal to 0 as the first order condition for a minimum:
$$\frac{\partial SSE}{\partial b_1} = -2\sum_{i=1}^{N} x_i (y_i - b_0 - b_1 x_i) = 0$$
Then multiply by $-\frac{1}{2}$ to simplify:
$$0 = \sum_{i=1}^{N} x_i (y_i - b_0 - b_1 x_i)$$
Substitute the solution for the intercept, $b_0$, that we took as given above:
$$0 = \sum_{i=1}^{N} x_i \left(y_i - (\bar{y} - \hat{\beta}_1 \bar{x}) - b_1 x_i\right)$$
Then rearrange and distribute the summation operator to solve for $\hat{\beta}_1$:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (y_i - \bar{y}) x_i}{\sum_{i=1}^{N} (x_i - \bar{x}) x_i}$$
Which is algebraically equivalent to:
$$\frac{\widehat{\text{cov}}(x,y)}{\text{var}(x)} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \hat{\beta}_1$$
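The cov/var form is simple enough to verify mechanically before moving on. A minimal Python sketch on made-up toy numbers (the variable names here are mine, not part of the derivation):

```python
# Slope estimate as cov(x, y) / var(x), written out with plain sums.
# Toy data (made up for illustration): y = 1 + 2x exactly, so the slope should be 2.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# The 1/n factors appear in both the numerator and denominator.
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
var_x = sum((xi - x_bar) ** 2 for xi in x) / n

beta1 = cov_xy / var_x
print(beta1)  # 2.0
```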
Deriving the Matrix Algebra Notation
Despite typically not being taught until the senior undergraduate or graduate level, the derivation in matrix notation is actually a little more straightforward, as long as one remembers the rules of matrix algebra (which I typically do not).
First, visualize the linear model again, but this time in matrix notation where Y and e are vectors of the observations and X is the matrix of the independent variables and their observations:
$$Y = XB + e$$
Just as before, we want to minimize the sum of squared errors:

$$\text{minimize: } SSE = e'e$$
Rearranging and substituting yields:
$$SSE = (Y - XB)'(Y - XB)$$
Push the transpose operator through:
$$SSE = (Y' - B'X')(Y - XB)$$
Multiply the whole equation out:
$$SSE = Y'Y - Y'XB - B'X'Y + B'X'XB$$
Simplify the two equivalent terms in the middle: $B'X'Y$ is a $1 \times 1$ scalar, so it equals its own transpose $Y'XB$, and the two terms combine:
$$SSE = Y'Y - 2Y'XB + B'X'XB$$
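That middle-term simplification is easy to sanity-check numerically: both $Y'XB$ and $B'X'Y$ come out as the same $1 \times 1$ value. A quick Python check on a small made-up example (the helper names are mine):

```python
# Check that Y'XB and B'X'Y agree: both are 1x1, and a scalar equals its transpose.
# X is n x k, Y is n x 1, B is k x 1; all values made up for illustration.

def matmul(A, B):
    """Naive matrix multiply for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

X = [[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]]
Y = [[4.0], [7.0], [11.0]]
B = [[0.5], [2.0]]

ytxb = matmul(matmul(transpose(Y), X), B)[0][0]   # Y'XB
btxty = matmul(matmul(transpose(B), transpose(X)), Y)[0][0]  # B'X'Y
print(ytxb, btxty)  # 179.0 179.0
```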
Then again as before we'll take the partial derivative for the first order condition:
$$\frac{\partial SSE}{\partial B} = \frac{\partial}{\partial B}\left(Y'Y - 2Y'XB + B'X'XB\right)$$
And set to 0 to find the minimum:
$$\frac{\partial SSE}{\partial B} = -2X'Y + 2X'XB = 0$$
Then isolate $B$ and simplify:
$$B = (X'X)^{-1}X'Y$$
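For the bivariate case (an intercept column plus one regressor), the closed form can be written out without any matrix library at all, using the explicit $2 \times 2$ inverse. A sketch in Python (the function and variable names are mine):

```python
def ols_bivariate(x, y):
    """Solve B = (X'X)^-1 X'Y for X = [1, x], using the explicit 2x2 inverse."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]]  and  X'Y = [sy, sxy]
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det   # intercept
    b1 = (n * sxy - sx * sy) / det     # slope
    return b0, b1

print(ols_bivariate([1, 2, 3, 4], [3, 5, 7, 9]))  # (1.0, 2.0)
```

On the toy series from before (y = 1 + 2x exactly), this recovers the intercept 1 and slope 2.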
Getting Real World Data
Pulling data from autotrader.com for Honda CR-Vs, I came up with 9 observations across three simple variables: Price, Year, and Mileage.
| PRICE | YEAR | MILEAGE |
| --- | --- | --- |
| $19998 | 2012 | 16568 |
| $16995 | 2011 | 68399 |
| $22491 | 2013 | 23813 |
| $27571 | 2014 | 15156 |
| $25998 | 2014 | 17201 |
| $24000 | 2012 | 28946 |
| $15495 | 2010 | 87440 |
| $13290 | 2007 | 83060 |
| $8449 | 2006 | 153549 |
Using car price as a function of car year in a simple bivariate model, one can find the OLS slope coefficient in R with a few lines:
options("scipen" = 100, "digits" = 4)
price <- c(19998, 16995, 22491, 27571, 25998, 24000, 15495, 13290, 8449)
mileage <- c(16568, 68399, 23813, 15156, 17201, 28946, 87440, 83060, 153549)
year <- c(2012, 2011, 2013, 2014, 2014, 2012, 2010, 2007, 2006)
crv <- data.frame(mileage, price, year)
lm(price ~ year, data = crv)
Which yields the basic result that the asking price of the car goes down by $2103 for each year it gets older.
The Bivariate Example Using Simple Algebra and Arithmetic
This is the important part. It's tedious, but straightforward, and writing it out all by hand will really remind you that computers are remarkable tools.
The derivation results from Part I tell us:
$$\frac{\widehat{\text{cov}}(x,y)}{\text{var}(x)} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \hat{\beta}_1$$
In this case $\bar{x}$ is the sample mean for year, which is 2011, and $\bar{y}$ is the sample mean for price, which is roughly $19365.
Expanding the estimator out to its completely tangible form (the $\frac{1}{n}$ factors in the numerator and denominator cancel) yields the ridiculously cumbersome equation:

$$\hat{\beta}_1 = \frac{(2012-2011)(19998-19365) + (2011-2011)(16995-19365) + (2013-2011)(22491-19365) + (2014-2011)(27571-19365) + (2014-2011)(25998-19365) + (2012-2011)(24000-19365) + (2010-2011)(15495-19365) + (2007-2011)(13290-19365) + (2006-2011)(8449-19365)}{(2012-2011)^2 + (2011-2011)^2 + (2013-2011)^2 + (2014-2011)^2 + (2014-2011)^2 + (2012-2011)^2 + (2010-2011)^2 + (2007-2011)^2 + (2006-2011)^2}$$
Which simplifies to:

$$\frac{138787}{66} = \$2102.83$$
Almost exactly what R told us the coefficient would be!
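Those two sums are easy to verify mechanically. A short Python sketch of the same arithmetic, using the CR-V data from the table (the variable names are mine):

```python
# Reproduce the hand-expanded sums term by term for the CR-V data.
price = [19998, 16995, 22491, 27571, 25998, 24000, 15495, 13290, 8449]
year = [2012, 2011, 2013, 2014, 2014, 2012, 2010, 2007, 2006]

x_bar = 2011            # sample mean of year
y_bar = sum(price) / 9  # sample mean of price, ~19365.22

numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(year, price))
denominator = sum((x - x_bar) ** 2 for x in year)
print(round(numerator), denominator, round(numerator / denominator, 2))
```

The numerator comes out to 138787 and the denominator to 66, matching the hand calculation, and their ratio rounds to 2102.83.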
The Bivariate Example Using Matrix Algebra
Things get a little trickier here, but the process is the same. The bivariate estimator is:
$$\hat{B} = (X'X)^{-1}X'Y$$
Substituting our real world data for the general matrices leaves the form:

$$\left(\begin{bmatrix} 1 & 2012 \\ 1 & 2011 \\ 1 & 2013 \\ 1 & 2014 \\ 1 & 2014 \\ 1 & 2012 \\ 1 & 2010 \\ 1 & 2007 \\ 1 & 2006 \end{bmatrix}' \begin{bmatrix} 1 & 2012 \\ 1 & 2011 \\ 1 & 2013 \\ 1 & 2014 \\ 1 & 2014 \\ 1 & 2012 \\ 1 & 2010 \\ 1 & 2007 \\ 1 & 2006 \end{bmatrix}\right)^{-1} \begin{bmatrix} 1 & 2012 \\ 1 & 2011 \\ 1 & 2013 \\ 1 & 2014 \\ 1 & 2014 \\ 1 & 2012 \\ 1 & 2010 \\ 1 & 2007 \\ 1 & 2006 \end{bmatrix}' \begin{bmatrix} 19998 \\ 16995 \\ 22491 \\ 27571 \\ 25998 \\ 24000 \\ 15495 \\ 13290 \\ 8449 \end{bmatrix} = \hat{B}$$
Which simplifies from four matrices down to two:

$$\begin{bmatrix} 61274.67171717 & -30.469696969696 \\ -30.469696969696 & 0.015151515151515 \end{bmatrix} \begin{bmatrix} 174287 \\ 350629944 \end{bmatrix}$$
Which multiplies through to:

$$\begin{bmatrix} -4209433 \\ 2102.83 \end{bmatrix}$$
The first row is the intercept in our regression model; the second is our estimated slope coefficient, matching both the R output and the expanded summation worked out by hand!
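The matrix arithmetic can likewise be checked without a linear algebra library, using the explicit $2 \times 2$ inverse. A pure-Python sketch (the helper names are mine):

```python
# B = (X'X)^-1 X'Y computed explicitly for the CR-V data.
price = [19998, 16995, 22491, 27571, 25998, 24000, 15495, 13290, 8449]
year = [2012, 2011, 2013, 2014, 2014, 2012, 2010, 2007, 2006]

X = [[1, x] for x in year]  # design matrix: intercept column plus year

# X'X (2x2) and X'Y (2x1), written out elementwise.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
xty = [sum(r[i] * y for r, y in zip(X, price)) for i in range(2)]

# Explicit 2x2 inverse: (1/det) * [[d, -b], [-c, a]].
(a, b), (c, d) = xtx
det = a * d - b * c
inv = [[d / det, -b / det], [-c / det, a / det]]

B = [inv[i][0] * xty[0] + inv[i][1] * xty[1] for i in range(2)]
print(round(B[0], 2), round(B[1], 2))  # -4209432.61 2102.83
```

The first entry is the intercept and the second is the slope, agreeing with the matrix multiplication above.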
The Takeaway
Derivations, proofs, and advanced topics in econometrics can be really tricky to wrap your head around at first. Going through the motions of making real world data stand in for the notation can a) help you really understand the proofs and derivations, and b) remind you just how many trillions of human hours computers save in answering complicated technical questions!