Wed 01 March 2017

Distributional Parameters

Posted by Mischa Fisher in Econometrics   

Part of my thesis involved modeling survival times against parametric distributions, such as the Weibull, log-logistic, and exponential distributions.

One of the fun aspects of distribution theory is seeing how different parameter specifications can make some distributions special forms of others. For today's quick chart, here's a lead-in to the subject: a look at how a few commonly used survival analysis distributions resemble one another (with the parameter specifications highlighted in the R code below).

library(ggplot2)
library(ggthemes)  # for theme_economist()

x_lower <- 0
x_upper <- 10

# Tallest point among the three densities, so the y-axis accommodates all of them;
# the parameters here must match the stat_function() calls below
max_height2 <- max(dexp(x_lower:x_upper, rate = 2),
                   dweibull(x_lower:x_upper, shape = 2),
                   dlogis(x_lower:x_upper, scale = 2))

ggplot(data.frame(x = c(x_lower, x_upper)), aes(x = x)) +
  xlim(x_lower, x_upper) +
  ylim(0, max_height2) +
  stat_function(fun = dexp, args = list(rate = 2), aes(colour = "Exponential")) +
  stat_function(fun = dweibull, args = list(shape = 2), aes(colour = "Weibull")) +
  stat_function(fun = dlogis, args = list(scale = 2), aes(colour = "Logistic")) +
  scale_color_manual("Distribution", values = c("blue", "green", "red")) +
  labs(x = "\n x", y = "f(x) \n",
       title = "Common Survival Analysis Distribution Density Plots \n") +
  theme_economist() +  # apply the base theme first so the tweaks below survive
  theme(plot.title = element_text(hjust = 0.5),
        axis.title.x = element_text(face = "bold", colour = "blue", size = 12),
        axis.title.y = element_text(face = "bold", colour = "blue", size = 12),
        legend.title = element_text(face = "bold", size = 10),
        legend.position = "top")
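One of those "special form" relationships is easy to verify numerically: the exponential is exactly the Weibull with shape = 1. A minimal check (the grid of x values here is an arbitrary choice):

x <- seq(0, 10, by = 0.5)
# Exponential(rate = 1) and Weibull(shape = 1, scale = 1) have identical densities
all.equal(dexp(x, rate = 1), dweibull(x, shape = 1, scale = 1))  # TRUE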

Read more...


Tue 01 March 2016

Distributions and Their Parameters

Posted by Mischa Fisher in Econometrics   

Read more...


Wed 03 February 2016

State Budgets and Populations (a.k.a. Why Illinois is in the shape it's in)

Posted by Mischa Fisher in Economics   

Given Illinois' current budget impasse, I thought it would be interesting to do a five minute analysis looking at how the size of each state's budget varies in proportion to its population. This is obviously a very shallow examination, and one could spend weeks digging through budget numbers, federal transfers, rural-urban splits, poverty and education levels, industry compositions, unfunded pension liabilities, worker's compensation costs, etc. etc. Still, summary statistics exist for a reason, and five minute analyses can be useful exercises.

Pulling data from Wikipedia's maintained list of US State budgets (here) and from the U.S. Census' estimate of 2015 state populations (here) produces the charts below.

All States

First, a look at all U.S. States:
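A sketch of how one might build the chart in R (the file and column names are my assumptions; the actual data came from the Wikipedia and Census sources above):

library(ggplot2)

budgets <- read.csv("state_budgets.csv")  # assumed columns: state, population, budget_usd
ggplot(budgets, aes(x = population, y = budget_usd, label = state)) +
  geom_point() +
  geom_text(vjust = -0.5, size = 3) +  # label each state so outliers stand out
  geom_smooth(method = "lm") +         # overall trend line
  labs(x = "2015 Population", y = "State Budget ($)")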

I knew Illinois' budget had historically been unsustainable, but I was surprised at just how much of an outlier the state is. Mercatus now has Illinois rated dead last in the country in terms of fiscal solvency (link here), and with this chart one can see why!

Just the Large States

Comparing large states to large states, here are all the states with populations above 6 million people.

Just the Small States

And in the same spirit, the small states:

Here it's worth noting that Illinois does not have the largest per capita budget; that honor goes to Alaska. Illinois is simply the largest deviator from the overall trend line in absolute dollar terms. That being said, since state budgets include federal transfer dollars for federal programs (infrastructure, heating assistance, etc.), it's not hard to see why Alaska, which has very few people but a lot of federally supported infrastructure, has the highest per capita budget.

UPDATE:

What was intended as a five minute "hey that's interesting!" analysis ended up exploding on the internet. With over 100K page views in 12 hours, the response was certainly unexpected. On that note, a few people on Reddit mentioned they'd be interested in seeing the log of the data. So here it is:
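For anyone following along with the earlier sketch, the log version only requires swapping in log scales (again using the hypothetical budgets data frame from above):

# Same scatter on log-log axes
ggplot(budgets, aes(x = population, y = budget_usd, label = state)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10() + scale_y_log10() +
  labs(x = "Population (log scale)", y = "State Budget ($, log scale)")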

More importantly, people should remember that Wikipedia data is not always particularly accurate, nor is it necessarily an apples-to-apples comparison. Some of the data listed on the page covers single-year periods; other bits cover multi-year periods. The script I used to plot takes those things into consideration, but it may have errors given the inconsistency in Wikipedia's data. Time allowing, I'll source better data and replot at some point in the future.

UPDATE 2:

NASBO has a dataset on state spending; I took some time to manually transcribe the state and spending columns as vectors in R (so there may be transcription errors), and the log of the data produces a result very different from Wikipedia's data:

So the story of Illinois' terrible fiscal condition could very well be more complicated than can be captured in a single graph. As with most things in life, the issue is ...

Read more...


Sat 23 January 2016

Renting vs. Owning in Chicago

Posted by Mischa Fisher in Economics   

Moving to Chicago this past weekend prompted the age-old question: should one rent or should one buy?

Independent of the qualitative and lifestyle differences between the two choices, I was curious how, strictly speaking, the finances of the two options worked out. Using the Case-Shiller Price Index for condos in the Chicago metro and the historical return rate on real estate prices, worked into a short R function (a sketch of such a function appears after the lists below), produced the results shown here.

Worth noting, the calculation included:

  • The opportunity cost of capital
  • Historical returns on real estate prices
  • For Owning: Mortgage, property taxes, closing costs, Home Owners Association fees, cost of ownership
  • For Renting: Rent, Utilities

However, it did not include:

  • Mortgage interest deduction
  • Non-linear mortgage amortization (the function is linear, rather than skewing the amortization toward the latter years as would actually happen on a traditional schedule)
  • Down payment flexibility (a 20% down payment is assumed, rather than other available options such as FHA loans, which allow as little as 3.5%)
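To make the structure concrete, here is a minimal sketch of such a function; every default value below is an illustrative placeholder, not the post's actual inputs:

# Cumulative cost of owning vs. renting after a given number of years.
# All defaults are placeholder assumptions for illustration.
rent_vs_buy <- function(years,
                        price = 300000,       # purchase price
                        down = 0.20,          # down payment fraction
                        rate = 0.04,          # mortgage interest rate
                        tax = 0.02,           # annual property tax rate
                        hoa = 3600,           # annual HOA fees and upkeep
                        closing = 0.05,       # closing costs as a fraction of price
                        rent = 18000,         # annual rent plus utilities
                        appreciation = 0.03,  # annual home price growth
                        opp = 0.05) {         # opportunity cost of capital
  own <- closing * price +                                     # one-time hit at purchase
    years * (price * (1 - down) * rate + tax * price + hoa) +  # carrying costs (linear amortization)
    years * down * price * opp -                               # foregone returns on the down payment
    price * ((1 + appreciation)^years - 1)                     # offset by appreciation
  c(own = own, rent = years * rent)
}

rent_vs_buy(years = 5)  # compare the two totals at the five-year mark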

The Result:

Given these basic assumptions on relative costs, the result seems to confirm the somewhat common folk wisdom that it takes about 4-5 years for the initial hit of the closing costs to be paid off by saved expenses and amortization of the mortgage loan.

Read more...


Tue 08 December 2015

How Much Does Rural Living Predict Broadband Speed?

Posted by Mischa Fisher in Economics   

I was recently brushing up on the current status of the telecommunications industry in the United States, and I became curious about how much a state's rural population predicted its overall average internet connection speed levels.

Pulling data on average speeds from (here), which sources Akamai's State of the Internet report, and data on the rural/urban population split by state from Iowa State University's site (here) reveals the plot below:
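A sketch of the approach in R (file and column names are hypothetical; the real data came from the Akamai and Iowa State sources above):

library(ggplot2)

speeds <- read.csv("state_speeds.csv")  # assumed columns: state, avg_mbps
rural  <- read.csv("state_rural.csv")   # assumed columns: state, pct_rural
df <- merge(speeds, rural, by = "state")
ggplot(df, aes(x = pct_rural, y = avg_mbps)) +
  geom_point() +
  geom_smooth(method = "lm") +  # the linear fit whose slope is discussed below
  labs(x = "Rural Population (%)", y = "Average Connection Speed (Mbps)")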

While noticeable, the effect here is pretty minor overall: it would take a shift of about 20% of a state's population into an urban environment to raise average speeds by a single Mbps.

How About Globally?

Curious about how well this predicts things generally, I did the same thing looking at global data. Wikipedia has a concise list of countries by internet connection speeds (here), and the World Bank maintains a time series list of urban/rural population (here).

Pulling those two datasets together reveals the following:

Visually this looks similar, although the linear regression slope is a little steeper: it would take a shift of about 10% of an average country's population to raise the average speed by a single Mbps.

Conclusion

At the end of the day, this exercise is likely too simple to shine much light on the phenomenon. Economies of scale in population density would, all else equal, suggest that as density goes up, so does speed. But the 'average' speed by state hides interesting variation at the county and city level, which I imagine would more clearly show swings based not just on population density but also on regulatory factors, such as the ease of installation and market entry, access to public conduit and utility maps, and so on. Looking at the data more locally would probably be more meaningful than examining aggregate state or country data, because the 'average speed' metric likely smooths over regional variation and therefore underestimates the effect.

Read more...


Thu 26 November 2015

A Brief Exercise Illustrating the Central Limit Theorem

Posted by Mischa Fisher in Econometrics   

Succinctly, the Central Limit Theorem can be expressed as:

In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.

If you're an econometrics student, the first time you're exposed to the CLT is likely when discussing the OLS estimator in the context of testing hypotheses about the true parameters and forming confidence intervals: when relying on asymptotics, normality of the error distribution is not required, because as long as the five standard Gauss-Markov assumptions are satisfied, the distribution of the OLS estimator converges to a normal distribution as n goes to infinity.

The CLT is pretty neat but given short shrift in the context of econometrics, so here's a brief experiment one can perform in R to illustrate what happens as the theorem comes into effect.

Starting with the Weibull distribution:

# Draw 10,000 values from a Weibull(shape = 1) and plot them in sorted order
plot(sort(rweibull(10000, shape = 1)), main = "The Weibull Distribution")

We then repeatedly draw samples of 100 from the distribution, take each sample's mean, and draw histograms of the sample means as the number of replications increases.

# Each replicate() call returns a 100-row matrix with one column per replication,
# so colMeans() produces one sample mean per replication
hist(colMeans(replicate(30, rweibull(100, shape = 1))), breaks = "Scott", xlab = "Sample Means", main = "Histogram for 30 Replications")
hist(colMeans(replicate(300, rweibull(100, shape = 1))), breaks = "Scott", xlab = "Sample Means", main = "Histogram for 300 Replications")
hist(colMeans(replicate(3000, rweibull(100, shape = 1))), breaks = "Scott", xlab = "Sample Means", main = "Histogram for 3,000 Replications")
hist(colMeans(replicate(30000, rweibull(100, shape = 1))), breaks = "Scott", xlab = "Sample Means", main = "Histogram for 30,000 Replications")
hist(colMeans(replicate(300000, rweibull(100, shape = 1))), breaks = "Scott", xlab = "Sample Means", main = "Histogram for 300,000 Replications")
hist(colMeans(replicate(3000000, rweibull(100, shape = 1))), breaks = "Scott", xlab = "Sample Means", main = "Histogram for 3,000,000 Replications")

Revealing a very wonderful .gif:

As the replication size increases, the histogram begins to resemble a normal distribution. Neat!
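As an extra check, one can overlay the normal density the CLT predicts. A Weibull with shape = 1 (and the default scale = 1) is a standard exponential, so the means of samples of 100 should approach a normal with mean 1 and standard deviation 1/sqrt(100):

means <- colMeans(replicate(30000, rweibull(100, shape = 1)))
hist(means, breaks = "Scott", freq = FALSE,
     xlab = "Sample Means", main = "Sample Means vs. Theoretical Normal")
curve(dnorm(x, mean = 1, sd = 1 / sqrt(100)), add = TRUE, col = "red")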

UPDATE:

A friend writes:

You should emphasize more the process that you are taking the MEAN of sub samples of your 'population' which is Weibull distributed. Then, by creating a vector of these means, one is able to show that these "means" converge to a normal distribution as N approaches infinity. As it is, to the 'less' experienced reader perhaps will fail to realize that you take the means of sub samples of that pop'n, and then it is the means which become normally distributed.

Good point; thanks Keith!

Read more...


Wed 04 November 2015

Donald Trump's Campaign Contributions

Posted by Mischa Fisher in Current Events   

Donald Trump recently passed the 100-day mark as the Republican front-runner for the Presidential nomination, which means I was wrong (as many friends have reminded me) about him being a short-term fad.

To scratch a curious itch, I decided to pull The Donald's campaign contributions from the FEC's personal disclosure database; that is, where The Donald has donated his money over the years. Since I was wrong about how long he'd remain at the front of the Republican primary field, I thought I'd see whether I was also wrong in my suspicion that his donations were skewed toward whichever party was more popular at any given moment over the last 20 years. To the data!

Sum of all Donations, by Party

This is the first breakdown: total contributions over the last 20 or so years, split among donations to Republicans, Democrats, and himself. Unfortunately, it has no time dimension, which matters because I suspected his donation volume had increased substantially in the most recent year.

Sum of All Donations by Party, per year

This breaks down the total value of all his donations by party, per year. While it shows the recent uptick in spending, it suffers from a form of political inflation: donation values, for normative and legal reasons, have increased substantially across the board in the last few cycles. So, to correct for that, I also looked at the total number of donations made...

Total Number of Donations by Party, per year

This better captures how far and wide The Donald has spread his political largesse over the past 20 years by looking at the total number of donations made to each party (and to himself).

Total Number of Donations by Party, per Election Cycle

Finally, to remove the cyclical nature of donations, I grouped the years into election cycles, since off-year donations tend to be lower than on-year donations.
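A sketch of that bucketing in R (the data frame here is a made-up placeholder, not the actual FEC data):

donations <- data.frame(year  = c(2005, 2006, 2007, 2008),
                        party = c("R", "D", "D", "R"),
                        count = c(3, 5, 2, 4))
donations$cycle <- 2 * ceiling(donations$year / 2)  # odd years roll into the next even-year cycle
aggregate(count ~ cycle + party, data = donations, FUN = sum)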

This, to me, is the clearest story about his donations. There is clearly a shift in the nature of his partisan giving, and that shift comes pretty close to when President Obama's popularity started to decline. And the spike in donations to Democrats came at pretty much the same time as the big drop in President Bush's popularity.

One could interpret these results in two ways. First, Donald Trump is an opportunist who simply changed his giving to reflect where he viewed an opportunity to run (specifically, look at the big change between the 2006 election cycle and the current one). Alternatively, one could think he's a dedicated partisan who stopped his political giving because he grew sick of holding the middle ground. I think the second scenario is unlikely, but then again, I've been wrong before.

Read more...


Mon 19 October 2015

An Intuitive Explanation of the OLS Estimator for both Traditional and Matrix Algebra

Posted by Mischa Fisher in Econometrics   

The Ordinary Least Squares estimator, \( \hat{\beta} \), is the first thing one learns in econometrics. It has two forms, one in standard algebra and one in matrix algebra, but it's important to remember that the two are equivalent:

$$ \hat{\beta} = \frac{\hat{cov}(x,y)}{var(x)} = \mathbf{({X}'X)^{-1}{X}'Y} $$

I think most students find it extremely easy to get lost in the notation and miss the link to real-world data. The following exercise is a helpful way I found to keep sight of the link between traditional 'simple' notation, matrix algebra notation, and the underlying data and arithmetic that go into the ordinary linear regression estimator.
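Before the derivation, a quick numerical sanity check of the equivalence (my own illustration, using simulated data):

set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

cov(x, y) / var(x)                 # 'simple' algebra form: the slope
X <- cbind(1, x)                   # design matrix with an intercept column
solve(t(X) %*% X) %*% t(X) %*% y   # matrix form: intercept and slope
coef(lm(y ~ x))                    # R's built-in estimator agrees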

Deriving the Algebraic Notation for the Simple Bivariate Model

The familiar simple bivariate model expresses each observation as a function of an intercept, a regression coefficient, and an error term (respectively):

$$ y_{i} = b_{0} + b_{1}x_{i} + e_{i} $$

Where we wish to minimize the sum of squared errors (SSE):

$$ minimize: SSE = \sum_{i=1}^{N} e_{i}^{2} $$

To do so we isolate the error of the regression to make it a function of the other terms:

$$ e_{i} = y_{i} - b_{0} - b_{1}x_{i} $$

Then substitute:

$$ minimize: \sum_{i=1}^{N} (y_{i} - b_{0} - b_{1}x_{i})^{2} $$

For our purposes, we'll ignore the derivation of the intercept and take it as a given that it is \( \bar{y} - \hat{\beta_{1}}\bar{x} \), and just solve for the slope coefficient \( \hat{\beta_{1}} \). To minimize the errors, we need to take the partial derivative with respect to \( b_{1} \):

$$ \frac{\partial SSE }{\partial b_{1}} = \frac{\partial }{\partial b_{1}} \left [ \sum_{i=1}^{N} (y_{i} - b_{0} - b_{1}x_{i})^{2} \right ] $$

Move the summation operator through, since the derivative of a sum is equal to the sum of the derivatives:

$$ \frac{\partial SSE }{\partial b_{1}} = \sum_{i=1}^{N} \left [ \frac{\partial }{\partial b_{1}} (y_{i} - b_{0} - b_{1}x_{i})^{2} \right ] $$

Take the derivative (using the chain rule), then set it equal to 0 as the first order condition to find the min/max:

$$ \frac{\partial SSE }{\partial b_{1}} = -2 \sum_{i=1}^{N} x_{i}(y_{i} - b_{0} - b_{1}x_{i}) = 0 $$

Then multiply by \( - \frac{1}{2} \) to simplify:

$$ 0 = \sum_{i=1}^{N} x_{i}(y_{i} - b_{0} - b_{1}x_{i}) $$

Substitute the solution for the intercept, \( b_{0} \), that we took as a given above:

$$ 0 = \sum_{i=1}^{N} x_{i}(y_{i} - (\bar{y} - \hat{\beta_{1}}\bar{x}) - b_{1}x_{i}) $$

Then rearrange and distribute the summation operator to solve for \( \hat{\beta_{1}} \):

$$ \hat{\beta_{1}} = \frac{\sum_{i=1}^{N} (y_{i} - \bar{y} )x_{i}}{ \sum_{i=1}^{N} (x_{i} - \bar{x})x_{i} } $$

Which is algebraically equivalent to:

$$ \frac{\hat{cov}(x,y ...

Read more...


Sun 13 September 2015

Customizing Pelican for Static and Dynamic Content

Posted by Mischa Fisher in Technology   

As is typical with these sorts of things, the online community has been very helpful in sorting out exactly how I wanted to customize this site when I recently rebuilt it using the Python static site generator Pelican.

A brief summary of a few changes I think were extremely helpful in tweaking the stock Pelican build:

Using Math

MathJax seemed to be the simplest solution. Embedding a short snippet of code linking to their CDN, then using simple LaTeX notation within Markdown, takes only a minute or two:

<script src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script>

Using Bootstrap for Responsive CSS

Erik Flowers' Bootstrap grid introduction is a very nice resource (Click Here)

Customizing Your Own Theme

Web developer Robert Iwancz made a great bare-bones framework on which one can build out almost anything. (Click Here for his VoidyNullness Theme)

Using Pelican for a Static Landing Page

Find the .md Markdown file that you want to use as your home page content and add the following metadata to the top of the file (the second line is to prevent your home page from appearing in the menu twice, depending on your theme):

save_as: index.html
status: hidden

Then in a separate .md file that will become your blog content, add the metadata identifier at the top to assign the appropriate HTML template from your templates folder:

Title: (Your Blog Title)
Date: (The Date)
Category: Page
Template: (the title of your blog template, without the .html extension)

Finally, copy the content-generating loops from the default page templates into your new blog HTML template:

{% block content_body %}
{% block article %}
{% if articles %}
  {% for article in (articles_page.object_list if articles_page else articles) %}
<article>
  {% for file in CUSTOM_INDEX_ARTICLE_HEADERS %}
    {% include "includes/" + file %}
  {% else %}
    {% include "includes/article_header.html" %}
  {% endfor %}

  {% if ARTICLE_FULL_FIRST is defined and loop.first and not articles_page.has_previous() %}
    <div class="content-body">
    {% if article.standfirst %}
      <p class="standfirst">{{ article.standfirst|e }}</p>
    {% endif %}
    {{ article.content }}
    {% include "includes/comments.html" %}
  </div>
  {% else %}
    {% include "includes/index_summary.html" %}
  {% endif %}
</article>

<hr />
  {% endfor %}
{% endif %}
{% endblock article %}

{% block pagination %}
<nav class="index-pager">
{% if articles_page and articles_paginator.num_pages > 1 %}
    <ul class="pagination">
    {% if articles_page.has_previous() %}
        <li class="prev">
          <a href="{{ SITEURL }}/{{ articles_previous_page.url }}">
        <i class="fa fa-chevron-circle-left fa-fw fa-lg"></i> Previous
      </a>
    </li>
{% else %}
    <li class="prev disabled"><span>
        <i class="fa fa-chevron-circle-left fa-fw fa-lg"></i> 
        Previous</span>
    </li>
{% endif %}

{% for num in articles_paginator.page_range %}
  {% if num == articles_page.number %}
    <li class="active"> <span>{{ num }}</span> </li>
  {% else %}
    <li>
      <a href="{{ SITEURL }}/{{ articles_paginator.page(num).url }}">{{ num }}</a>
    </li>
  {% endif %}
{% endfor %}

{% if articles_page.has_next() %}
    <li class="next">
      <a href="{{ SITEURL }}/{{ articles_next_page.url }}">
        Next <i class="fa fa-chevron-circle-right fa-fw fa-lg"></i>
      </a>
    </li>
{% else %}
    <li class="next disabled">
      <span><i class="fa fa-chevron-circle-right fa-fw fa-lg"></i> Next</span>
    </li>
{% endif %}
</ul>
{% endif %}
</nav>
{% endblock pagination %}
{% endblock content_body %}

Hiding Pages from the Menu

Quite simple: just change the metadata on any given page's .md file to:

Status: Hidden

Embedding Data from a CSV

To generate my reading list, I ...

Read more...


Sat 12 September 2015

The Pros and Cons of Using a Static Site Generator

Posted by Mischa Fisher in Technology   

In a recent effort to re-gear this site toward the quantitative and away from the strictly artistic, I rebuilt the site from scratch with one singular aim: make posting effortless. In that spirit, I was directed by a good friend toward static site generators; a new development in the 8 or so years since I had last looked at any of the technologies surrounding web development.

With the new site up and running, here is a brief list of the pros and cons, as I see them, of using a static site generator:

Pros:

  1. Effortless posting:

Write posts in Markdown, then upload with a few keystrokes straight from the terminal.

  2. Cheap and scalable hosting:

I'm using Amazon's S3 for hosting, and Route 53 for DNS services. They're almost free in low traffic, and infinitely scalable in high traffic.

  3. No backend to maintain:

PHP, SQL, and the slow load times and unresponsiveness of shared servers on most hosting plans are a thing of the past. (I'm looking at you, GoDaddy.com.)

  4. Easy to back up or migrate:

I have all the website files in a folder on my laptop that is backed up to Dropbox in real time. In addition, version control through something like Git is also handy, particularly when messing around with the underlying Python scripts that generate the site.

Cons:

  1. Steep learning curve:

The list of technologies one has to look at includes: HTML, CSS, Python, JavaScript, Markdown, the terminal, Pelican, FontAwesome, Jinja, Bootstrap, s3cmd, pip, brew, and virtual environments.

  2. Longer to get set up:

With a Squarespace account you can be up and running in minutes, and it will look a lot prettier by default.

For me those were the biggest pros and cons I weighed in setting this site up. (Your mileage may vary.)

Read more...