Jordan Weissmann and “New York rich”

First, a quick formality: I’m blogging about bad/misleading/less-informative-than-they-might-have-been graphs with my friend Catherine Roberts at her new fun internet adventure, Graphchat. There will be a lot of graphs, so you can probably look at it while at work and no one will be able to tell, assuming your work has something to do with graphs.

ANYWAY:

Jordan Weissmann looked at income distribution in the U.S. and New York, targeting the notion of “New York rich.” In his words:

it’s usually a somewhat strange experience when I get into conversations here about class. If I mention that a six-figure salary counts as rich in much of the country—that just $250,000 gets you into the top 2 percent—the response is usually, “Sure, but that’s not New York rich.”

Except, it sort of is.

I won’t quibble here about the claim, but rather about its graphical presentation. Weissmann picked this side-by-side histogram to make his point:

New York City is visibly heavier at very low incomes and very high incomes, but the magnitudes of these differences aren’t exactly illustrative. If the claim is that the distribution of income in NYC is similar to the distribution of income in the U.S. at large, presenting the data this way, with tons of white space between income buckets that aren’t really doing anything, is confusing. And if the claim is specifically about “New York rich,” it doesn’t much matter whether the proportion of households in the $50,000 to $74,999 bucket is similar in the U.S. and NYC; how far out the right tail extends is far more instructive. You can’t figure that out from this graphic.

Things get worse after Weissmann adds a claim that San Francisco is really the rich people city. To make the claim that New York isn’t special in one particular way and San Francisco is, Weissmann threw in this graph comparing San Francisco to… the U.S.:

If you can remember where the red bar is in the previous image, you can do a really quick visual comparison and find out whether New York and San Francisco are alike or different in each income bucket! Lest you think I’m being unfair by throwing two paragraphs in between the two graphs, click the link I opened with and compare the space between the graphs in Weissmann’s article (or here it is again if you don’t like scrolling) with the space between the graphs here.

The issues here are:

  1. To compare NYC and San Francisco, Weissmann showed each of them compared to a third entity.
  2. If income distribution is lumpy within any of these brackets, we don’t get any sense of the underlying shape.

Neither of these needs to be a problem. We can instead visualize the same data (I used the one-year ACS instead of the five-year) in KDE plots, fit everything on one graph, and actually see what San Francisco and NYC look like relative to each other and to the national average:

Here, it’s much more obvious how right-skewed the San Francisco income distribution is, much more obvious how similar the U.S. and NYC income distributions are, and much more obvious just how far the right tail goes. I don’t know what to tell you about the weird squiggle all the way in the right tail.
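For the curious, here’s a minimal sketch of how you’d build a plot like this with scipy’s gaussian_kde. The income arrays are placeholders for the household income samples you’d pull from the ACS for each geography:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# us_income, nyc_income, sf_income: 1-D arrays of household incomes
# (placeholders here; fill them with the real ACS samples)
grid = np.linspace(0, 400000, 500)
for name, incomes in [('U.S.', us_income), ('NYC', nyc_income),
                      ('San Francisco', sf_income)]:
    kde = gaussian_kde(incomes)  # fit a kernel density estimate
    plt.plot(grid, kde(grid), label = name)
plt.xlabel('Household income')
plt.ylabel('Density')
plt.legend()
plt.show()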

As I said above, I have no issues with the claim Weissmann made. He’s correct that the income distributions in the U.S. and NYC are similar and that San Francisco is, if you pick citizens at random, richer than either. His graphical choices just didn’t make that obvious.

As usual, all code/data work (there’s really not a lot) is on GitHub.

Zipf’s Law and IPython Notebooks

Bill Gardner is blowing the reproducible research horn again, which inspired me to actually bother learning how IPython notebooks work. I probably should have started using them a while ago, given the tools available and my physical proximity (and academic irrelevance) to the HAP/Reinhart-Rogoff adventure. Oh well, get ’em next time.

Anyway, the Think Complexity book continues to be a font of interesting exercises and reasons to write Python scripts. The current chapter I’m on starts with something called Zipf’s Law. Zipf’s Law is an empirical observation: in a large corpus of words in a language, if you count word frequencies, you can estimate how often a word appears from its rank alone, along the lines of:

Log(f) = Log(c) – s Log(r)

where f is frequency, r is rank, and s and c are parameters that depend on the language. That’s a simple linear equation, though, so with a large body of words it’s not hard to estimate. I played with Adam Smith’s Wealth of Nations, and it worked okay, which is strange. Here’s what I wound up with (code below):

[Figure: log(Count) vs. log(Rank) for the Wealth of Nations vocabulary, with fitted line]

Code:

Section 1: Zipf’s Law

Problem 5.1

Write a program that reads a text from a file, counts word frequencies, and prints one line for each word, in descending order of frequency. You can test it by downloading an out-of-copyright book in plain text format from Project Gutenberg. You might want to remove punctuation from the words.

Plot the results and check whether they form a straight line. For plotting suggestions, see Section 3.6. Can you estimate the value of s?

Imports:

  • nltk: Natural Language Toolkit. Useful for playing with text, especially large bodies of text.
  • re: Python’s regular expressions library.
  • pprint: pretty print. Makes long lists less ugly.
  • word_tokenize: tokenizes strings.
  • urllib2: because we’re going to need some web data.
In [1]:
import nltk
import re
import pprint
from nltk import word_tokenize
import urllib2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We’re going to need some text, so we’ll grab Adam Smith’s Wealth of Nations from Project Gutenberg.

In [2]:
url = 'http://www.gutenberg.org/cache/epub/3300/pg3300.txt'
response = urllib2.urlopen(url)
raw = response.read().decode('utf8')

There’s a bunch of license info in there, though, so we’ll use find and reverse find (rfind) on some stock Gutenberg labels to figure out where the actual text starts and ends, and then slice the string to between those positions.

In [3]:
start = raw.find('*** START OF THIS PROJECT GUTENBERG EBOOK')
end = raw.rfind('*** END OF THIS PROJECT GUTENBERG EBOOK')
raw = raw[start:end]

Tokenized and lowercased:

In [4]:
tokens = np.array([w.lower() for w in word_tokenize(raw)])
vocab = set(tokens)

So now we have a bunch of lowercased words, and they all came from The Wealth of Nations. The goal, though, is to see whether we can replicate/observe Zipf’s Law:

Log(f) = Log(c) – s Log(r)

where f is the frequency of a word, r is the word’s rank, and c and s are parameters that depend on the language. This is linear in the logs, so it should appear as a straight line when plotted.

For this we need to count the frequencies of each word.

vocab is all of the unique words in the corpus, so we can use a dict comprehension with the words in vocab as keys and, for each word, the number of times it appears in tokens as the value (this takes some time, since it scans the whole token array once per unique word). Then, just to inspect, we can print the first few items and their counts.

In [5]:
counts = {v: len(tokens[tokens == v]) for v in vocab}  # frequency of each unique word
In [6]:
print [{key: val} for key, val in counts.items()[:10]]
[{u'writings': 1}, {u'delusions.': 1}, {u'portugal.': 9}, {u'foul': 1}, {u'four': 188}, {u'prices': 70}, {u'woods': 6}, {u'thirst': 1}, {u'ocean.': 1}, {u'preface': 2}]
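As an aside, collections.Counter from the standard library builds the same mapping in a single pass over tokens, instead of rescanning the whole array once per unique word; a drop-in sketch:

from collections import Counter

counts = Counter(tokens.tolist())  # one pass: word -> frequency

Counter is a dict subclass, so everything below works unchanged.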

We then need to grab ranks of each word. This is easier in a pandas DataFrame than in a dict.

In [7]:
df = pd.DataFrame(index = counts.keys(),
                  columns = ['Count','Rank'],
                  dtype = 'float64')

for a in df.index:
    df.loc[a,'Count'] = counts[a]
df['Rank'] = df['Count'].rank(method = 'min', ascending = False)
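To make the tie handling concrete, here’s what rank(method = 'min', ascending = False) does on a toy series: tied values all receive the lowest rank in their group.

print pd.Series([10, 7, 7, 2]).rank(method = 'min', ascending = False)
# 10 gets rank 1, the tied 7s both get rank 2, and 2 gets rank 4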

Then, finally, we estimate s and plot the results.

In [8]:
x = np.log(df['Rank'])
y = np.log(df['Count'])

fit = np.polyfit(x, y, deg = 1)
fitted = fit[0] * x + fit[1]
print 'Estimate of s is:\n{0}'.format(-fit[0])  # the fitted slope is -s
Estimate of s is:
1.57041819147
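np.polyfit only hands back point estimates; if you also wanted a standard error on the slope, scipy.stats.linregress is a near drop-in alternative (same x and y as above):

from scipy import stats

slope, intercept, r, p, stderr = stats.linregress(x, y)
print 'slope: {0} (std err: {1})'.format(slope, stderr)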

In [9]:
fig = plt.figure(figsize = (4,4), facecolor = 'w', edgecolor = 'w')
ax = plt.subplot(111)

ax.plot(x, y, 'bo', alpha = 0.5)
ax.plot(x,fitted,'r')

ax.set_title('Log(Rank) vs. Log(Count) Words,\nAdam Smith\'s Wealth of Nations')
ax.set_xlabel('Log(Rank)')
ax.set_ylabel('Log(Count)')

ax.set_xlim(left = max([min(np.log(df['Rank'])) * 0.95,0]))

plt.tight_layout()
plt.show()