Zipf’s Law and IPython Notebooks

Bill Gardner is blowing the reproducible research horn again, which finally inspired me to sit down and learn how IPython notebooks work. I probably should have started using them a while ago, given the tools available and my physical proximity (and academic irrelevance) to the HAP/Reinhart-Rogoff adventure. Oh well, get ’em next time.

Anyway, the Think Complexity book continues to be a font of interesting exercises and reasons to write Python scripts. The chapter I’m on now starts with Zipf’s Law, an empirical observation that, given a large corpus of words in a language and their frequencies, you can estimate how often a word appears from its rank alone, along the lines of:

Log(f) = Log(c) – s Log(r)

where f is frequency, r is rank, and s and c are parameters that depend on the language. That’s a simple linear equation in the logs, though, so with a large body of words it’s not hard to estimate. I played with Adam Smith’s Wealth of Nations, and it worked okay, which still strikes me as strange for such a simple law. Here’s what I wound up with (code below):

[Figure: Log(Rank) vs. Log(Count) of words in Adam Smith’s Wealth of Nations, with the fitted line]

Code:

Section 1: Zipf’s Law

Problem 5.1

Write a program that reads a text from a file, counts word frequencies, and prints one line for each word, in descending order of frequency. You can test it by downloading an out-of-copyright book in plain text format from Project Gutenberg. You might want to remove punctuation from the words.

Plot the results and check whether they form a straight line. For plotting suggestions, see Section 3.6. Can you estimate the value of s?

Imports:

  • nltk: Natural Language Toolkit. Useful for playing with text, especially large bodies of text.
  • re: Python’s regular expression library
  • pprint: pretty print. Makes long lists less ugly.
  • word_tokenize: nltk’s tokenizer; splits a string into words and punctuation
  • urllib2: because we’re going to need some web data
In [1]:
import nltk
import re
import pprint
from nltk import word_tokenize
import urllib2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We’re going to need some text, so we’ll grab Adam Smith’s Wealth of Nations from Project Gutenberg.

In [2]:
url = 'http://www.gutenberg.org/cache/epub/3300/pg3300.txt'
response = urllib2.urlopen(url)
raw = response.read().decode('utf8')

There’s a bunch of Project Gutenberg license boilerplate in there, though, so we’ll use find and rfind on the stock Gutenberg START/END labels to figure out where the actual text begins and ends, and then slice the string to between those positions.

In [3]:
start = raw.find('*** START OF THIS PROJECT GUTENBERG EBOOK')
end = raw.rfind('*** END OF THIS PROJECT GUTENBERG EBOOK')
raw = raw[start:end]
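One small guard that’s worth adding (a sketch, not part of the original run): find and rfind return -1 when a marker is missing, and the slice would then silently grab the wrong span.

# find/rfind return -1 on a miss, which would make the slice above
# silently wrong, so check that both markers were actually found:
assert start != -1 and end != -1, 'Gutenberg START/END markers not found'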

Tokenize the text and lowercase every word:

In [4]:
tokens = np.array([w.lower() for w in word_tokenize(raw)])
vocab = set(tokens)
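The exercise’s note about removing punctuation matters here: as the counts below show, word_tokenize can leave trailing punctuation attached, so u'portugal.' and u'portugal' count as different words. One way to strip them, using the re module imported earlier, is to keep only purely alphabetic tokens. This is just a sketch and isn’t applied in the run below:

# keep only purely alphabetic tokens; drops punctuation and numbers
# (not applied below, so dotted tokens like u'portugal.' still appear)
tokens_clean = np.array([w for w in tokens if re.match(r'^[a-z]+$', w)])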

So now we have every lowercased token from The Wealth of Nations, along with the set of unique words. The goal, though, is to see whether we can replicate/observe Zipf’s Law:

Log(f) = Log(c) – s Log(r)

where f is the frequency of a word, r is the word’s rank, and c and s are parameters that depend on the language. This is a linear form in the logs, so the points should fall near a straight line when plotted.
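Just to make the form concrete, here’s a toy calculation with made-up parameters (c = 1000 and s = 1 are purely illustrative, not fitted): the law predicts the rank-10 word shows up about a tenth as often as the rank-1 word.

# toy illustration only -- c and s are invented, not estimated
c, s = 1000.0, 1.0
for r in [1, 10, 100]:
    # Log(f) = Log(c) - s Log(r) is the same as f = c / r**s
    print 'rank {0}: predicted frequency {1:.0f}'.format(r, c / r ** s)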

For this we need to count the frequencies of each word.

The set vocab holds all of the unique words in the corpus, so we can use a dict comprehension with the words in vocab as keys and, for each key, the number of matching entries in the tokens array as the value (this takes some time). Then, just to inspect, we can print the first few items and their counts.

In [5]:
counts = {v: len(tokens[tokens == v]) for v in vocab}
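The comprehension above rescans the whole tokens array once per unique word, which is why it takes a while. A faster one-pass alternative with the same dict-like result (a sketch, using only the standard library) is collections.Counter:

from collections import Counter
counts = dict(Counter(tokens))  # single pass over tokens

Either way, we can inspect a few entries: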
In [6]:
print [{key: val} for key, val in counts.items()[:10]]
[{u'writings': 1}, {u'delusions.': 1}, {u'portugal.': 9}, {u'foul': 1}, {u'four': 188}, {u'prices': 70}, {u'woods': 6}, {u'thirst': 1}, {u'ocean.': 1}, {u'preface': 2}]

We then need the rank of each word. This is easier in a pandas DataFrame than in a dict; rank(method = 'min', ascending = False) gives the most frequent word rank 1 and lets tied counts share the lowest rank in the tie.

In [7]:
df = pd.DataFrame(index = counts.keys(),
                  columns = ['Count','Rank'],
                  dtype = 'float64')

for a in df.index:
    df.loc[a,'Count'] = counts[a]
df['Rank'] = df['Count'].rank(method = 'min', ascending = False)
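For what it’s worth, pandas can build the same frame straight from the counts dict, skipping the explicit loop; a sketch that should produce identical Count and Rank columns:

df = pd.DataFrame({'Count': pd.Series(counts, dtype = 'float64')})
df['Rank'] = df['Count'].rank(method = 'min', ascending = False)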

Then finally we estimate s with a linear fit and plot everything. Note that the slope of Log(Count) on Log(Rank) is –s, so the fitted slope should come out negative.

In [8]:
x = np.log(df['Rank'])
y = np.log(df['Count'])

fit = np.polyfit(x, y, deg = 1)
fitted = fit[0] * x + fit[1]
print 'Estimated slope (-s) is:\n{0}'.format(fit[0])
Estimated slope (-s) is:
-1.57041819147
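The exercise also asks whether the points actually form a straight line. If scipy is available (an assumption; it isn’t imported above), stats.linregress returns the correlation coefficient alongside the slope, so R² makes a quick check:

from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print 'slope: {0}, R^2: {1}'.format(slope, r_value ** 2)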

In [9]:
fig = plt.figure(figsize = (4,4), facecolor = 'w', edgecolor = 'w')
ax = plt.subplot(111)

ax.plot(x, y, 'bo', alpha = 0.5)
ax.plot(x,fitted,'r')

ax.set_title('Log(Rank) vs. Log(Count) of Words,\nAdam Smith\'s Wealth of Nations')
ax.set_xlabel('Log(Rank)')
ax.set_ylabel('Log(Count)')

# keep the left edge at 0 (the log of rank 1)
ax.set_xlim(left = max([min(np.log(df['Rank'])) * 0.95, 0]))

plt.tight_layout()
plt.show()
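To keep the figure around (e.g., for the image at the top of this post), matplotlib can write it to disk; call this before plt.show(), and note the filename here is just a placeholder:

fig.savefig('zipf_fig.png', dpi = 150, bbox_inches = 'tight')  # hypothetical filename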