Scrape for Words

I use Python to scrape for words.

"Scraping" is actually a misnomer here, as I didn't need to scrape anything from the web; I simply retrieved the words from a corpus provided by nltk.

A corpus is a large, structured collection of text used for language analysis. For example, the Brown Corpus, compiled at Brown University, contains more than a million words categorised by genre.

For this example, we will use nltk's Project Gutenberg corpus, which contains a small sample of texts from Project Gutenberg, http://www.gutenberg.org/.
Alternatively, you may download your favourite classics from Project Gutenberg and load the text documents yourself.
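
If you take the download route, loading a saved plain-text book could look roughly like the sketch below. The file name my-book.txt and the helper load_words are placeholders of mine, not part of nltk, and the regular expression is only a rough stand-in for nltk's own tokenisation. The snippet writes a one-sentence sample file first so that it runs on its own.

```python
import re
import tempfile
from pathlib import Path


def load_words(path):
    """Read a plain-text file and split it into word and punctuation tokens."""
    text = Path(path).read_text(encoding="utf-8")
    # Runs of word characters become words; runs of other
    # non-space characters become punctuation tokens.
    return re.findall(r"\w+|[^\w\s]+", text)


with tempfile.TemporaryDirectory() as folder:
    # "my-book.txt" is a hypothetical name standing in for a downloaded text.
    sample_path = Path(folder) / "my-book.txt"
    sample_path.write_text("Emma Woodhouse, handsome, clever, and rich.",
                           encoding="utf-8")
    words = load_words(sample_path)

print(words[:4])
print(len(words))
```

Real downloads from Project Gutenberg also carry header and footer text around the book itself, which you may want to trim before counting.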

Import packages

First let's import the gutenberg corpus from the nltk package (if the corpus data is not installed yet, run nltk.download('gutenberg') once beforehand):

In [1]:
from nltk.corpus import gutenberg

Project Gutenberg Corpus

The nltk sample covers only a handful of Project Gutenberg texts; the fileids() method lists them:

In [2]:
gutenberg.fileids()
Out[2]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

To get the words, we just need to call the words() method of the gutenberg corpus:

In [3]:
wordlist = gutenberg.words('austen-emma.txt')
In [4]:
len(wordlist)
Out[4]:
192427

Collect, sort and save

This list contains every token in the text, repeats and punctuation included, so next we count how often each distinct token occurs:

In [5]:
from nltk import FreqDist
In [6]:
frequency_list = FreqDist(wordlist)
In [7]:
len(frequency_list)
Out[7]:
7811
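
To see what the two lengths mean, here is a small sketch using collections.Counter from the standard library, which, as far as I know, FreqDist builds on in NLTK 3. The token list is made up for illustration.

```python
from collections import Counter

# A made-up token list standing in for wordlist.
sample_tokens = ["the", "cat", "and", "the", "hat", "and", "the", "bat"]

counts = Counter(sample_tokens)

print(len(sample_tokens))  # total tokens, repeats included
print(len(counts))         # number of distinct tokens
print(counts["the"])       # how often one token occurs
```

The length of the original list is the total number of tokens, while the length of the frequency table is the number of distinct ones.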

And sort it by frequency:

In [8]:
most_common = frequency_list.most_common()
In [9]:
most_common[:5]
Out[9]:
[(',', 11454), ('.', 6928), ('to', 5183), ('the', 4844), ('and', 4672)]
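
Note that the top entries are punctuation marks, and that the counts are case-sensitive, so 'The' and 'the' are tallied separately. If you only want actual words, one option is to keep alphabetic tokens and lowercase them before counting. A minimal sketch on a made-up token list, with Counter standing in for FreqDist:

```python
from collections import Counter

# A made-up token list with mixed case and punctuation.
sample_tokens = ["The", "cat", ",", "the", "hat", ".", "The", "end", "."]

# Keep purely alphabetic tokens and fold case so that
# "The" and "the" are counted together.
cleaned = [token.lower() for token in sample_tokens if token.isalpha()]

counts = Counter(cleaned)
print(counts.most_common(1))
```

The isalpha() filter also drops numbers and any token containing a digit or symbol, which may or may not be what you want.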

Once sorted, we can keep the order and discard the counts. Each entry is a (word, count) tuple, so we keep only the first item of each tuple using a list comprehension:

In [10]:
common_words = [word for word, _ in most_common]

And save it for use:

In [11]:
with open("words.txt", "w", encoding="utf-8") as output:
    for item in common_words:
        output.write(f"{item}\n")

And thus we have a list of words ready for use, sorted by frequency from the most common to the least.
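
To use the list later, the file can be read back one word per line. A short sketch follows; it writes a small stand-in list to a temporary file first so that it runs on its own, and the variable names are only illustrative.

```python
import tempfile
from pathlib import Path

# Stand-in for common_words.
sample_words = [",", ".", "to", "the", "and"]

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "words.txt"

    # Write one word per line, as in the save step.
    with open(path, "w", encoding="utf-8") as output:
        for item in sample_words:
            output.write(f"{item}\n")

    # Read the words back, dropping the trailing newline from each line.
    with open(path, encoding="utf-8") as source:
        loaded = [line.rstrip("\n") for line in source]

print(loaded == sample_words)
```

Because the words were written in frequency order, the order of the lines in the file preserves the ranking.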