Tuesday, April 5, 2011

Painting a picture of statistical packages

Imagine you have to analyze text comprising 18,000 words. You have to identify the most commonly cited ideas or words in the text and then present the analysis in a graphic format. There are sophisticated tools out there to help you with this task, but then again there is a tight deadline. You have fewer than five minutes to accomplish the task.

Generating a word cloud from the text may be one option. It is fast and the resulting output is appealing as well as informative. See the word cloud below, which I have generated from the descriptions of 2,948 R packages listed at http://cran.r-project.org/web/packages. The one-line descriptions of these packages ran to 18,000-plus words. By using the free word cloud tool Wordle (http://www.wordle.net/), the task was accomplished in less than two minutes.


Based on the cloud we can see that the most frequently recurring themes in R packages are data, functions, models, estimation, regression, and Bayesian.

Wordle offers some control over the output. The cloud above was generated from the 150 most common words in the text. I eliminated ‘Analysis’ from the text because it was the most frequently repeated word. Later, I restricted the cloud to the 100 most repeated words, restored the word ‘Analysis’, and randomly generated another word cloud. See the output below.
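For readers who want to see what is being counted before anything is drawn, here is a minimal Python sketch of the same idea: keep the N most frequent words and drop a chosen word such as ‘Analysis’. This is not how Wordle works internally; the function name `top_words`, the tokenizing rule, and the sample sentence are my own assumptions for illustration.

```python
import re
from collections import Counter

def top_words(text, n=150, exclude=()):
    """Return the n most frequent words in text as (word, count) pairs.

    Hypothetical helper: words listed in `exclude` are dropped
    regardless of capitalization; all other words are counted as written.
    """
    excluded = {w.lower() for w in exclude}
    # Treat any run of ASCII letters as a word (a simplifying assumption).
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(w for w in words if w.lower() not in excluded)
    return counts.most_common(n)

# Made-up sample text standing in for the package descriptions.
sample = "Analysis of data. Bayesian analysis of regression models and data."
top = top_words(sample, n=3, exclude=["analysis"])
print(top)
```

Changing `n` from 150 to 100 and passing an empty `exclude` list mirrors the second cloud described above.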


Notice the two variants of the word ‘data’ in the cloud. Wordle allows the user to eliminate any word in the generated cloud with a click of a mouse and retain the cleaned version of the cloud.
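If the two variants differ only in capitalization (say ‘data’ and ‘Data’, which is my guess rather than something I have verified), they can also be merged before the cloud is drawn. The Python sketch below folds such variants into the most frequent spelling; the function name `merge_case_variants` and the counts are invented for illustration.

```python
from collections import Counter

def merge_case_variants(counts):
    """Combine counts of words that differ only in capitalization.

    Hypothetical helper: the most frequent spelling in each group
    is kept as the label for the merged count.
    """
    groups = {}
    for word, count in counts.items():
        groups.setdefault(word.lower(), Counter())[word] += count
    merged = Counter()
    for variants in groups.values():
        label = variants.most_common(1)[0][0]
        merged[label] = sum(variants.values())
    return merged

# Illustrative counts only, not taken from the CRAN text.
counts = Counter({"data": 40, "Data": 25, "models": 30})
merged = merge_case_variants(counts)
print(merged)
```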

Also, don’t miss Drew Conway’s blog post on building more intelligent word clouds at http://www.drewconway.com/zia/?p=2624.

1 comment:

  1. This is impressive.
    I would have liked to reproduce a similar experience, but I have been unable to use Wordle to look at the webpages I wanted.

    Did you simply give the link http://cran.r-project.org/web/packages in the Create page of Wordle?