
Monday, May 20, 2019

Modern Data Science with R: A review


Some say data is the new oil. Others equate its worth to water. And then there are those who believe that data scientists will be (in fact, they already are) one of the most sought-after workers in knowledge economies.

Millions of data-centric jobs require millions of trained data scientists. However, the installed capacity of graduate and undergraduate programs in data science is nowhere near meeting this demand over the next many years.

So how do we produce data scientists?

Given the enormous demand for data scientists and the fixed supply from higher education institutions, it is quite likely that one must look beyond colleges and universities to train a large number of data scientists desired by the marketplace.

Getting trained on the job is one possible route. This will require repurposing the existing workforce. To prepare the current workforce for data science, one needs training manuals. One such manual is Modern Data Science with R (MDSR).

Published by the CRC Press (Taylor and Francis Group) and authored by three leading academics: Professors Baumer, Kaplan, and Horton, MDSR is the missing manual for data science with R. The book is equally relevant to data science programs in higher ed as it is to practitioners who would like to embark on a career in data science or to get a taste of an aspect of data science that they have not explored in the past.

As the book's name suggests, the text is based on R, one of the most popular and versatile computing platforms. R is free software, developed in real time by thousands of volunteers. In addition to base R, which comes bundled with thousands of commands and functions, user-written packages, whose number exceeded 14,000 as of May 2019, further expand the universe of features, making R perhaps the most diverse computing platform.

Despite the immense popularity of data science, only a handful of titles focus exclusively on the topic. Hadley Wickham's R for Data Science and Paul Teetor's R Cookbook are two other worthy texts. MDSR is unique in that it serves as an introduction to a whole host of analytic techniques that are seldom covered in one title.

In the following paragraphs, I'll discuss the salient features of the textbook. I begin with my favourite attribute of the book: its organization. Instead of muddling through theories and philosophies, the book gets straight to business and starts the conversation with data visualization. A graphic is worth a thousand words, and MDSR is proof of it.

And since Hadley Wickham's influence on data science is ubiquitous, MDSR also embraces Wickham's implementation of the Grammar of Graphics in R, ggplot2, one of the most popular R packages.

Another avenue where Wickham's influence is widely felt is data wrangling. A suite of R packages bundled under the broader rubric of the Tidyverse is influencing how data scientists manipulate small and big data. Chapter 4 of MDSR is perhaps one of the best and most succinct introductions to data wrangling with R and the Tidyverse. From the simplest to more advanced examples, MDSR equips the beginner with the basics and the advanced user with new ways to think about analyzing data.

A key feature of MDSR is that it's not another book on statistics or econometrics with R. Yours truly is guilty of authoring one such book. Instead, MDSR is a book focused squarely on data manipulation. The treatment of statistical topics is not absent from the book; however, it's not the book's focus. It is for this reason that the discussion of Regression models appears in the appendices.

But make no mistake: even when statistics is not the focus, MDSR offers sound advice on the practice of statistics. Section 7.7, The perils of p-values, warns novice statisticians against becoming unsuspecting victims of hypothesis testing.

The book's distinguishing feature remains the diversity of data science challenges it addresses. For instance, in addition to data visualization, the book offers an introduction to interactive data graphics and dynamic data visualization.

At the same time, the book covers other diverse topics, such as database management with SQL, working with spatial data, analyzing text-based (non-structured) data, and the analysis of networks. A discussion about ethics in data science is undoubtedly a welcome feature in the book.

The book is punctuated with hundreds of useful, hands-on data science examples and exercises, providing ample opportunities to put concepts into practice. The book's accompanying website offers additional resources and code examples. At the time of this review, not all code was available for download.

Also, while I was able to reproduce the more straightforward examples, I ran into trouble with complex ones. For instance, I could not generate the advanced spatial maps showing flight origins and destinations.

My recommendation to the authors is to maintain an active supporting website, because R packages are known to evolve, and some functionality may change or disappear over time. For instance, the mapping algorithms in the ggmap package now require a Google Maps API key, or else the maps will not display. This change likely occurred after the book was published.
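For readers who run into the ggmap issue, recent versions of the package require registering a Google Maps API key before maps will download. A minimal configuration sketch (the key string is a placeholder you must replace with your own):

```r
library(ggmap)                         # assumes ggmap >= 2.7
register_google(key = "YOUR_API_KEY")  # placeholder; obtain a key from the Google Cloud console
# Once a key is registered, the book's ggmap-based examples should render again, e.g.:
# qmap("Toronto", zoom = 10)
```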

In summary, for aspiring and experienced data scientists, Modern Data Science with R is a book deserving to be in their personal libraries.

Murtaza Haider lives in Toronto and teaches in the Department of Real Estate Management at Ryerson University. He is the author of Getting Started with Data Science: Making Sense of Data with Analytics, which was published by the IBM Press/Pearson.

Monday, October 8, 2018

A question and an answer about recoding several factors simultaneously in R

Data manipulation is a breeze with amazing packages like plyr and dplyr. Recoding factors, which can prove a daunting task, especially for variables with many categories, is easily accomplished with these packages. However, it is important for those learning data science to understand how base R works.

In this regard, I seek help from R specialists about recoding factors using base R. My question is why one notation for recoding factors works while the other doesn't. I'm sure that for R enthusiasts the answer and solution are straightforward. So, here's the question.

In the following code, I generate a vector with six categories and 300 observations. I convert the vector to a factor and tabulate it.
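The original script was embedded as an image and is not reproduced in this archive. A minimal sketch consistent with the surrounding text (six letter categories, a through f, matching the later reference to level positions 1 and 6; the exact call is an assumption):

```r
# Sketch: sample 300 observations from six letter categories and tabulate
set.seed(123)                                 # for reproducibility
x <- sample(c("a", "b", "c", "d", "e", "f"),  # six categories
            size = 300, replace = TRUE)
x <- as.factor(x)                             # convert the vector to a factor
table(x)                                      # tabulate the factor
as.numeric(x)[1:10]                           # peek at the internal level codes
```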



Note that by using as.numeric, I can see the internal level code for the respective character notation. Let's say I would like to recode categories a and f as missing. I can accomplish this with the following code.
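The recoding script was likewise an image; a hedged reconstruction that recodes by level position:

```r
# Sketch: recode the categories in level positions 1 and 6 (a and f) as missing
set.seed(123)
x <- factor(sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE))
x[as.numeric(x) %in% c(1, 6)] <- NA   # positions 1 and 6 correspond to a and f
table(x, useNA = "ifany")             # a and f now show zero counts; NAs appear
```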



Where 1 and 6 correspond to a and f.

Note that I have used the position of the levels rather than the levels themselves to convert the values to missing.

So far so good.

Now let's assume that I would like to convert categories a and f to grades. The following code, I thought, would work, but it didn't: it returns varying and erroneous answers.
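The failing script was embedded as an image; judging from the explanation in the answer below (a six-element levels vector indexed by a 300-element logical vector), it was presumably of this shape:

```r
# Presumed shape of the failing script: the index as.numeric(x) %in% c(1, 6)
# has length 300, but levels(x) has only six elements, so the assignment
# renames essentially arbitrary levels rather than a and f
set.seed(123)
x <- factor(sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE))
levels(x)[as.numeric(x) %in% c(1, 6)] <- "grades"   # does NOT do what is intended
```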
However, when I refer to levels explicitly, the script works as intended. See the script below.
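The working script referred to here was also an image; a hedged reconstruction that refers to the levels explicitly:

```r
# Sketch: index the six levels themselves, not the 300 observations
set.seed(123)
x <- factor(sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE))
levels(x)[levels(x) %in% c("a", "f")] <- "grades"  # a and f merge into "grades"
table(x)
```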
Hence the question: why does one method work while the other doesn't? I look forward to responses from R experts.

The Answer


lebatsnok (https://stackoverflow.com/users/2787952/lebatsnok) answered the question on stackoverflow. The solution is simple. The following code works:
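The answer's code is not reproduced in this archive; the gist is to index the levels by position rather than by a 300-long logical vector, along these lines (a sketch, not lebatsnok's verbatim code):

```r
# Index the six levels directly by position: renaming levels 1 and 6 to the
# same label merges categories a and f into "grades"
set.seed(123)
x <- factor(sample(c("a", "b", "c", "d", "e", "f"), 300, replace = TRUE))
levels(x)[c(1, 6)] <- "grades"
table(x)
```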



The problem with my approach, as explained by lebatsnok, is the following:

"levels(x) is a character vector with length 6, as.numeric(x) is a logical vector with length 300. So you're trying to index a short vector with a much longer logical vector. In such an indexing, the index vector acts like a "switch", TRUE indicating that you want to see an item in this position in the output, and FALSE indicating that you don't. So which elements of levels(x) are you asking for? (This will be random, you can make it reproducible with set.seed if that matters.)"








Wednesday, March 7, 2018

Is it time to ditch the Comparison of Means (T) Test?

For over a century, academics have been teaching the Student’s T-test and practitioners have been running it to determine if the mean values of a variable for two groups were statistically different.

It is time to ditch the Comparison of Means (T) Test and rely instead on the ordinary least squares (OLS) Regression.

My motivation for this suggestion is to reduce the learning burden on non-statisticians whose goal is to find a reliable answer to their research question. The current practice is to devote a considerable amount of teaching and learning effort on statistical tests that are redundant in the presence of disaggregate data sets and readily available tools to estimate Regression models.

Before I proceed any further, I must confess that I remain a huge fan of William Sealy Gosset, who introduced the T-statistic in 1908. He excelled in intellect and academic generosity. Mr. Gosset published the paper that introduced the t-statistic under the pseudonym 'Student'. To this day, the T-test is known as the Student's T-test.

My plea is to replace the Comparison of Means (T-test) with OLS Regression, which of course relies on the T-test. So, I am not necessarily asking for ditching the T-test but instead asking to replace the Comparison of Means Test with OLS Regression.

The following are my reasons:

1. Pedagogy-related reasons:
   a. Teaching Regression instead of other intermediary tests will save instructors considerable time, which could be used to illustrate the same concepts with examples using Regression.
   b. Given that fewer than 39 hours of lecture time are available in a single-semester introductory course on applied statistics, much of the course is consumed in covering statistical tests that would be redundant if Regression models were introduced sooner.
   c. Academic textbooks for undergraduate students in business, geography, and psychology dedicate huge sections to a battery of tests that are redundant in the presence of Regression models.
      i. Consider the widely used textbook Business Statistics by Ken Black, which requires students and instructors to leaf through 500 pages before OLS Regression is introduced.
      ii. The learning requirements of undergraduate and graduate students not enrolled in economics, mathematics, or statistics programs are quite different. Yet most textbooks and courses attempt to turn all students into professional statisticians.
2. Applied analytics reasons:
   a. An OLS Regression model with a continuous dependent variable and a dichotomous explanatory variable produces exactly the same output as the standard Comparison of Means Test.
   b. Extending the comparison to more than two groups is straightforward in Regression: the explanatory variable simply comprises more than two groups.
      i. In the traditional statistics teaching approach, one advises students that the T-test is not valid for comparing means across more than two groups and that we must switch to a new method, ANOVA.
      ii. You might have caught my drift that I am also proposing to replace teaching ANOVA in introductory statistics courses with OLS Regression.
   c. A Comparison of Means Test illustrated as a Regression model is much easier to explain than the output from a conventional T-test.
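The equivalence claimed in point 2(a) is easy to check on simulated data before turning to the real example below (a sketch; the variable names are invented):

```r
# Compare a pooled-variance t-test with an OLS regression on a group dummy
set.seed(42)
group <- rep(c("A", "B"), each = 100)          # two groups of 100 observations
y     <- ifelse(group == "B", 0.5, 0) + rnorm(200)

tt  <- t.test(y ~ group, var.equal = TRUE)     # Comparison of Means Test
ols <- summary(lm(y ~ group))                  # OLS with a dichotomous regressor

# The absolute t-statistics (and the p-values) coincide
c(t_test = unname(tt$statistic),
  ols    = ols$coefficients["groupB", "t value"])
```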

After introducing Normal and T-distributions, I would, therefore, argue that instructors should jump straight to Regression models.

 Is Regression a valid substitute for T-tests?


In the following lines, I will illustrate that the output generated by OLS Regression models and a Comparison of Means test are identical. I will illustrate examples using Stata and R.

Dataset

I will use Professor Daniel Hamermesh’s data on teaching ratings to illustrate the concepts. In a popular paper, Professor Hamermesh and Amy Parker explore whether good looking professors receive higher teaching evaluations. The dataset comprises teaching evaluation score, beauty score, and instructor/course related metrics for 463 courses and is available for download in R, Stata, and Excel formats at: 


Hypothetically Speaking

Let us test the hypothesis that the average beauty score for male and female instructors is statistically different. The average (normalized) beauty score was -0.084 for male instructors and 0.116 for female instructors. The following Box Plot illustrates the difference in beauty scores. The question we would like to answer is whether this apparent difference is statistically significant.


I first illustrate the Comparison of Means Test and OLS Regression assuming Equal Variances in Stata.

 Assuming Equal Variances (Stata)


Download data
       use "https://sites.google.com/site/statsr4us/intro/software/rcmdr-1/teachingratings.dta"
encode gender, gen(sex) // To convert a character variable into a numeric variable.

The T-test is conducted using:

       ttest beauty, by(sex)

The above generates the following output:


As per the above estimates, the average beauty score of female instructors is 0.2 points higher and the t-test value is 2.7209. We can generate the same output by running an OLS regression model using the following command:

reg beauty i.sex

The regression model output is presented below.



Note that the average beauty score for male instructors is 0.2 points lower than that of females, and the associated standard errors and t-values (highlighted in yellow) are identical to those reported in the Comparison of Means test.

Unequal Variances


But what about unequal variances? Let us first conduct the t-test using the following syntax:

ttest beauty, by(sex) unequal

The output is presented below:



Note the slight change in standard error and the associated t-test.

To replicate the same results with a Regression model, we need to run a different Stata command that estimates a variance weighted least squares regression. Using Stata’s vwls command:

vwls beauty i.sex


Note that the last two outputs are identical. 

Repeating the same analysis in R


To download data in R, use the following syntax:

url = "https://sites.google.com/site/statsr4us/intro/software/rcmdr-1/TeachingRatings.rda"
download.file(url,"TeachingRatings.rda")
load("TeachingRatings.rda")

For equal variances, the following syntax is used for the T-test and the OLS regression model (assuming the loaded .rda file provides a data frame named TeachingRatings):
t.test(beauty ~ gender, data = TeachingRatings, var.equal = TRUE)
summary(lm(beauty ~ gender, data = TeachingRatings))

The above generates output identical to Stata's.


For unequal variances, we need to install and load the nlme package to run a gls version of the variance-weighted least squares Regression model.

t.test(beauty ~ gender, data = TeachingRatings)  # Welch test for unequal variances
install.packages("nlme")
library(nlme)
summary(gls(beauty ~ gender, data = TeachingRatings,
            weights = varIdent(form = ~ 1 | gender)))

The above generates the following output:


So there we have it, OLS Regression is an excellent substitute for the Comparison of Means test.



Monday, August 22, 2016

Five Questions about Data Science

---

Recently, we were able to ask five questions of Murtaza Haider, about the new book from IBM Press called “Getting Started with Data Science: Making Sense of Data with Analytics.” Below, the author talks about the benefits of data science in today’s professional world.

Getting Started with Data Science

1. What are some examples of data science altering or impacting traditional professional roles already?

Only a few years ago, there was no job with the title of Chief Data Scientist. But that was then. Small and large corporations, and increasingly government agencies, are putting together teams of data scientists and analysts under the leadership of chief data scientists. Even the White House has a Chief Data Scientist position, currently held by Dr. DJ Patil.

The traditional role for those who analyzed data was that of a computer programmer or a statistician. In the past, firms collected large amounts of data to archive rather than to subject it to analytics to assist with smart decision-making. Companies did not see value in turning data into insights and instead relied on the gut feeling of managers and anecdotal evidence to make decisions.

Big data and analytics have alerted businesses and governments to the latent potential of turning bits and bytes into profits. To enable this transformation, hundreds of thousands of data scientists and analysts are needed. Recent reports suggest that the shortage of such professionals will be in millions. No wonder we see hundreds of postings for data scientists on LinkedIn.

As businesses increasingly depend upon analytics-driven decision-making, data scientists and analysts are becoming front-office superstars, quite a change from their back-office roles in the past.

2. What steps can a professional take today to learn how and why to implement data science into their current role?

Sooner rather than later, workers will find their managers asking them to assume additional responsibilities involving data, either generating or consuming analytics. Smart professionals who are uninitiated in data science would therefore do well to proactively address this gap in their portfolios by acquiring skills in data science and analytics. Fortunately, in a world awash with data, opportunities to acquire analytic skills are also ubiquitous.

For starters, professionals should consider enrolling in open online courses offered by the likes of Coursera and BigDataUniversity.com. These platforms offer a wide variety of training opportunities for beginners and advanced users of data and analytics. At the same time, most of these offerings are free.

For those professionals who would like to pursue a more structured approach, I suggest that they consider continuing education programs offered by the local universities focusing on data and analytics. While working full-time, the professionals can take part-time courses in data science to fill the gap in their learning and be ready to embrace impending change in their roles.

3. Do you need programming experience to get started in data? What kind of methods and techniques can you utilize in a program more commonly used, such as Excel?

Computer programming skills are a definite plus for data scientists, but they are certainly not a limiting factor that would prevent those trained in other disciplines from joining the world of data scientists. In my book, Getting Started with Data Science, I mentioned examples of individuals who took short courses in data science and programming after graduating from non-empirical disciplines, and subsequently were hired in data scientist roles that paid lucrative salaries.

The choice of analytics tools depends largely on the discipline and the type of organization you currently work for or intend to work for in the future. If you intend to work for corporations that generate truly big data, such as telecom and Internet-based establishments, you need to be proficient in big data tools, such as Spark and Hadoop. If you would like to be employed in the industry that tracks social media, you will require skills in natural language processing and proficiency in languages such as Python. If you happen to be interested in a traditional market research firm, you need proficiency in analytics software, such as SPSS and R.

If your focus is on small and medium size enterprises, proficiency in Excel could be a great asset, which would allow you to deploy its analytics capabilities, such as Pivot Tables, to work with small sized data.

A successful data scientist is one who knows some programming, has a basic understanding of statistical principles, possesses a curious mind, and is capable of telling great stories. I argue that without storytelling capabilities, a data scientist will be limited in his or her ability to become a leader in the field.

4. How do you see data science affecting education and training moving forward? What benefits will it bring to learning at all levels?

Schools, colleges, universities and others involved in education and learning are putting big data and analytics to good use. Universities are crunching large amounts of data to determine what gaps in learning at the high school level act as impediments to success in the future. Schools are improving not just curriculum, but also other strategies to improve learning outcomes. For instance, research in India using large amounts of data showed that when children in low-income communities were offered free meals at school, their dropout rates declined and their academic achievements improved.

Big data and analytics provide instructors and administrators the opportunity to test their hypothesis about what works and what doesn’t in learning, and replace anecdotes with hard evidence to improve pedagogy and learning. Learning has taken a new shape and form with open online courses in all disciplines. These transformative changes in learning have been enabled by advances in information and communication technologies, and the ability to store massive amounts of data.

5. Do you think that modern governments and societies are prepared for what changes that big data and data science might bring to the world?

Change is inevitable. Whether modern governments and societies like it or not, they will have to embrace change. Fortunately, smart governments and societies have already embraced data-driven decision-making and evidence-based planning. Governments in developing countries are already using data and analytics to devise effective poverty-reduction strategies. Municipal governments in developed economies are using data and advanced analytics to find solutions to traffic congestion. Research in health and well-being is leveraging big data to discover new medicines and cures for illnesses that challenge us all.

As societies embrace data and analytics as tools to engineer prosperity and well-being, our collective abilities to achieve a better tomorrow will be further enhanced.

Monday, July 18, 2016

Data Science Boot Camp completed at Ryerson University

I am pleased to update you on the Data Science Boot Camp we ran at the Ted Rogers School of Management at Ryerson University in Toronto in collaboration with IBM’s www.BigDataUniversity.com. The 9-week long Boot Camp concluded on July 15. 

We received a total of 1,137 registrations and the attendance ranged between 100 to 150 participants each week. 

I have made the resources (software code, PowerPoint slides, etc.) available online at https://sites.google.com/site/statsr4us/workshops/datascience. We recorded 24 hours of video, which will be online soon.

I restricted the hands-on training to R, hence the Boot Camp serves as an introduction to analytics with R. You are welcome to share these resources.

A breakdown of weekly schedule is provided in the following hyperlinked list:


Friday, October 30, 2015

Curious about big data in Montreal?

Are you in Montreal and curious about big data? Well, here is your chance to attend a session on the topic at Concordia University on Tuesday, Nov. 3 at 6:00 pm.

www.BigDataUniversity.com, an IBM-led initiative, is running meetups across North America to create awareness about, and training in, big data analytics.

BigDataUniversity runs MOOCs and, through its online Data Scientist Workbench, provides access to Python, R, and even Spark. You can also learn about Watson Analytics and see how you can work with the state of the art in computing.

Further details are available at:

Getting started with Data Science and Introduction to Watson Analytics

http://www.meetup.com/YUL-Social-Mobile-Analytics-Cloud-Meetup/

When: Tuesday, November 3rd at 6-9 PM

Where: H1269, 12th floor of the Hall Bldg 
(1455, blvd. De Maisonneuve ouest - Metro Guy-Concordia)

Monday, July 30, 2012

Big data, big analytics, big opportunity

Data, data, every where
Nor any byte to think

The world today is awash with data. Corporations, governments, and individuals are busy generating petabytes of data on culture, economy, environment, religion, and society.  While data has become abundant and ubiquitous, data analysts needed to turn raw data into knowledge are in fact in short supply.

With big data comes big opportunity for the educated middle class in the developing world where an army of data scientists can be trained to support the offshoring of analytics from the western countries where such needs are unlikely to be filled from the locally available talent.

In a 2011 report, McKinsey Global Institute revealed that the United States alone faces a shortage of almost 200,000 data analysts. The American economy requires an additional 1.5 million managers proficient in decision-making based on insights gained from the analysis of large data sets. And even when Hal Varian, Google’s famed chief economist, profoundly proclaimed that “the real sexy job in 2010s is to be a statistician,” there were not many takers for the opportunity in the West where students pursuing degrees in statistics, engineering, and other empirical fields are small in number and are often visa students from abroad.

A recent report by Statistics Canada revealed that two-thirds of those who graduated with a PhD in engineering from a Canadian University in 2005 spoke neither English nor French as mother tongue. Similarly, four out of 10 PhD graduates in computers, mathematics, and physical sciences did not speak a western language as mother tongue. Also, more than 60 per cent of engineering graduates were visible minorities, suggesting that the supply chain of highly qualified professional talent in Canada, and to a large extent in North America, is already linked to the talent emigrating from China, Egypt, India, Iran, and Pakistan.

The abundance of data and the scarcity of analysts present a unique opportunity for developing countries, which have an abundant supply of highly numerate youth who could be trained and mobilized en masse to write a new chapter in offshoring. This would require a serious rethink for thought leaders in developing countries who have not taxed their imaginations beyond thinking of policies to create sweat shops where youth would undersell their skills and see their potential wilt away while creating undergarments for consumers in the west. The fate of the youth in developing countries need not be restricted to stitching underwear or making cold calls from offshored call-centers in order for them to be part of the global value chains. Instead, they can be trained as skilled number-crunchers who would add value to otherwise worthless data for businesses, big and small.

A multi-billion dollar industry

The past decade has witnessed a major change in the sectoral evolution of some very large manufacturing firms, known in the past mostly for hardware engineering and now evolving into firms delivering services, such as business analytics. Take IBM, for example, which specialized as a computer hardware company producing servers, desktop computers, laptops, and other supporting infrastructure. That was IBM's past. Today, IBM is focused on analytics. It is spending hundreds of millions of dollars in advertising, trying to rebrand itself as a leader in business analytics. In fact, it has divested from several hardware initiatives, such as manufacturing laptops, and has instead spent billions on acquisitions to build its analytics credentials. For instance, IBM acquired SPSS for over a billion dollars to capture the retail side of the business analytics market. For large commercial ventures, IBM acquired Cognos to offer full-service analytics.

In 2011 alone, the business analytics software market was worth over $30 billion. Oracle ($6.1 bn), SAP ($4.6 bn), IBM ($4.4 bn), and Microsoft and SAS (each with $3.3 bn in sales) led the market. It is estimated that sales of business analytics software alone will hit $50 billion by 2016. Dan Vesset of IDC, a company specializing in watching industry trends, aptly noted that business analytics had "crossed the chasm into the mainstream mass market" and that the "demand for business analytics solutions is exposing the previously minor issue of the shortage of highly skilled IT and analytics staff."

In addition to the bundled software and service sales offered by the likes of Oracle and IBM, business analytics services in the consulting domain generated several billion dollars more worldwide. While the large firms command the lion’s share in the analytics market, the billions left at the bottom are still a large enough prize to take the analytics plunge.

Several billion reasons to hop on the analytics bandwagon

While the IBMs of the world are focused largely on large corporations, the analytics needs for small and medium-sized enterprises (SMEs) are unlikely to be met by IBM, Oracle, or other large players. Cost is the most important determinant. SMEs prefer to have analytics done on the cheap while the overheads of the large analytics firms run into millions of dollars thus pricing them out of the SME market. With offshoring comes the access to affordable talent in developing countries who can bid for smaller contracts and beat the competition in the West on price, and over time on quality as well.

The trick, therefore, is to beat the IBMs of the world at the analytics game by not competing against them. Realizing that business analytics is not one market but an amalgamation of several markets focused on delivering value-added services involving data capture, data warehousing, data cleaning, data mining, and data analysis, developing countries can carve out a niche for themselves by focusing exclusively on contracts that large firms will not bid for because of their intrinsically large overheads.

Leaving the fight for top dollars in analytics to the top dogs, a cottage industry in analytics could be developed in developing countries to serve the analytics needs of SMEs. Take the example of the Toronto Transit Commission (TTC), Canada's largest public transit agency, with annual revenues exceeding a billion dollars. When the TTC needed a large database of almost half a million commuter complaints analyzed, it turned to Ryerson University rather than a large analytics firm. The TTC's decision to work with Ryerson University was motivated by two considerations. First, cost: as a public sector university, Ryerson believes strongly in serving the community and thus offered the services gratis. Second, quality: Ryerson University, like most similar institutions of higher learning, excels in analytics, where several faculty members work at the cutting edge and are more than willing to apply their skills to real-life problems.

Why now?

The timing has never been better to undertake such an endeavor on a very large scale. Innovations in Information and Communication Technology (ICT) and the ready availability of the most advanced analytics software as freeware allow entrepreneurs in developing countries to compete worldwide. The Internet makes it possible to be part of global marketplaces at negligible cost. With cyber marketplaces such as Kijiji and Craigslist, individuals can become proprietors offering services worldwide.

Using the freely available Google Sites, one can have a business website online immediately at no cost. Google Docs, another free service from Google, lets one share documents with collaborators or the rest of the world for free. Other free services, such as Google Trends, allow individual researchers to generate data on business and social trends without needing subscriptions to services that cost millions. The graph below, generated using Google Trends, shows daily visits to the websites of leading analytics firms. Without free access to such services, the data used to generate the same graph would carry a huge price tag.

Similarly, another free service from Google allows one to determine, for instance, which cities registered the highest number of search requests for 'business analytics'. It appears that four of the top six cities where analytics is most popular are located in India, as is evident from the following graph, where search intensity is mapped on a normalized index of 0 to 100.
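Google Trends does not report raw search counts; it reports a normalized index in which the peak interest in the chosen period is set to 100 and everything else is scaled relative to that peak. A minimal Python sketch of that normalization, using made-up counts (the numbers are illustrative, not Google data):

```python
def to_trends_index(counts):
    """Scale raw counts to a Google Trends-style 0-100 index,
    where 100 marks the peak value in the series."""
    peak = max(counts)
    if peak == 0:
        return [0 for _ in counts]
    return [round(100 * c / peak) for c in counts]

# Hypothetical weekly search counts for 'business analytics'
weekly = [120, 300, 450, 600, 240]
print(to_trends_index(weekly))  # -> [20, 50, 75, 100, 40]
```

Because the index is relative, two cities with very different absolute search volumes can both show a 100 at their respective peaks; comparisons across regions rely on Google's own per-region scaling.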

The other big development of recent times is freeware, which is leveling the playing field between haves and have-nots. In analytics, one of the most sophisticated computing platforms is R, which is available for free. Developers worldwide are busy extending the R platform, which now offers over 3,000 free packages for analyzing data. From econometrics to operations research, R is fast becoming the lingua franca of computing. R has evolved from being popular just among computing geeks to having its praises sung by the New York Times.

R has also made some new friends, most notably Paul Butler, a Canadian student who became a worldwide sensation by mapping the geography of Facebook. While interning at Facebook, Paul analyzed gigabytes of data to plot how Facebook friendships were linked globally. His map (see the image below) became an instant hit and has been reproduced in publications thousands of times. If you are wondering what software Paul used to generate the map, wonder no more: the answer is R.

R is fast becoming the preferred computing platform for data scientists worldwide. For decades the data analysis market was ruled by the likes of SAS, SPSS, Stata, and similar players. Lately, however, R has captured the imagination of data analysts, who are fast converging on it, especially given R's ability to interact with Hadoop (another open source platform) for analyzing big data. In fact, most innovations in statistics are first coded in R so that the algorithms become available to everyone immediately and for free.

Source: http://r4stats.com/articles/popularity/

The fact that R is freely available should not be taken lightly. A commercial license for a similarly equipped version of SPSS may cost up to US$7,500. The other big advantage of R is that thousands of training documents and YouTube videos have been made freely available on the Internet by volunteers.

Where to next

The private sector has to take the lead for business analytics to take root in developing countries. Governments could also play a small role in regulation. However, the analytics revolution has to take place not because of the public sector, but in spite of it. Even public sector universities in developing countries cannot be entrusted with the task, where senior university administrators do not warm up to innovative ideas unless they involve a junket in Europe or North America. At the same time, the faculty in public sector universities in developing countries is often unwilling to try new technologies.

The private sector in developing countries may first want to launch an industry group that takes up the task of certifying firms and individuals interested in analytics for quality, reliability, and ethical and professional competence. This will help build confidence in national brands. Without such certification, foreign clients will be apprehensive about sharing their proprietary data with individuals hidden behind computer monitors thousands of miles away.

The private sector will also have to take the lead in training a professional analytics workforce. Several companies train their employees in the latest technology and then market their skills to clients. The training houses could therefore double as consulting practices, where the best graduates may be retained as consultants.

Small virtual marketplaces could be set up in large cities where clients can post requests for proposals and pre-screened, qualified bidders can compete for contracts. The national self-regulating body would be responsible for screening qualified bidders from its vendor-of-record database, which it would make available to clients globally through the Internet.

The IBMs of the world expect the analytics market to reach hundreds of billions of dollars in revenue in the next decade. The abundant talent in developing countries can be polished into a skilled workforce to tap into that market, channeling some of the revenue to developing countries while creating gainful employment for educated youth who have been reduced to making cold calls from offshored call centers.

Thursday, January 6, 2011

In praise of the article

As a non-native speaker of the English language, I have always struggled with the elusive article, especially 'the'. When 'the' should be used is not intuitive to me. Therefore, I rely on rules to determine when to use an article.

Over the years, one would not expect any change in the frequency with which articles are used in the English language. However, one can observe a significant decline in the use of the definite article ('the') in American and British English. See the graph below, which shows that in American English the definite article 'the' represented 5.5% of the words used in books published in English in the United States. These are the books scanned by Google as part of its initiative to digitize every published book. However, one sees a decline in the use of 'the' starting in the 1970s. I wonder why. Is the language referring more to proper nouns, hence the decline in 'the'? Also, 'the' has been used much more frequently than 'a' or 'an'.

[Graph: share of 'the' in books published in the United States]

The books published in English in England and scanned by Google show a similar trend, visible in the graph below.

[Graph: share of 'the' in books published in England]

Data through history

Analytics and data are becoming ubiquitous in finance, politics, and other spheres of life, such as friendship, where people now boast about how many friends they have on Facebook. It was, however, not very long ago that the word data was not even part of the everyday lexicon. See the graph below, which shows the evolution of the word data over the past 100 years in the books digitized by Google. The graph immediately below is for the word data as used in books published in English in the United States. The y-axis presents the share of the word data in a given year as a percentage of all words published in books in that particular year.
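The y-axis measure just described (a word's share of all words published in a given year) can be sketched in a few lines of Python. The corpus below is a toy example invented for illustration, not Google's digitized data:

```python
from collections import Counter

def word_share(texts_by_year, word):
    """Percentage share of `word` among all tokens published each year,
    mirroring the y-axis of an Ngram-style chart."""
    shares = {}
    for year, texts in texts_by_year.items():
        tokens = [t for text in texts for t in text.lower().split()]
        counts = Counter(tokens)
        shares[year] = 100 * counts[word] / len(tokens)
    return shares

# Toy two-'year' corpus
corpus = {
    1960: ["the data from the survey", "the results were clear"],
    1980: ["data data everywhere and more data"],
}
print(word_share(corpus, "data"))  # 'data' is 3 of 6 tokens in 1980 -> 50.0
```

Real Ngram curves are computed the same way, only over billions of tokens per year, which is why even a fraction of a percent is a meaningful share.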

Data saw an early increase in its mentions in the 1920s in American English. However, it was only in the 1960s that the use of data became more pronounced, and it remained so until the mid-1980s. This was the period when Robert McNamara, the most prominent of all quants, tried to win a war in Vietnam by improving the analytics. He failed. A decline in its mentions is observed in the 1990s, followed by a quick reversal, with a rapid increase from the late 1990s to the first few years of the new millennium. Since then, the decline in mentions of the word data has resumed.

[Graph: share of the word 'data' in books published in the United States]

The graph below shows the same for books in English published in England. The decline in its mentions over the past decade seems to be levelling off in the UK.

[Graph: share of the word 'data' in books published in England]

Saturday, December 25, 2010

Using statistics to understand armed conflict

Drew Conway, a doctoral student in New York, uses statistical analysis to make sense of armed conflicts. Pasted below is a graphic he developed by analyzing WikiLeaks data about Afghanistan in July 2010. He used the R software to generate the graphic.

The gold standard in newsroom graphics: The New York Times


Watch Amanda Cox explain how The New York Times uses graphics in its print and online editions. The New York Times has been at the cutting edge of using data and graphics. The hour-long video is worth watching for anyone interested in using data to communicate.

http://newmediadays.dk/amanda-cox

Forbes magazine recognizes R

R is on a roll. First it was the New York Times, then Facebook, and now Forbes.
