Wednesday, January 13, 2016

Getting Started with Data Science: Storytelling with Data

Earlier this month, IBM Press and Pearson published my book, Getting Started with Data Science: Making Sense of Data with Analytics. You can download sample pages, including a complete chapter; there are 104 pages in the sample. You can also watch a brief interview about the book recorded earlier at the IBM Insight2015 Conference.

The very purpose of authoring this book was to rethink the way we have been teaching statistics and analytics to students and practitioners. It is no secret that most students required to take the mandatory stats course dislike it. I believe this has more to do with the way we have been teaching the subject than with the aptitude of our students. Furthermore, I believe there is a great opportunity to equip students with the skills needed in a world awash with data, where competing on analytics defines the real competitive advantage.

No wonder the latest issue of the leading publication on the subject, The American Statistician, is dedicated to reimagining how statistics should be taught in the undergraduate curriculum. The editors noted:
“We hope that this collection of articles as well as the online discussion provide useful fodder for further review, assessment, and continuous improvement of the undergraduate statistics curriculum that will allow the next generation to take a leadership role by making decisions using data in the increasingly complex world that they will inhabit.”
I am confident that my book will do its small part in equipping the next generation of students with the kind of skills needed to succeed in a data-centric world. For one, I have taken a storytelling approach to statistics. This book reinforces the point that data science and analytics training should be applied rather than theoretical, and the ultimate purpose of producing or consuming statistical analysis is to tell fascinating stories from it. Therefore, the book opens with the chapter titled, The Bazaar of Storytellers.

Who is this book for?

While the world is awash with large volumes of data, inexpensive computing power, and vast amounts of digital storage, the skilled workforce capable of analyzing data and interpreting it is in short supply. A 2011 McKinsey Global Institute report suggests that “the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.”

Getting Started with Data Science (GSDS) is a purpose-written book targeted at professionals who are tasked with analytics but do not have the comfort level needed to be proficient in data-driven analysis. GSDS appeals to students who are frustrated with the impractical nature of prescribed textbooks and are looking for an affordable text to serve as a long-term reference. GSDS embraces the 24-7 streaming of data and is structured for users who have access to data and software of their choice but do not know what methods to use, how to interpret the results, and, most importantly, how to communicate findings as reports and presentations in print or online.

GSDS is a resource for millions employed in knowledge-driven industries where workers are increasingly expected to facilitate smart decision-making using up-to-date information that sometimes takes the form of continuously updating data.

At the same time, the learning-by-doing approach in the book is equally suited for independent study by senior undergraduate and graduate students who are expected to conduct independent research for their coursework or dissertations.

Praise for the book

I am also pleased to share with you the praise for my book by Dr. Munir Sheikh, Canada’s former chief statistician:
“The power of data, evidence, and analytics in improving decision-making for individuals, businesses, and governments is well known and well documented. However, there is a huge gap in the availability of material for those who should use data, evidence, and analytics but do not know how. This fascinating book plugs this gap, and I highly recommend it to those who know this field and those who want to learn.”
— Munir A. Sheikh, Ph.D., Distinguished Fellow and Adjunct Professor at Queen’s University

Tom Davenport, author of the bestselling books Competing on Analytics and Big Data @ Work, has the following to say about my book:
“A coauthor and I once wrote that data scientists held ‘the sexiest job of the 21st century.’ This was not because of their inherent sex appeal, but because of their scarcity and value to organizations. This book may reduce the scarcity of data scientists, but it will certainly increase their value. It teaches many things, but most importantly it teaches how to tell a story with data.”
—Thomas H. Davenport, Distinguished Professor, Babson College; Research Fellow, MIT.

Dr. Patrick Surry, Chief Data Scientist, had the following to say:
“This book addresses the key challenge facing data science today, that of bridging the gap between analytics and business value. Too many writers dive immediately into the details of specific statistical methods or technologies, without focusing on this bigger picture. In contrast, Haider identifies the central role of narrative in delivering real value from big data.

“The successful data scientist has the ability to translate between business goals and statistical approaches, identify appropriate deliverables, and communicate them in a compelling and comprehensible way that drives meaningful action. To paraphrase Tukey, ‘Far better an approximate answer to the right question, than an exact answer to a wrong one.’ Haider’s book never loses sight of this central tenet and uses many real-world examples to guide the reader through the broad range of skills, techniques, and tools needed to succeed in practical data science.

“Highly recommended to anyone looking to get started or broaden their skill set in this fast-growing field.”
And finally, Professor Atif Mian, author of the best-selling book The House of Debt, offered the following assessment:
“We have produced more data in the last two years than all of human history combined. Whether you are in business, government, academia, or journalism, the future belongs to those who can analyze these data intelligently. This book is a superb introduction to data analytics, a must-read for anyone contemplating how to integrate big data into their everyday decision making.”
— Professor Atif Mian, Theodore A. Wells ’29 Professor of Economics and Public Affairs,
Princeton University; Director of the Julis-Rabinowitz Center for Public Policy and Finance at the Woodrow Wilson School.

Sunday, December 6, 2015

Not so sweet sixteen!

In the world of big data and real-time analytics, Microsoft users are still living with the constraints of the bygone days of little data and basic numeracy.

If you happen to use Microsoft Excel for running regressions, you will soon discover your limits: the Windows version of Excel 2013 permits no more than 16 explanatory variables.

Excel has made great progress in expanding its capabilities in the recent past. Unlike the few thousand rows of earlier versions, the current version permits about a million rows per sheet (a single data set). But when it comes to regression, even if you have several thousand observations in the data set, you are still limited by a hard constraint of sixteen explanatory variables.

Some would argue that, for parsimony, we should be content with the restriction. True, but with categorical variables the number of explanatory variables quickly stretches beyond the artificial constraint set by Microsoft Excel.
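To see how quickly categorical variables exhaust that ceiling, consider a short base-R sketch (the variable names and level counts here are hypothetical, chosen only for illustration): a 10-level region factor and an 8-level month factor expand, via dummy coding, into 16 indicator columns before a single numeric predictor is added.

```r
# A k-level factor contributes k - 1 dummy (indicator) columns
# to the regression design matrix.
region <- factor(rep(paste0("region", 1:10), each = 10))   # 10 levels
month  <- factor(rep(month.abb[1:8], length.out = 100))    # 8 levels
income <- rnorm(100)                                       # one numeric predictor

X <- model.matrix(~ region + month + income)
ncol(X) - 1   # 9 + 7 + 1 = 17 explanatory columns, already past Excel's 16
```

Two modest categorical predictors and one numeric variable already require 17 columns, one more than Excel's regression tool will accept.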

Others might ask why one would do statistical analysis in Excel in the first place. Despite the inherent limitations of Microsoft Excel, business schools in particular, and other social science undergraduate programs in general, are increasingly turning to Excel to teach courses in statistics. If you were to take a quick look at the curricula of undergraduate business and numerous MBA programs, you would realize how widespread the use of Excel is for courses in statistics and analytics.

At Ryerson University, I switched to R years ago for my MBA courses. Thanks to John Fox’s R Commander, the transition to R was without much hassle. The students were told in the very beginning that they were now part of the big league, and hiding behind spreadsheets was no longer an option.

I must mention that Microsoft Excel continues to be my platform of choice for a variety of tasks. I use Excel several times a day, but not for statistical analysis. I am not suggesting that Excel cannot do statistics; I am arguing that it could do a much better job of it.

As I see it, Microsoft has several options. The first is to do nothing. After all, Microsoft Excel has no real competition in the Windows environment. Second, it could turn to the team that programmed the LINEST function in Excel and ask them to add some muscle to it. That would be the wrong approach.

Instead, Microsoft should explore ways to integrate R or other free software with Excel to add a complete analytics menu. Microsoft should learn from what the leaders in analytics are already doing. SPSS, an industry leader in the analytics category, has already integrated R, allowing SPSS users to merge the robust data management strengths of SPSS with the state-of-the-art analytics bundled with R. SAS, another big name in analytics, is about to do the same.

And since Microsoft has recently acquired Revolution R, it makes even more sense to build a bridge between Excel and Revolution R Open (RRO).

R Through Excel is one example of integrating R with Excel. If Microsoft were to put its weight behind the initiative, it could build a seamless coupling with R, expanding the analytic capabilities of hundreds of millions of Excel users.

As for SPSS, I recommend it also consider another option. If Microsoft were to integrate RRO with Excel, SPSS could respond by acquiring advanced analytics software and integrating it. For this option, I would recommend Limdep, which I have found to be the most diverse software for statistical analysis and econometrics. Even though R is a collective effort of thousands of software developers, Limdep offers numerous routines and post-estimation options that are not available in the thousands of R packages. SPSS integrated with Limdep could become the most diversely capable commercial software in the market, as it would bridge the gap with SAS and Stata.

As for colleagues in business faculties pondering what platform to adopt for analytics and software courses, I would say: know your limits, especially Microsoft Excel's, while deciding upon the curriculum.

Friday, October 30, 2015

Curious about big data in Montreal?

Are you in Montreal and curious about big data? Well, here is your chance: BigDataUniversity, an IBM-led initiative running meetups across North America to create awareness about, and training in, big data analytics, is holding a session at Concordia University on Tuesday, November 3 at 6:00 p.m.

BigDataUniversity runs MOOCs and, through its online data scientist workbench, provides access to Python, R, and even Spark. You can also learn about Watson Analytics and see how you can work with the state of the art in computing.

Further details are available at:

Getting started with Data Science and Introduction to Watson Analytics

When: Tuesday, November 3rd at 6-9 PM

Where: H1269, 12th floor of the Hall Bldg 
(1455, blvd. De Maisonneuve ouest - Metro Guy-Concordia)

Wednesday, May 20, 2015

Are Canadian newspapers painting false pictures with data?

The Canadian newspaper The Globe and Mail is a leader in diction and style, but it may need improvement in the ‘grammar of graphics’.

The Globe’s recent depiction of metropolitan economic growth in the series Off the Charts was way off the mark. The chart plotted the current and forecasted GDP growth rates for select cities in Canada. Red-coloured upward-sloping lines depicted cities with increasing economic growth rates, and grey-coloured downward-sloping lines highlighted those with slowing economic growth.

There is, however, a small problem. The chart erroneously showed some slowing economies as growing, and vice versa. Furthermore, the trajectory of the sloping lines would mislead readers into assuming that cities with parallel lines enjoyed a similar increase in the growth rate, which, of course, is not true. The graphical faux pas was certainly avoidable had a bar chart been used.
Source: The Globe and Mail, Page B6, May 15.

Of course, The Globe and Mail is not alone in coming up with math that simply doesn’t add up. While covering the Scottish independence vote in September 2014, CNN reported that Scots voted 110% in the referendum: 58% voted yes and another 52% voted no.
Source: Mail Online. September 19, 2014

The recent rise of data journalism has witnessed the emergence of data visualization, where editors increasingly reinforce narrative with creative infographics. While major news outlets such as The Economist, The New York Times, and The Wall Street Journal have retained experts in data science and visualization, most newspapers have entrusted the task to graphics departments that rely on tools not specifically designed for data visualization. At times, the outcome is math- and logic-defying graphics that present a false picture.

Even when charts correctly depict data, at times the visualizations are too complex for the ordinary newsreader to grasp. Powerful data visualization tools, such as D3 (a JavaScript library), are often abused to create graphics too rich in detail to comprehend. The use of hierarchical edge bundling, for instance, is becoming increasingly popular in the news media, resulting in complex graphics that are visually impressive but conceptually confusing.

Edward Tufte and Leland Wilkinson have spent a lifetime advising data enthusiasts on how to present data-driven information. Wilkinson is the author of The Grammar of Graphics, which sets out the fundamentals for presenting data. Wilkinson’s writings inspired Hadley Wickham to develop ggplot2, a graphing engine for R, which is increasingly becoming the tool of choice for data scientists. 

Tufte inspired Dona M. Wong, who was the graphics director at the Wall Street Journal. Ms. Wong authored The Wall Street Journal Guide to Information Graphics. Her book is a quintessential guide for those who work with data and would like to present information as charts. She uses examples from the Journal to illustrate the dos and don’ts of presenting data as info-graphics.

Let us return to the forecasted metropolitan growth rates in Canada. I prefer the horizontal bar chart instead. The bar chart offers me several options to highlight the main argument in the story. If I were interested in highlighting cities with the highest gains in growth since 2014, I would sort the cities accordingly, as is illustrated in the graphic on the left (see below). If I were interested in highlighting cities with the highest forecasted growth rate, I would sort them accordingly to result in the graphic on the right.

Dona Wong insists on simplicity in rendering. She concludes her book with a simple message for data visualization: simplify, simplify, simplify. The two bar charts simplify the same information presented by the Globe. The result is obvious: I avoid misrepresenting data. One can readily see that Halifax’s economy is forecasted to grow and Vancouver’s to shrink. The Globe’s rendering depicted exactly the opposite.
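A minimal base-R version of such a sorted horizontal bar chart might look like the sketch below. The growth figures are hypothetical, chosen only so that the signs match the story (Halifax growing, Vancouver shrinking); they are not the Globe's actual numbers.

```r
# Hypothetical forecasted GDP growth rates (%), for illustration only
growth <- c(Vancouver = -0.6, Calgary = -0.3, Montreal = 1.2,
            Toronto = 1.8, Halifax = 2.1)
growth <- sort(growth)   # sort so the ranking is legible at a glance

barplot(growth, horiz = TRUE, las = 1,
        col = ifelse(growth >= 0, "firebrick", "grey60"),
        xlab = "Forecasted GDP growth (%)",
        main = "Sorted bars cannot be misread as trajectories")
```

Because each bar starts at zero and the cities are sorted, a reader sees at once which economies are forecasted to grow and which to shrink; there are no sloped lines to misread as trends.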

Thursday, April 23, 2015

UP Express in Toronto: A train less ridden

What does a billion dollars' worth of transit investment get in Toronto? A piddly 5,000 daily riders. To put things in perspective, dozens of bus routes in Toronto carry more passengers every day than the trips forecasted for the Union-Pearson rail link (UP Express).

The rail link will connect Canada's two busiest transport hubs: Union Station and Pearson Airport. Despite the high-speed connection between the two, transport authorities expect only 5,000 daily riders on the UP Express. The King streetcar, in comparison, carries in excess of 65,000 daily riders.

The UP Express and the Sheppard subway extension are examples of transit money well wasted. A 2009 communiqué by Metrolinx estimated that the Georgetown expansion (including the UP Express) would cost over a billion dollars. The Globe and Mail reported that the Ontario government alone had invested $456 million in the UP Express. Instead of investing the scarce transit dollars in projects likely to deliver the highest increase in transit ridership, billions are being spent on projects that will have a marginal impact on addressing traffic congestion in the GTA.

With $29 billion in planned transport infrastructure investments, some of which will be publicised Thursday in the Ontario budget, the Province and the City need to have their priorities right. The very least would be to stop investing in projects that do not generate sufficient transit ridership.

One may argue that 5,000 fewer automobile trips to and from the airport should help ease congestion in the GTA. However, with over 12 million daily trips in the GTA, 5,000 fewer trips are unlikely to make any meaningful difference to traffic congestion. At the same time, taxpayers should focus on the cost-benefit trade-offs of transit investments. Notice the cost-benefit efficiency of the existing TTC bus service (192 Airport Rocket) to Pearson Airport, which carries over 4,000 daily passengers. A billion dollars later, the UP Express will move only one thousand additional riders.

In North America, fewer than 10 airports are connected to local subway or regional rail transit. With the exception of Ronald Reagan Washington National Airport in Washington, DC, most airports accessible by rail report that approximately 5% of trips to and from the airport are made by transit. The European experience, though, has been better: almost 35% of trips to and from Zurich airport were made on rail-based transit, and Munich airport reported 40% of trips by rail and bus.

Certain transit network attributes, which are missing for the UP Express, contribute to strong transit ridership to and from airports. For instance, the rail-based service to high-ridership airports does not terminate at the airport but continues further to serve the communities along the corridor. In addition, the airport lines at the successful airports are integrated with the rest of the rail-based transit system, instead of being standalone lines. The UP Express is a standalone rail line that connects to only one terminal at Pearson Airport. The prohibitive fare makes the ride uneconomical for commuters travelling in groups of two or more, who would find a cab ride cheaper and more convenient from most parts of suburban Toronto.

Two other key factors limit the ridership potential of the UP Express. First, Billy Bishop Airport near downtown Toronto caters to the short-haul business travel market. It has been argued in the past that business travellers originating in downtown Toronto would rather take the train than a cab to Pearson Airport. Given the frequency of service and the choice of destinations served by Billy Bishop Airport, business travellers increasingly favour the downtown airport, which eats into the UP Express's potential market share.

In addition, the peak operations at Pearson Airport coincide with the morning and afternoon peak commuting times in Toronto. This implies that one would have to commute to Union Station during the morning and afternoon peak travel periods to ride the UP Express. The extra effort in time and money required to travel to downtown Toronto from the inner suburbs alone will deter riders from using the Union-Pearson rail link.

The UP Express is yet another monument dedicated to public transit misadventures while the region continues to suffer from gridlock. Getting the transit priorities right is necessary before Ontario doles out $29 billion.

Thursday, April 9, 2015

Stata embraces Bayesian statistics

Stata 14 has just been released. The big new thing in version 14 is the introduction of Bayesian statistics. A wide variety of models can now be estimated with Stata by combining 10 likelihood models, 18 prior distributions, different types of outcomes, and multiple-equation models. Stata has also made a 255-page reference manual available for free to illustrate Bayesian statistical analysis.

Of course, R already offers numerous options for Bayesian inference. It will be interesting to hear from colleagues proficient in Bayesian statistics comparing Stata’s newly added functionality with what has already been available in R.
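For readers curious what Bayesian updating looks like, even base R can illustrate the idea without any add-on package. Here is a minimal conjugate example (the prior and data values are made up for illustration): a Beta prior combined with binomial data yields a Beta posterior in closed form.

```r
# Beta-Binomial conjugacy: prior Beta(a, b) plus k successes in n trials
# gives the posterior Beta(a + k, b + n - k)
a <- 2; b <- 2        # a weakly informative prior
k <- 7; n <- 10       # observed data: 7 successes in 10 trials

post_a <- a + k
post_b <- b + n - k
post_mean <- post_a / (post_a + post_b)      # posterior mean = 9/14

# a 95% credible interval straight from the Beta quantile function
ci <- qbeta(c(0.025, 0.975), post_a, post_b)
```

Packages such as those on the CRAN Bayesian task view handle models far beyond this conjugate case, which is what makes the comparison with Stata 14's new machinery interesting.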

Given the hype around big data and the newly generated demand for data mining and advanced analytics, it would have been timely for Stata to also add data mining and machine learning algorithms. My two cents: data mining algorithms are in greater demand than Bayesian statistics. Stata users will likely have to wait a year or more to see such capabilities. In the meantime, R offers several options for data mining and machine learning algorithms.

Monday, November 3, 2014

R: What to do when help doesn't arrive

R is great when it works; not so much when it doesn’t. Specifically, this becomes a concern when packages are not fully illustrated in the accompanying help documentation and the authors of the package don’t respond (in time).

I am not suggesting that the package authors should respond to every email that they receive. My request is that the documentation should be complete enough so that the authors’ help is no longer required on a day-to-day basis.

Recently, a colleague in the US and I became interested in the mlogit package. We wanted to use the weights option in the package. Just like most other packages, mlogit does not illustrate how to use weights, but advises that the option is available. We assumed that the weights would work in a certain way (see page 26 of the hyperlinked document). However, when I estimated the model with weights, mlogit did not replicate the results from a popular textbook on econometrics. Here are the details.

We wanted to see whether the weights option could be used in an alternative-specific logit formulation when the sampled data do not conform to the market shares observed in the underlying population. For instance, in a travel choice model, one may be tempted to over-sample train commuters and under-sample car commuters, because car commuters far outnumber train commuters for inter-city travel in the underlying population. This is true for most of Canada and the US. In such circumstances, we would weight the data set so that the estimated model reproduces the population market shares rather than the sample shares.
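The arithmetic of choice-based weighting itself is simple and can be sketched in a few lines of base R. The mode shares below are hypothetical, not Greene's data: each observation receives the ratio of its chosen alternative's population share to its sample share, so over-sampled modes are down-weighted and under-sampled modes are up-weighted.

```r
# Hypothetical population and sample shares of the chosen travel mode
pop_share    <- c(car = 0.70, train = 0.15, bus = 0.15)
sample_share <- c(car = 0.40, train = 0.40, bus = 0.20)

# choice-based sampling weight: population share / sample share
w <- pop_share / sample_share   # car 1.75, train 0.375, bus 0.75

# each observation is weighted by the weight of its chosen alternative
chosen <- c("car", "train", "car", "bus")
obs_w  <- w[chosen]
```

The open question with mlogit was whether its weights argument applies per-observation weights of this kind in the likelihood, which is what replicating Greene's choice-based results would require.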

The commercially available software NLogit/LimDep can do this with ease. I wanted to replicate the results for choice-based weights for the conditional logit model in Professor Bill Greene's book, Econometric Analysis. This is illustrated on page 853 of the 6th edition of the text, where Table 23.24 presents the parameter estimates for a conditional (McFadden) logit model in un-weighted and choice-based-weighted versions. I replicated the results using NLogit with a simple addition of population market shares to the two-line syntax. However, the results generated by the mlogit package bear no resemblance to the ones listed in Econometric Analysis.

It turns out that Stata is also limited in the way it handles weights for its two estimation options, asclogit and clogit. I know this because colleagues at Stata were quite diligent in responding to my requests. It’s not the same with mlogit, which may or may not be able to handle the weights; we will only know when the author responds.

I am recommending that individual authors should not bear the sole responsibility for supporting R packages. An individual could be ill, busy, or unavailable for a variety of reasons. This limitation could be proactively dealt with if the R community collectively generated help documentation with detailed worked-out examples of all available options (including weights), not just the few frequently used ones.

Improving documentation will be key to helping R reach the everyday users of statistical analysis. The tech-savvy can iron out the kinks; they have the curiosity, patience, and time on their hands. The rest of the world is not that fortunate.

I propose that users of the packages, and not just the authors, should collaborate to generate help documentation as vignettes and YouTube videos. This will do more to popularize R than another 6,000 new packages that few may know how to work with.