Wednesday, July 27, 2016

Book Review: Getting Started With Data Science

I Programmer's Kay Ewbank reviews Getting Started with Data Science: Making Sense of Data with Analytics.

By Kay Ewbank

If you've enjoyed books such as Freakonomics or Outliers, you'll feel at home reading this book, as it uses a similar approach: take an interesting question such as 'Does the higher price of cigarettes deter smoking?', and use that as the basis for some data analysis.

The aim is to teach you how to do your own analyses. Haider works through the examples in R, Stata, SPSS and SAS. Within the book, the examples are worked mainly in R and one of the other languages. The code for the remaining languages is available for download from the IBM Press website, along with details of how to use it.

The book opens with a chapter called 'the bazaar of storytellers' that discusses what data science is and gives the author's definition of a data scientist. The next chapter, data in the 24/7 connected world, identifies sources of data that you can analyse, and also introduces the concept of big data. Chapter three looks at how data becomes meaningful when it is used as the basis for 'stories'. Haider's view is that the strength of data science lies in the power of the narrative, and that is what underpins most of the book.

From a practical perspective, the book begins to get useful in chapter four, which looks at how you can generate summary tables, including multi-dimensional tables. Next is a chapter on graphics and how to generate them. If you're thinking that it seems a bit odd to concentrate on the 'end result' first, you have to remember that the author's view is that data analysis is only useful if your audience actually looks at the results and understands them.

The next chapter gets more into the workings of data analysis with an examination of hypothesis testing using techniques such as t-tests and correlation analysis. Regression analysis is looked at next, based on the notion of "why tall parents don't have even taller children". This is a fun chapter, with examples including consumer spending on food and alcohol, housing markets, and whether the appearance of teachers affects their evaluations by students.

A chapter on analysis of binary variables considers logit and probit models using data from New York transit use. Categorical data and multinomial variables are the topic of the next chapter, which expands on the ideas of logit models.

Spatial data analysis is covered next, taking us into the use of GIS and how it has expanded the options for data analysis. There's a good chapter on time series analysis looking at how regression models can be used with time series data, using the example of forecasting housing markets.

The final chapter introduces the field of data mining. It's more of a taster discussing some of the techniques that can be used, but fun anyway.

Overall, this is a book that is accessible, interesting and still manages to introduce the statistical techniques you need to use for real data analytical work. A good way to get into data analysis. 

Sunday, July 24, 2016

The collaborative innovation landscape in data science

Computing platforms should be like Lego. That is, they should provide the fundamental building blocks and enable the users' imagination to innovate. The latest issue of Stata Journal exemplifies how Stata and, by the same account, R provide the platform for the users to innovate beyond the innate capacity of the core group responsible for software development.

Earlier in July, I received by email the table of contents for the Stata Journal’s latest issue. I was expecting to see one or maybe two articles of interest. What I found was quite surprising. I was intrigued by almost every article, which made me wonder whether I had lost my academic focus so that almost anything is now of interest.

As I browsed through the journal, I noticed that the authors contributing to the journal were truly international. From academic colleagues in Germany and the United States to colleagues working for central banks in Europe, the diversity was hard to ignore. And that’s where I spotted the apparent similarity between R and Stata. Even though Stata is a proprietary computing platform, the innovation landscape is not restricted to the core team at Stata. This is similar to the R environment where literally thousands of packages (algorithms) for R are contributed by independent researchers.

For R, such a collaborative ecosystem comes naturally because R is free software. Stata, on the other hand, follows a more traditional market approach of charging for the use of the software. Yet both Stata and R are able to attract leading data scientists (my preferred term for statisticians, econometricians, and others) who volunteer their expertise and readily share their innovations with the larger community.

Returning to the latest issue, I was first attracted to the article on assessing inequality using percentile shares. As the author, Ben Jann from the University of Bern, noted, income inequality has come to the forefront of academic and social discourse since the publication of Thomas Piketty’s Capital in the Twenty-First Century. I have been intrigued by the topic for years, primarily influenced by the incredible works of Joseph Stiglitz, Angus Deaton, and others. Piketty’s Capital, despite the criticism (watch Deirdre McCloskey’s careful, yet blunt, review of Capital), has made percentile shares a familiar tool for analyzing distributional inequalities.

Ben Jann has contributed pshare to Stata, a command that readily estimates percentile shares with the convenience of a single-line syntax. Using the data from the 1988 US National Study of Young Women, the command easily computes the income distribution, showing that the top 10 percent of women received 27% of the wages.
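
To make the idea of a percentile share concrete, here is a minimal Python sketch. It is not Jann's pshare implementation: it simply sorts the values, cuts them into equal-sized groups by count, and ignores survey weights, ties, and interpolation. The function name percentile_shares and the toy wage figures are my own assumptions for illustration.

```python
def percentile_shares(values, n_groups=10):
    """Share of the total held by each quantile group, poorest group first.

    Simplified illustration: groups are formed by observation count,
    with no weights and no interpolation at the group boundaries.
    """
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n_groups <= 0 or n == 0 or total <= 0:
        raise ValueError("need at least one value, a positive total, and n_groups > 0")

    shares = []
    for g in range(n_groups):
        lo = g * n // n_groups          # first index of group g
        hi = (g + 1) * n // n_groups    # one past the last index of group g
        shares.append(sum(ordered[lo:hi]) / total)
    return shares


# Hypothetical wages for ten people (not the survey data discussed above).
toy_wages = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
shares = percentile_shares(toy_wages)
print(round(shares[-1], 3))  # share of the top 10 percent: 0.182
```

In this toy example the top decile holds 100 of the 550 units of wages, or about 18%.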

For R users, I would recommend the ineq package by Achim Zeileis and Christian Kleiber for generating inequality and poverty indices.
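
As an example of the kind of inequality index such packages report, the following Python sketch computes a Gini coefficient from unweighted values. It uses the textbook sample formula without a small-sample correction; the function name gini is hypothetical, and the code is not taken from the ineq package.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfect equality).

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    where x is sorted in ascending order and i runs from 1 to n.
    """
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")

    rank_weighted = sum(i * x for i, x in enumerate(ordered, start=1))
    return 2 * rank_weighted / (n * total) - (n + 1) / n


print(gini([1, 1, 1, 1]))  # everyone earns the same: 0.0
print(gini([0, 0, 0, 1]))  # one person earns everything: 0.75
```

Without the small-sample correction, the maximum for n observations is (n - 1) / n, which is why the second case gives 0.75 rather than 1.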

My primary area of interest lies at the intersection of real estate and transportation in urban settings. I am always struggling with how location impacts rents, growth, and other socio-economic outcomes. Determining the location or, for that matter, distances between entities is usually a struggle. Thanks to GIS software, such as QGIS, MapInfo, and Maptitude, the task of spatial computation has become a lot easier. Still, one has to become proficient on several computing platforms to obtain distances or travel times to and from locations. Stata offers two interesting solutions for these tasks. The more recent one is reported in the latest issue. Stephan Huber and Christoph Rust from the University of Regensburg have contributed a new command that computes network distances (not just the straight-line Euclidean distances) and network travel times for the shortest paths, relying on Open Source Routing Machine and OpenStreetMap.

Earlier, in 2011, Adam Ozimek and Daniel Miles contributed commands to geocode and compute travel times between origins and destinations for different modes of travel, including public transit.

R is equally equipped for similar tasks. Timothée Giraud, Robin Cura, and Matthieu Viry wrote the R package osrm to determine travel times and distances. Other R packages include gdistance and gmapsdistance, to name a few.
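
The difference between a straight-line distance and a network distance is easy to see on a toy example. The Python sketch below does not call Open Source Routing Machine or any of the packages named above; it runs Dijkstra's shortest-path algorithm on a tiny hand-made road network with planar coordinates. The node names, coordinates, and function names are all hypothetical.

```python
import heapq
import math


def straight_line(a, b):
    """Euclidean distance between two (x, y) points on a plane."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def network_distance(coords, edges, source, target):
    """Shortest-path distance along undirected edges (Dijkstra's algorithm).

    Each edge's length is the straight-line length of that road segment.
    Returns math.inf if the target cannot be reached.
    """
    graph = {node: [] for node in coords}
    for u, v in edges:
        length = straight_line(coords[u], coords[v])
        graph[u].append((v, length))
        graph[v].append((u, length))

    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry; a shorter route was already found
        for neighbour, length in graph[node]:
            candidate = dist + length
            if candidate < best.get(neighbour, math.inf):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return math.inf


# Hypothetical network: the only road from A to C goes around a corner at B.
coords = {"A": (0, 0), "B": (3, 0), "C": (3, 4)}
edges = [("A", "B"), ("B", "C")]

print(straight_line(coords["A"], coords["C"]))      # 5.0
print(network_distance(coords, edges, "A", "C"))    # 7.0
```

Here the straight-line distance understates the trip by two units, which is the kind of gap that matters when relating location to rents or travel behaviour.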

In summary, I remain optimistic about the future of both open source and proprietary computing platforms. Altruism is the name of the game: thousands of innovators are making their generous contributions available for the benefit of society at large, making it easier for applied data scientists to satisfy their curiosity by applying readily available algorithms to solve riddles.

Monday, July 18, 2016

Data Science Boot Camp completed at Ryerson University

I am pleased to update you on the Data Science Boot Camp we ran at the Ted Rogers School of Management at Ryerson University in Toronto in collaboration with IBM. The 9-week Boot Camp concluded on July 15.

We received a total of 1,137 registrations, and attendance ranged between 100 and 150 participants each week.

I have made the resources (software code, PowerPoint slides, etc.) available online. We recorded 24 hours of video, which will be online soon.

I restricted the hands-on training to R, hence the Boot Camp serves as an introduction to analytics with R. You are welcome to share these resources.

A breakdown of the weekly schedule is provided in the following hyperlinked list: