Monday, May 20, 2019

Modern Data Science with R: A review


Some say data is the new oil. Others equate its worth to water. And then there are those who believe that data scientists will be (in fact, they already are) among the most sought-after workers in knowledge economies.

Millions of data-centric jobs require millions of trained data scientists. However, the installed capacity of graduate and undergraduate programs in data science is nowhere near enough to meet this demand for years to come.

So how do we produce data scientists?

Given the enormous demand for data scientists and the fixed supply from higher education institutions, it is quite likely that one must look beyond colleges and universities to train the large number of data scientists the marketplace desires.

Getting trained on the job is one possible route. This will require repurposing the existing workforce. To prepare the current workforce for data science, one needs training manuals. One such manual is Modern Data Science with R (MDSR).

Published by CRC Press (Taylor and Francis Group) and authored by three leading academics, Professors Baumer, Kaplan, and Horton, MDSR is the missing manual for data science with R. The book is as relevant to data science programs in higher education as it is to practitioners who would like to embark on a career in data science or get a taste of an aspect of data science that they have not explored before.

As the book’s name suggests, the text is based on R, one of the most popular and versatile computing platforms. R is free, open-source software that is continuously developed by thousands of volunteers. In addition to base R, which comes bundled with thousands of commands and functions, user-written packages, whose number has exceeded 14,000 (as of May 2019), further expand the universe of features, making R perhaps the most diverse computing platform.

Despite the immense popularity of data science, only a handful of titles focus exclusively on the topic. Hadley Wickham’s R for Data Science and R Cookbook by Paul Teetor are the two other worthy texts. MDSR is unique in that it serves as an introduction to a whole host of analytic techniques that are seldom covered in one title.

In the following paragraphs, I’ll discuss the salient features of the textbook. I begin with my favourite attribute of the book: its organization. Instead of muddling through theories and philosophies, the book gets straight to business and starts the conversation with data visualization. A graphic is worth a thousand words, and MDSR is proof of it.

And since Hadley Wickham’s influence on data science is ubiquitous, MDSR also embraces Wickham’s implementation of the Grammar of Graphics in R: ggplot2, one of the most popular R packages.

Another avenue where Wickham’s influence is widely felt is data wrangling. A suite of R packages bundled under the broader rubric of Tidyverse is influencing how data scientists manipulate small and big data. Chapter 4 in MDSR is perhaps one of the best and most succinct introductions to data wrangling with R and Tidyverse. From the simplest to more advanced examples, MDSR equips the beginner with the basics and the advanced user with new ways to think about analyzing data.
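To give a flavour of the Tidyverse style of wrangling, here is a minimal sketch of my own (it is not taken from the book). It assumes the dplyr package is installed and uses R’s built-in mtcars data set; the name manual_by_cyl is purely illustrative.

```r
library(dplyr)

# Illustrative only: average fuel economy of manual-transmission cars,
# broken down by number of cylinders.
manual_by_cyl <- mtcars %>%
  filter(am == 1) %>%            # keep cars with a manual transmission
  group_by(cyl) %>%              # one group per cylinder count
  summarise(
    n_cars   = n(),              # number of cars in each group
    mean_mpg = mean(mpg)         # average miles per gallon in each group
  ) %>%
  arrange(desc(mean_mpg))        # most fuel-efficient group first

print(manual_by_cyl)
```

Each verb does one job, and the pipe chains them in reading order, which is much of what makes this style approachable for beginners.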

A key feature of MDSR is that it’s not another book on statistics or econometrics with R. Yours truly is guilty of authoring one such book. Instead, MDSR is a book focused squarely on data manipulation. The treatment of statistical topics is not absent from the book; however, it’s not the book’s focus. It is for this reason that the discussion of regression models is in the appendices.

But make no mistake, even when statistics is not the focus, MDSR offers sound advice on the practice of statistics. Section 7.7, The perils of p-values, warns novice statisticians against becoming unsuspecting victims of hypothesis testing.

The book’s distinguishing feature remains the diversity of data science challenges it addresses. For instance, in addition to data visualization, the book offers an introduction to interactive data graphics and dynamic data visualization.

At the same time, the book covers other diverse topics, such as database management with SQL, working with spatial data, analyzing text-based (unstructured) data, and the analysis of networks. A discussion about ethics in data science is undoubtedly a welcome feature in the book.

The book is punctuated with hundreds of useful, hands-on data science examples and exercises, providing ample opportunities to put concepts into practice. The book’s accompanying website offers additional resources and code examples. At the time of this review, not all code was available for download.

Also, while I was able to reproduce the more straightforward examples, I ran into trouble with the complex ones. For instance, I could not generate the advanced spatial maps showing flight origins and destinations.

My recommendation to the authors would be to maintain an active supporting website because R packages are known to evolve, and some functionality may change or disappear over time. For instance, the mapping functions in the ggmap package that rely on Google Maps now require a Google Maps API key, or else the maps will not display. This change likely occurred after the book was published.
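For readers who hit the same wall, the workaround looks roughly like the sketch below. It is my own and not from the book. It assumes a recent version of ggmap that provides register_google(), and a Google Maps API key stored in an environment variable that I have arbitrarily named GOOGLE_MAPS_API_KEY. The Toronto coordinates are only an example.

```r
library(ggmap)

# Assumption: the key lives in an environment variable with the
# hypothetical name GOOGLE_MAPS_API_KEY.
api_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")

if (nzchar(api_key)) {
  # Register the key for this R session before requesting any Google map.
  register_google(key = api_key)

  # Fetch and draw a base map centred (approximately) on Toronto.
  toronto_map <- get_googlemap(center = c(lon = -79.38, lat = 43.65), zoom = 11)
  print(ggmap(toronto_map))
} else {
  message("No Google Maps API key found; skipping the map request.")
}
```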

In summary, Modern Data Science with R deserves a place in the personal libraries of aspiring and experienced data scientists alike.

Murtaza Haider lives in Toronto and teaches in the Department of Real Estate Management at Ryerson University. He is the author of Getting Started with Data Science: Making Sense of Data with Analytics, which was published by the IBM Press/Pearson.
