Monday, August 22, 2016

Five Questions about Data Science

---

Recently, we were able to ask Murtaza Haider five questions about his new book from IBM Press, “Getting Started with Data Science: Making Sense of Data with Analytics.” Below, the author talks about the benefits of data science in today’s professional world.

Getting Started with Data Science

1. What are some examples of data science altering or impacting traditional professional roles already?

Only a few years ago, there was no job with the title of chief data scientist. But that was then. Small and large corporations, and increasingly government agencies, are putting together teams of data scientists and analysts under the leadership of chief data scientists. Even the White House has a Chief Data Scientist position, currently held by Dr. DJ Patil.

The traditional role for those who analyzed data was that of a computer programmer or a statistician. In the past, firms collected large amounts of data to archive rather than to subject it to analytics to assist with smart decision-making. Companies did not see value in turning data into insights and instead relied on the gut feeling of managers and anecdotal evidence to make decisions.

Big data and analytics have alerted businesses and governments to the latent potential of turning bits and bytes into profits. To enable this transformation, hundreds of thousands of data scientists and analysts are needed. Recent reports suggest that the shortage of such professionals will be in the millions. No wonder we see hundreds of postings for data scientists on LinkedIn.

As businesses increasingly depend upon analytics-driven decision making, data scientists and analysts are becoming front-office superstars, which is quite a change from their past as back-office workers.

2. What steps can a professional take today to learn how and why to implement data science into their current role?

Sooner rather than later, workers will find their managers asking them to assume additional responsibilities that involve dealing with data, and either generating or consuming analytics. Smart professionals who are uninitiated in data science should therefore proactively address this shortcoming in their portfolio by acquiring skills in data science and analytics. Fortunately, in a world awash with data, the opportunities to acquire analytic skills are also ubiquitous.

For starters, professionals should consider enrolling in open online courses offered by the likes of Coursera and BigDataUniversity.com. These platforms offer a wide variety of training opportunities for beginners and advanced users of data and analytics. At the same time, most of these offerings are free.

For those professionals who would like to pursue a more structured approach, I suggest that they consider continuing education programs offered by the local universities focusing on data and analytics. While working full-time, the professionals can take part-time courses in data science to fill the gap in their learning and be ready to embrace impending change in their roles.

3. Do you need programming experience to get started in data? What kind of methods and techniques can you utilize in a program more commonly used, such as Excel?

Computer programming skills are a definite plus for data scientists, but they are certainly not a limiting factor that would prevent those trained in other disciplines from joining the world of data scientists. In my book, Getting Started with Data Science, I mentioned examples of individuals who took short courses in data science and programming after graduating from non-empirical disciplines, and subsequently were hired in data scientist roles that paid lucrative salaries.

The choice of analytics tools depends largely on the discipline and the type of organization you are currently working for or intend to work for in the future. If you intend to work for corporations that generate truly big data, such as telecom and Internet-based establishments, you need to be proficient in big data tools, such as Spark and Hadoop. If you would like to be employed in an industry that tracks social media, you would require skills in natural language processing and proficiency in languages such as Python. If you happen to be interested in a traditional market research firm, you need proficiency in analytics software, such as SPSS and R.

If your focus is on small and medium-sized enterprises, proficiency in Excel could be a great asset, allowing you to deploy its analytics capabilities, such as Pivot Tables, to work with small data sets.
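
For readers who have not used a Pivot Table, the idea can be sketched outside Excel as well. The short Python example below is only an illustration: the sales records, the field names, and the `pivot_sum` helper are all invented here and are not taken from the book. It sums one field for every combination of a row category and a column category, which is the core of what a Pivot Table does.

```python
from collections import defaultdict

# Hypothetical sales records, invented purely for illustration.
sales = [
    {"region": "East", "product": "Tea", "revenue": 120.0},
    {"region": "East", "product": "Coffee", "revenue": 200.0},
    {"region": "West", "product": "Tea", "revenue": 80.0},
    {"region": "West", "product": "Coffee", "revenue": 150.0},
    {"region": "East", "product": "Tea", "revenue": 60.0},
]


def pivot_sum(records, row_key, col_key, value_key):
    """Sum value_key for every (row, column) combination, like a pivot table."""
    table = defaultdict(lambda: defaultdict(float))
    for record in records:
        table[record[row_key]][record[col_key]] += record[value_key]
    # Convert to plain dicts so the result prints and compares cleanly.
    return {row: dict(cols) for row, cols in table.items()}


# Rows are regions, columns are products, cells hold total revenue.
revenue_by_region = pivot_sum(sales, "region", "product", "revenue")
for region, totals in sorted(revenue_by_region.items()):
    print(region, totals)
```

Swapping the row and column keys, or summing a different field, gives a different view of the same records, which is much of the appeal of Pivot Tables for small data sets.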

A successful data scientist is one who knows some programming, has a basic understanding of statistical principles, possesses a curious mind, and is capable of telling great stories. I argue that without storytelling capabilities, a data scientist will be limited in his or her ability to become a leader in the field.

4. How do you see data science affecting education and training moving forward? What benefits will it bring to learning at all levels?

Schools, colleges, universities and others involved in education and learning are putting big data and analytics to good use. Universities are crunching large amounts of data to determine what gaps in learning at the high school level act as impediments to success in the future. Schools are improving not just curriculum, but also other strategies to improve learning outcomes. For instance, research in India using large amounts of data showed that when children in low-income communities were offered free meals at school, their dropout rates declined and their academic achievements improved.

Big data and analytics provide instructors and administrators the opportunity to test their hypotheses about what works and what doesn’t in learning, and replace anecdotes with hard evidence to improve pedagogy and learning. Learning has taken a new shape and form with open online courses in all disciplines. These transformative changes in learning have been enabled by advances in information and communication technologies, and the ability to store massive amounts of data.

5. Do you think that modern governments and societies are prepared for the changes that big data and data science might bring to the world?

Change is inevitable. Whatever modern governments and societies may prefer, they will have to embrace change. Fortunately, smart governments and societies have already embraced data-driven decision-making and evidence-based planning. Governments in developing countries are already using data and analytics to devise effective poverty-reducing strategies. Municipal governments in developed economies are using data and advanced analytics to find solutions to traffic congestion. Research in health and well-being is leveraging big data to discover new medicines and cures for illnesses that challenge us all.

As societies embrace data and analytics as tools to engineer prosperity and well-being, our collective abilities to achieve a better tomorrow will be further enhanced.

Wednesday, August 10, 2016

So you want to be a data scientist



The New York Times made it look so easy. Take a few courses in data science and a web-based startup will readily pay top dollar for your newly acquired skills.

Since the McKinsey Global Institute reported on the impending shortage of data crunchers, wannabe data scientists have been searching for learning opportunities in big data analytics. Newspaper coverage suggests that even with limited previous exposure to empirics, one may enroll in MOOCs or join programming boot camps to establish one's bona fides.

In a recent blog post on Forbes.com, Meta S. Brown, the author of Data Mining for Dummies, gave four reasons not to get an advanced degree in data science. I, on the other hand, believe that a structured learning environment is exactly what many need to enable the career change they have contemplated for years but have not acted on.

It all depends upon what kind of learner you are. If you are a disciplined, self-motivated, self-actuated individual, you can pick up the skills by attending MOOCs or participating in coding boot camps.

But if you are like the rest of us, who once enrolled in a free online course, but didn't complete it, you need some structure. A degree or a certificate in data science or business analytics is exactly what you need to upgrade your skills and be part of the network that will help you reorient your career.

In my book, Getting Started with Data Science, I mentioned Paul Minton, who was making $20,000 serving tables in New York. However, a three-month programming course at the Zipfian Academy turned his life around. He earned over $100,000 in 2014 as a data scientist for a web startup in San Francisco. "Six figures, right off the bat ... To me, it was astonishing," he told The New York Times.

When aspiring data scientists think of a career in the 'glamorous' world of big data and analytics, they think of Mr. Minton. His story, though a bit Cinderella-ish, is true, but rare. He works for Change.org! However, not everyone should expect a similar outcome. In addition to good fortune, Mr. Minton had majored in math in his undergraduate training, and we all know that math helps. It would be unwise, however, to assume that with almost no empirical background, one can master the complex world of data and algorithms in a matter of a few weeks and be gainfully employed.

While speaking at meet-ups organized by IBM's BigDataUniversity, I encounter dozens of enthusiasts who are keen to start training in data science but do not know where to begin. I advise them to build on their core competencies and domain knowledge. For instance, if you studied journalism or creative writing as an undergraduate, you might want to learn how to analyze socioeconomic data instead of trying to set up Hadoop clusters, a big data task best left to computer scientists and engineers.

If you are a disciplined learner, you can explore data science training offered as MOOCs. Coursera, one of the largest MOOC platforms, listed several data science courses among the top 10 most popular courses in 2015. IBM's Big Data University (BDU) is another platform dedicated to promoting training in data science and analytics. Not only does BDU offer online learning resources similar to those of other platforms, it also offers cloud-based resources for hands-on training through the Data Scientist Workbench.

The Workbench provides state-of-the-art computing solutions for regular-sized data. These include R, Python, and OpenRefine. To wrangle big data, the Workbench offers Hadoop and Spark-based solutions. Such coupling of computing infrastructure with online learning resources frees new learners from concerns about installing and maintaining software and clustering hardware.

Learners who prefer a structured learning environment also have several options. They can register for courses or certificates offered by universities' continuing education faculties, enroll in an online graduate degree in data science, or take the more traditional approach of enrolling in a full- or part-time Master's program.

A good place to search for learning opportunities is the KDNuggets website that maintains detailed lists of post-graduate programs in data science including full-time, part-time, and online masters and other certifications.

Once you have earned some credentials, you still have to prove your worth to future employers. If you are making a switch from another career, your experience may not be of much use in your pursuits in the data-centric world. My advice to the novice data scientists lacking experience is to ask the potential employer not necessarily for a job, but instead for a data set and a puzzle. If you can solve a data-oriented problem for a firm as part of the vetting process, you can overcome the shortcomings in your résumé.

Those who are still on the fence about taking the plunge into the world of big data and analytics should know that the demand for data scientists far exceeds the capacity of universities and colleges to produce them. This is unlikely to change soon. Act now and embrace data.

Wednesday, July 27, 2016

Book Review: Getting Started With Data Science

I Programmer's Kay Ewbank reviews Getting Started with Data Science: Making Sense of Data with Analytics.

By Kay Ewbank

If you've enjoyed books such as Freakonomics or Outliers, you'll feel at home reading this book as it uses a similar approach: take an interesting question such as 'Does the higher price of cigarettes deter smoking?', and use that as the basis for some data analysis.

The aim is to teach you how to do your own analyses. Haider works through the examples in R, Stata, SPSS and SAS. Within the book the examples are worked mainly in R, and one of the other languages. The code for the other languages is available for download from the IBM Press website, along with details of how to use it. 

The book opens with a chapter called 'the bazaar of storytellers' that discusses what data science is and gives the author's definition of a data scientist. The next chapter, data in the 24/7 connected world, identifies sources of data that you can analyse, and also introduces the concept of big data. Chapter three looks at how data becomes meaningful when it is used as the basis for 'stories'. Haider's view is that the strength of data science lies in the power of the narrative, and that is what underpins most of the book.

From a practical perspective, the book begins to get useful in chapter four, which looks at how you can generate summary tables, including multi-dimensional tables. Next is a chapter on graphics and how to generate them. If you're thinking that it seems a bit odd to concentrate on the 'end result' first, you have to remember that the author's view is that data analysis is only useful if your audience actually looks at the results and understands them.


The next chapter gets more into the workings of data analysis with an examination of hypothesis testing using techniques such as t-tests and correlation analysis. Regression analysis is looked at next, based on the notion of "why tall parents don't have even taller children". This is a fun chapter, with examples including consumer spending on food and alcohol, housing markets, and whether the appearance of teachers affects their evaluations by students.
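
The "tall parents" idea can be illustrated with a small least-squares fit. What follows is a minimal sketch in Python rather than the book's own R code; the `ols_fit` helper and the height figures are made up for illustration and are not data from the book. When the fitted slope is positive but below one, children of very tall parents are predicted to be tall, yet closer to the average than their parents.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up heights in centimetres, for illustration only.
parent_cm = [160.0, 165.0, 170.0, 175.0, 180.0, 185.0]
child_cm = [166.0, 167.0, 171.0, 173.0, 176.0, 179.0]

intercept, slope = ols_fit(parent_cm, child_cm)
predicted_child = intercept + slope * 190.0  # prediction for a 190 cm parent

print(f"slope = {slope:.3f}, intercept = {intercept:.1f}")
print(f"predicted child height for a 190 cm parent: {predicted_child:.1f} cm")
```

In this invented sample the slope is below one, so the predicted child of a 190 cm parent is shorter than 190 cm while still above the sample average.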

A chapter on analysis of binary variables considers logit and probit models using data from New York transit use. Categorical data and multinomial variables are the topic of the next chapter, which expands on the ideas of logit models.

Spatial data analysis is covered next, taking us into the use of GIS and how it has expanded the options for data analysis. There's a good chapter on time series analysis looking at how regression models can be used with time series data, using the example of forecasting housing markets.

The final chapter introduces the field of data mining. It's more of a taster discussing some of the techniques that can be used, but fun anyway.

Overall, this is a book that is accessible, interesting and still manages to introduce the statistical techniques you need to use for real data analytical work. A good way to get into data analysis. 


Sunday, July 24, 2016

The collaborative innovation landscape in data science


Computing platforms should be like Lego. That is, they should provide the fundamental building blocks and enable the users' imagination to innovate. The latest issue of Stata Journal exemplifies how Stata and, by the same account, R provide the platform for the users to innovate beyond the innate capacity of the core group responsible for software development.

Earlier in July, I received by email the table of contents for the Stata Journal’s latest issue. I was expecting to see one or maybe two articles of interest. What I found was quite surprising. I was intrigued by almost every article, which made me wonder whether I had lost my academic focus, so that almost anything is now of interest.

As I browsed through the journal, I noticed that the authors contributing to the journal were truly international. From academic colleagues in Germany and the United States to colleagues working for central banks in Europe, the diversity was hard to ignore. And that’s where I spotted the apparent similarity between R and Stata. Even though Stata is a proprietary computing platform, the innovation landscape is not restricted to the core team at Stata. This is similar to the R environment where literally thousands of packages (algorithms) for R are contributed by independent researchers.

For R, such a collaborative ecosystem comes naturally because R is free software. Stata, on the other hand, follows a more traditional market approach of charging for the use of the software. Yet both Stata and R are able to attract leading data scientists (my preferred term for statisticians, econometricians, and others) who volunteer their expertise and readily share their innovations with the larger community.

Returning to the latest issue, I was first attracted to the article on assessing inequality using percentile shares. As the author, Ben Jann from the University of Bern, noted, income inequality has come to the forefront of academic and social discourse since the publication of Thomas Piketty’s Capital in the Twenty-First Century. I have been intrigued by the topic for years, primarily influenced by the incredible works of Joseph Stiglitz, Angus Deaton, and others. Piketty’s Capital, despite the criticism (watch Deirdre McCloskey’s careful, yet blunt, review of Capital), has made percentile shares a familiar tool for analyzing distributional inequalities.

Ben Jann has contributed pshare to Stata, a command that readily estimates inequalities with the convenience of a single-line syntax. Using the data from the 1988 US National Study of Young Women, the command easily computes the income distribution, showing that the top 10 percent of women received 27% of the wages.

For R users, I would recommend the ineq package by Achim Zeileis and Christian Kleiber for generating inequality and poverty indices.  
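
To make the notion of a percentile share concrete, here is a minimal Python sketch. It is not the pshare command or the ineq package: the `top_share` function and the wage figures are hypothetical, and the sketch ignores survey weights and the interpolation a proper estimator might use. It simply sorts the observations and reports what fraction of the total goes to the top group.

```python
def top_share(incomes, top_fraction):
    """Share of total income received by the top `top_fraction` of earners."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    ordered = sorted(incomes, reverse=True)
    # Number of earners in the top group, with at least one earner.
    k = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical hourly wages, invented for illustration only.
wages = [12, 15, 18, 20, 22, 25, 30, 35, 45, 78]

share = top_share(wages, 0.10)
print(f"Top 10 percent share of wages: {share:.0%}")
```

With perfectly equal wages the top 10 percent would receive exactly 10 percent of the total, so the gap between the computed share and 10 percent is a quick read on how concentrated the distribution is.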

My primary area of interest lies at the intersection of real estate and transportation in urban settings. I am always struggling with how location impacts rents, growth, and other socio-economic outcomes. Determining the location or, for that matter, distances between entities is usually a struggle. Thanks to GIS software such as QGIS, MapInfo, and Maptitude, the task of spatial computation has become a lot easier. Still, one has to become proficient on several computing platforms to accomplish the necessary tasks of getting distances or travel times to and from locations. Stata offers two interesting solutions for these tasks. The more recent one is reported in the latest issue. Stephan Huber and Christoph Rust from the University of Regensburg have contributed a new command that computes network distances (not just the straight-line Euclidean distances) and network travel times for the shortest paths, relying on the Open Source Routing Machine and OpenStreetMap.

Earlier in 2011, Adam Ozimek and Daniel Miles contributed commands to geocode and compute travel times between origins and destinations for different modes of travel, including public transit.

R is equally equipped for similar tasks. Timothée Giraud, Robin Cura, and Matthieu Viry programmed the R package osrm to determine travel times and distances. Other R packages include gdistance and gmapsdistance, to name a few.
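
To see why network distances matter, it helps to have the straight-line baseline in hand. The Python sketch below is an illustration only and does not call OSRM or any of the packages named above: the `great_circle_km` function is written here for the example, and the city coordinates are approximate. It computes the great-circle ("as the crow flies") distance with the haversine formula; a routing engine would normally return a longer distance because roads rarely follow a straight line.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates (latitude, longitude) for two Canadian cities.
toronto = (43.6532, -79.3832)
ottawa = (45.4215, -75.6972)

toronto_ottawa_km = great_circle_km(*toronto, *ottawa)
print(f"Straight-line distance: {toronto_ottawa_km:.0f} km")
```

Comparing this figure with the driving distance reported by a routing service gives a rough sense of how much a street network adds to the trip.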


In summary, I remain optimistic about the future of both open source and proprietary computing platforms. Altruism is the name of the game: thousands of innovators are making their contributions available for the benefit of society at large, making it easier for applied data scientists to satisfy their curiosity by applying readily available algorithms to solve riddles.

Monday, July 18, 2016

Data Science Boot Camp completed at Ryerson University

I am pleased to update you on the Data Science Boot Camp we ran at the Ted Rogers School of Management at Ryerson University in Toronto in collaboration with IBM’s www.BigDataUniversity.com. The nine-week Boot Camp concluded on July 15.

We received a total of 1,137 registrations, and attendance ranged between 100 and 150 participants each week.

I have made the resources (software code, PowerPoint slides, etc.) available online at https://sites.google.com/site/statsr4us/workshops/datascience. We recorded 24 hours of video, which will be online soon.

I restricted the hands-on training to R, hence the Boot Camp serves as an introduction to analytics with R. You are welcome to share these resources.

A breakdown of the weekly schedule is provided in the following hyperlinked list:


Thursday, May 12, 2016

Data Science Boot Camp

If you live in or near Toronto, are interested in learning about data science, and can spare Friday afternoons, then you are in luck. I am offering a Data Science Boot Camp at Ryerson University in collaboration with IBM's BigDataUniversity.com.


The Boot Camp is largely based on the contents of my recently published book, Getting Started with Data Science: Making Sense of Data with Analytics.

Logistical details:

When: Fridays (2:00 - 5:00 pm)
Where: 55 Dundas Street West, Toronto, 9th floor, Room 3-109
     Ted Rogers School of Management, Ryerson University
Cost: Free (Courtesy Ryerson University & BigDataUniversity)
Starting on: May 13 for introductions. Actual launch is on May 20.
Spaces: I'd like to cap enrollment at 15.
Registration: Email us or use Registration Form at BigDataUniversity.
Prerequisites: Curiosity, high-school math, prescribed book, a laptop computer, and willingness to learn R.

BigDataUniversity will live stream the sessions for those who are unable to attend, but interested in the topic.

Tentative Schedule

May 13, 2016 - Introductions, software details, and logistical details.
Week 1 - Taking the first step
  • Detailed hands-on examples of analytics to understand what you will be able to accomplish by the end of the boot camp.
Week 2 - Data: Its shapes, sizes, and formats
Week 3 - Regression: The tool that fixes everything, or almost everything.
  • Applied analytics with teaching evaluations. 
  • Do good-looking instructors get higher teaching evaluations?
Week 4 - Correlations, causations, and manufactured facts
Week 5 - Aerobics with data: Taming your data to meet your needs.
Week 6 - Time is money: Analytics with time series data.
Week 7 - Case Study 1:
  • Are women who lack health insurance from their spouse’s employer more likely to work full-time?
Week 8 - Case Study 2: 
  • Do higher taxes result in lower cigarette sales? Did Land Transfer Tax impact housing sales in Toronto?
Week 9 - Case Study 3: 
  • To smoke or not to smoke: that is the question.
Week 10 - Case Study 4:
  • Is space the new frontier? Map it to know it.

Wednesday, January 13, 2016

Getting Started with Data Science: Storytelling with Data

Earlier this month, IBM Press and Pearson published my book titled Getting Started with Data Science: Making Sense of Data with Analytics. You can download sample pages, including a complete chapter. There are 104 pages in the sample. You can also watch a brief interview about the book recorded earlier at the IBM Insight2015 Conference.

The very purpose of authoring this book was to rethink the way we have been teaching statistics and analytics to students and practitioners. It is no secret that most students required to take the mandatory stats course dislike it. I believe this has more to do with the way we have been teaching the subject than with the aptitude of our students. Furthermore, I believe there is a greater opportunity to equip students with the skills needed in a world awash with data, where competing on analytics defines the real competitive advantage.

No wonder, the latest issue of the leading publication on the subject, The American Statistician, is dedicated to reimagining how statistics should be taught in the undergraduate curriculum. The editors noted:
“We hope that this collection of articles as well as the online discussion provide useful fodder for further review, assessment, and continuous improvement of the undergraduate statistics curriculum that will allow the next generation to take a leadership role by making decisions using data in the increasingly complex world that they will inhabit.”
I am confident that my book will do its small part in equipping the next generation of students with the kind of skills needed to succeed in a data-centric world. For one, I have taken a storytelling approach to statistics. This book reinforces the point that data science and analytics training should be applied rather than theoretical, and the ultimate purpose of producing or consuming statistical analysis is to tell fascinating stories from it. Therefore, the book opens with the chapter titled, The Bazaar of Storytellers.

Who is this book for?

While the world is awash with large volumes of data, inexpensive computing power, and vast amounts of digital storage, the skilled workforce capable of analyzing data and interpreting it is in short supply. A 2011 McKinsey Global Institute report suggests that “the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.”


Getting Started with Data Science (GSDS) is a purpose-written book targeted at professionals who are tasked with analytics but do not have the comfort level needed to be proficient in data-driven analytics. GSDS appeals to students who are frustrated with the impractical nature of the prescribed textbooks and are looking for an affordable text to serve as a long-term reference. GSDS embraces the 24-7 streaming of data and is structured for users who have access to data and software of their choice, but do not know what methods to use, how to interpret the results, and, most importantly, how to communicate findings as reports and presentations in print or online.

GSDS is a resource for millions employed in knowledge-driven industries where workers are increasingly expected to facilitate smart decision-making using up-to-date information that sometimes takes the form of continuously updating data.

At the same time, the learning-by-doing approach in the book is equally suited for independent study by senior undergraduate and graduate students who are expected to conduct independent research for their coursework or dissertations.

Praise for the book

I am also pleased to share with you the praise for my book by Dr. Munir Sheikh, Canada’s former chief statistician:
“The power of data, evidence, and analytics in improving decision-making for individuals, businesses, and governments is well known and well documented. However, there is a huge gap in the availability of material for those who should use data, evidence, and analytics but do not know how. This fascinating book plugs this gap, and I highly recommend it to those who know this field and those who want to learn.”
— Munir A. Sheikh, Ph.D., Distinguished Fellow and Adjunct Professor at Queen’s University

Tom Davenport, author of the bestselling books Competing on Analytics and Big Data @ Work, has the following to say about my book:
“A coauthor and I once wrote that data scientists held ‘the sexiest job of the 21st century.’ This was not because of their inherent sex appeal, but because of their scarcity and value to organizations. This book may reduce the scarcity of data scientists, but it will certainly increase their value. It teaches many things, but most importantly it teaches how to tell a story with data.”
—Thomas H. Davenport, Distinguished Professor, Babson College; Research Fellow, MIT.

Dr. Patrick Surry, Chief Data Scientist at www.Hopper.com, had the following to say:
“This book addresses the key challenge facing data science today, that of bridging the gap between analytics and business value. Too many writers dive immediately into the details of specific statistical methods or technologies, without focusing on this bigger picture. In contrast, Haider identifies the central role of narrative in delivering real value from big data.

“The successful data scientist has the ability to translate between business goals and statistical approaches, identify appropriate deliverables, and communicate them in a compelling and comprehensible way that drives meaningful action. To paraphrase Tukey, ‘Far better an approximate answer to the right question, than an exact answer to a wrong one.’ Haider’s book never loses sight of this central tenet and uses many real-world examples to guide the reader through the broad range of skills, techniques, and tools needed to succeed in practical data-science.

“Highly recommended to anyone looking to get started or broaden their skillset in this fast-growing field.”
And finally, Professor Atif Mian, author of the best-selling book House of Debt, offered the following assessment:
“We have produced more data in the last two years than all of human history combined. Whether you are in business, government, academia, or journalism, the future belongs to those who can analyze these data intelligently. This book is a superb introduction to data analytics, a must-read for anyone contemplating how to integrate big data into their everyday decision making.”
— Professor Atif Mian, Theodore A. Wells ’29 Professor of Economics and Public Affairs,
Princeton University; Director of the Julis-Rabinowitz Center for Public Policy and Finance at the Woodrow Wilson School.