Showing posts with label big data. Show all posts

Monday, May 20, 2019

Modern Data Science with R: A review


Some say data is the new oil. Others equate its worth to water. And then there are those who believe that data scientists will be (in fact, they already are) among the most sought-after workers in knowledge economies.

Millions of data-centric jobs require millions of trained data scientists. However, the installed capacity of graduate and undergraduate programs in data science is nowhere near enough to meet this demand for many years to come.

So how do we produce data scientists?

Given the enormous demand for data scientists and the fixed supply from higher education institutions, one will likely have to look beyond colleges and universities to train the large number of data scientists the marketplace desires.

Getting trained on the job is one possible route. This will require repurposing the existing workforce. To prepare the current workforce for data science, one needs training manuals. One such manual is Modern Data Science with R (MDSR).

Published by CRC Press (Taylor and Francis Group) and authored by three leading academics, Professors Baumer, Kaplan, and Horton, MDSR is the missing manual for data science with R. The book is as relevant to data science programs in higher ed as it is to practitioners who would like to embark on a career in data science or get a taste of an aspect of data science that they have not yet explored.

As the book’s name suggests, the text is based on R, one of the most popular and versatile computing platforms. R is free software under continuous development by thousands of volunteers. In addition to base R, which comes bundled with thousands of commands and functions, user-written packages, whose number has exceeded 14,000 (as of May 2019), further expand the universe of features, making R perhaps the most diverse computing platform.

Despite the immense popularity of data science, only a handful of titles focus exclusively on the topic. Hadley Wickham’s R for Data Science and Paul Teetor’s R Cookbook are two other worthy texts. MDSR is unique in that it serves as an introduction to a whole host of analytic techniques that are seldom covered in one title.

In the following paragraphs, I’ll discuss the salient features of the textbook. I begin with my favourite attribute of the book: its organization. Instead of muddling with theories and philosophies, the book gets straight to business and starts the conversation with data visualization. A graphic is worth a thousand words, and MDSR is proof of it.

And since Hadley Wickham’s influence on data science is ubiquitous, MDSR also embraces ggplot2, Wickham’s implementation of the Grammar of Graphics in R and one of the most popular R packages.

Another avenue where Wickham’s influence is widely felt is data wrangling. A suite of R packages bundled under the broader rubric of Tidyverse is influencing how data scientists manipulate small and big data. Chapter 4 in MDSR is perhaps one of the best and most succinct introductions to data wrangling with R and Tidyverse. From the simplest to more advanced examples, MDSR equips the beginner with the basics and the advanced user with new ways to think about analyzing data.

A key feature of MDSR is that it’s not another book on statistics or econometrics with R. Yours truly is guilty of authoring one such book. Instead, MDSR is a book focused squarely on data manipulation. The treatment of statistical topics is not absent from the book; however, it’s not the book’s focus. It is for this reason that the discussion of regression models is in the appendices.

But make no mistake, even when statistics is not the focus, MDSR offers sound advice on the practice of statistics. Section 7.7, The perils of p-values, warns novice statisticians against becoming unsuspecting victims of hypothesis testing.

The book’s distinguishing feature remains the diversity of data science challenges it addresses. For instance, in addition to data visualization, the book offers an introduction to interactive data graphics and dynamic data visualization.

At the same time, the book covers other diverse topics, such as database management with SQL, working with spatial data, analyzing text-based (unstructured) data, and the analysis of networks. A discussion about ethics in data science is undoubtedly a welcome feature in the book.

The book is punctuated with hundreds of useful, hands-on data science examples and exercises, providing ample opportunities to put concepts into practice. The book’s accompanying website offers additional resources and code examples. At the time of this review, not all code was available for download.

Also, while I was able to reproduce the more straightforward examples, I ran into trouble with the complex ones. For instance, I could not generate the advanced spatial maps showing flight origins and destinations.

My recommendation to the authors is to maintain an active supporting website because R packages are known to evolve, and some functionality may change or disappear over time. For instance, the mapping functions that are part of the ggmap package now require a Google Maps API key, or else the maps will not display. This change likely occurred after the book was published.

In summary, for aspiring and experienced data scientists, Modern Data Science with R is a book that deserves a place in their personal libraries.

Murtaza Haider lives in Toronto and teaches in the Department of Real Estate Management at Ryerson University. He is the author of Getting Started with Data Science: Making Sense of Data with Analytics, which was published by the IBM Press/Pearson.

Monday, August 22, 2016

Five Questions about Data Science

---

Recently, we were able to ask five questions of Murtaza Haider, about the new book from IBM Press called “Getting Started with Data Science: Making Sense of Data with Analytics.” Below, the author talks about the benefits of data science in today’s professional world.

Getting Started with Data Science

1. What are some examples of data science altering or impacting traditional professional roles already?

Only a few years ago there did not exist a job with the title Chief Data Scientist. But that was then. Small and large corporations, and increasingly government agencies, are putting together teams of data scientists and analysts under the leadership of Chief Data Scientists. Even the White House has a Chief Data Scientist position, currently held by Dr. DJ Patil.

The traditional role for those who analyzed data was that of a computer programmer or a statistician. In the past, firms collected large amounts of data to archive rather than to subject it to analytics to assist with smart decision-making. Companies did not see value in turning data into insights and instead relied on the gut feeling of managers and anecdotal evidence to make decisions.

Big data and analytics have alerted businesses and governments to the latent potential of turning bits and bytes into profits. To enable this transformation, hundreds of thousands of data scientists and analysts are needed. Recent reports suggest that the shortage of such professionals will run into the millions. No wonder we see hundreds of postings for data scientists on LinkedIn.

As businesses increasingly depend upon analytics-driven decision-making, data scientists and analysts are becoming front-office superstars, which is quite a change from their past as back-office workers.

2. What steps can a professional take today to learn how and why to implement data science into their current role?

Sooner rather than later, workers will find their managers asking them to assume additional responsibilities that involve dealing with data, and either generating or consuming analytics. Smart professionals who are uninitiated in data science would therefore do well to address this shortcoming in their portfolio proactively by acquiring skills in data science and analytics. Fortunately, in a world awash with data, the opportunities to acquire analytic skills are also ubiquitous.

For starters, professionals should consider enrolling in open online courses offered by the likes of Coursera and BigDataUniversity.com. These platforms offer a wide variety of training opportunities for beginners and advanced users of data and analytics. At the same time, most of these offerings are free.

For those professionals who would like to pursue a more structured approach, I suggest that they consider continuing education programs offered by the local universities focusing on data and analytics. While working full-time, the professionals can take part-time courses in data science to fill the gap in their learning and be ready to embrace impending change in their roles.

3. Do you need programming experience to get started in data? What kind of methods and techniques can you utilize in a program more commonly used, such as Excel?

Computer programming skills are a definite plus for data scientists, but they are certainly not a limiting factor that would prevent those trained in other disciplines from joining the world of data scientists. In my book, Getting Started with Data Science, I mentioned examples of individuals who took short courses in data science and programming after graduating from non-empirical disciplines, and subsequently were hired in data scientist roles that paid lucrative salaries.

The choice of analytics tools depends largely on the discipline and the type of organization you are currently working for or intend to work for in the future. If you intend to work for corporations that generate truly big data, such as telecom and Internet-based establishments, you need to be proficient in big data tools, such as Spark and Hadoop. If you would like to be employed in the industry that tracks social media, you would require skills in natural language processing and proficiency in languages such as Python. If you happen to be interested in a traditional market research firm, you need proficiency in analytics software, such as SPSS and R.

If your focus is on small and medium-sized enterprises, proficiency in Excel could be a great asset, allowing you to deploy its analytics capabilities, such as Pivot Tables, to work with small data sets.

A successful data scientist is one who knows some programming, has a basic understanding of statistical principles, possesses a curious mind, and is capable of telling great stories. I argue that without storytelling capabilities, a data scientist will be limited in his or her ability to become a leader in the field.

4. How do you see data science affecting education and training moving forward? What benefits will it bring to learning at all levels?

Schools, colleges, universities and others involved in education and learning are putting big data and analytics to good use. Universities are crunching large amounts of data to determine what gaps in learning at the high school level act as impediments to success in the future. Schools are improving not just curriculum, but also other strategies to improve learning outcomes. For instance, research in India using large amounts of data showed that when children in low-income communities were offered free meals at school, their dropout rates declined and their academic achievements improved.

Big data and analytics provide instructors and administrators the opportunity to test their hypotheses about what works and what doesn’t in learning, and replace anecdotes with hard evidence to improve pedagogy and learning. Learning has taken a new shape and form with open online courses in all disciplines. These transformative changes in learning have been enabled by advances in information and communication technologies, and the ability to store massive amounts of data.

5. Do you think that modern governments and societies are prepared for the changes that big data and data science might bring to the world?

Change is inevitable. Whatever modern governments and societies may prefer, they will have to embrace change. Fortunately, smart governments and societies have already embraced data-driven decision-making and evidence-based planning. Governments in developing countries are already using data and analytics to devise effective poverty-reducing strategies. Municipal governments in developed economies are using data and advanced analytics to find solutions to traffic congestion. Research in health and well-being is leveraging big data to discover new medicines and cures for illnesses that challenge us all.

As societies embrace data and analytics as tools to engineer prosperity and well-being, our collective abilities to achieve a better tomorrow will be further enhanced.

Wednesday, August 10, 2016

So you want to be a data scientist



The New York Times made it look so easy. Take a few courses in data science and a web-based startup will readily pay top dollar for your newly acquired skills.

Since the McKinsey Global Institute reported on the impending shortage of data crunchers, wannabe data scientists have been searching for learning opportunities in big data analytics. Newspaper coverage suggests that even with limited previous exposure to empirics, one may enroll in MOOCs or join programming boot camps to establish one's bona fides.

In a recent blog post on Forbes.com, Meta S. Brown, the author of Data Mining for Dummies, gave four reasons not to get an advanced degree in data science. I, on the other hand, believe that a structured learning environment is exactly what many need to enable the career change they have contemplated for years but have not acted on.

It all depends on what kind of a learner you are. If you are a disciplined, self-motivated, self-actuated individual, you can pick up the skills by attending MOOCs or participating in coding boot camps.

But if you are like the rest of us, who once enrolled in a free online course, but didn't complete it, you need some structure. A degree or a certificate in data science or business analytics is exactly what you need to upgrade your skills and be part of the network that will help you reorient your career.

In my book, Getting Started with Data Science, I mentioned Paul Minton, who was making $20,000 serving tables in New York. However, a three-month programming course at the Zipfian Academy turned his life around. He earned over $100,000 in 2014 as a data scientist for a web startup in San Francisco. "Six figures, right off the bat ... To me, it was astonishing," he told The New York Times.

When aspiring data scientists think of a career in the 'glamorous' world of big data and analytics, they think of Mr. Minton. His story, though a bit Cinderella-ish, is true, but rare. He works for Change.org! However, not everyone should expect a similar outcome. In addition to good fortune, Mr. Minton had majored in math in his undergraduate training, and we all know that math helps. It would be unwise, however, to assume that with almost no empirical background, one can master the complex world of data and algorithms in a matter of a few weeks and be gainfully employed.

While speaking at meet-ups organized by IBM's BigDataUniversity, I encounter dozens of enthusiasts who are keen to start training in data science but do not know where to begin. I advise them to build on their core competencies and domain knowledge. For instance, if you studied journalism or creative writing as an undergraduate, you might want to learn how to analyze socioeconomic data instead of trying to set up Hadoop clusters, a big data task best left to computer scientists and engineers.

If you are a disciplined learner, you can explore data science training offered as MOOCs. Coursera, one of the largest MOOC platforms, listed several data science courses among the top 10 most popular courses in 2015. IBM's Big Data University (BDU) is another platform dedicated to promoting training in data science and analytics. Not only does BDU offer online learning resources similar to those of other platforms, it also offers cloud-based resources for hands-on training through the Data Scientist Workbench.

The Workbench provides state-of-the-art computing solutions for regular-sized data. These include R, Python, and OpenRefine. To wrangle big data, the Workbench offers Hadoop and Spark-based solutions. Such coupling of computing infrastructure with online learning resources frees new learners from concerns about installing and maintaining software and clustering hardware.

Learners who prefer a structured learning environment also have several options. They can register for courses or certificates offered by universities' continuing education faculties, enroll in an online graduate degree in data science, or take the more traditional approach of enrolling in a full- or part-time Master's program.

A good place to search for learning opportunities is the KDNuggets website, which maintains detailed lists of post-graduate programs in data science, including full-time, part-time, and online master's degrees and other certifications.

Once you have earned some credentials, you still have to prove your worth to future employers. If you are making a switch from another career, your experience may not be of much use in your pursuits in the data-centric world. My advice to the novice data scientists lacking experience is to ask the potential employer not necessarily for a job, but instead for a data set and a puzzle. If you can solve a data-oriented problem for a firm as part of the vetting process, you can overcome the shortcomings in your résumé.

Those who are still on the fence about whether to take the plunge into the world of big data and analytics should know that the demand for data scientists far exceeds the capacity of the universities and colleges to produce them. This is unlikely to change anytime soon. Act now and embrace data.

Thursday, May 12, 2016

Data Science Boot Camp

If you live in or near Toronto, are interested in learning about data science, and can spare Friday afternoons, then you are in luck. I am offering a Data Science Boot Camp at Ryerson University in collaboration with IBM's BigDataUniversity.com.


The Boot Camp is largely based on the contents of my recently published book, Getting Started with Data Science: Making Sense of Data with Analytics. You can read more about the book by Clicking HERE.

Logistical details:

When: Fridays (2:00 - 5:00 pm)
Where: 55 Dundas Street West, Toronto, 9th floor, Room 3-109
     Ted Rogers School of Management, Ryerson University
Cost: Free (Courtesy Ryerson University & BigDataUniversity)
Starting on: May 13 for introductions. Actual launch is on May 20.
Spaces: I'd like to cap enrollment at 15.
Registration: Email us or use Registration Form at BigDataUniversity.
Prerequisites: Curiosity, high-school math, prescribed book, a laptop computer, and willingness to learn R.

BigDataUniversity will live stream the sessions for those who are unable to attend, but interested in the topic.

Tentative Schedule

May 13, 2016 - Introductions, software details, and logistical details.
Week 1 - Taking the first step
  • Detailed hands-on examples of analytics to understand what you will be able to accomplish by the end of the boot camp.
Week 2 - Data: Its shapes, sizes, and formats
Week 3 - Regression: The tool that fixes everything, or almost everything.
  • Applied analytics with teaching evaluations. 
  • Do good-looking instructors get higher teaching evaluations?
Week 4 - Correlations, causations, and manufactured facts
Week 5 - Aerobics with data: Taming your data to meet your needs.
Week 6 - Time is money: Analytics with time series data.
Week 7 - Case study 1: 
  • Are women who lack health insurance from their spouse’s employer more likely to work full-time?
Week 8 - Case study 2:
  • Do higher taxes result in lower cigarette sales? Did Land Transfer Tax impact housing sales in Toronto?
Week 9 - Case study 3:
  • To smoke or not to smoke: that is the question.
Week 10 - Case study 4: 
  • Is space the new frontier? Map it to know it.

Friday, October 30, 2015

Curious about big data in Montreal?

Are you in Montreal and curious about big data? Well, here is your chance to attend a session on the topic at Concordia University on Tuesday, Nov. 3 at 6:00 pm.

www.BigDataUniversity.com, an IBM-led initiative, is running meetups across North America to create awareness about, and training in, big data analytics.

BigDataUniversity runs MOOCs and, through its online Data Scientist Workbench, provides access to Python, R, and even Spark. Also, you can learn about Watson Analytics and see how you can work with the state of the art in computing.

Further details are available at:

Getting started with Data Science and Introduction to Watson Analytics

http://www.meetup.com/YUL-Social-Mobile-Analytics-Cloud-Meetup/

When: Tuesday, November 3rd at 6-9 PM

Where: H1269, 12th floor of the Hall Bldg 
(1455, blvd. De Maisonneuve ouest - Metro Guy-Concordia)