Wednesday, September 14, 2016

Data Science 101, now online

We are delighted to note that IBM's BigDataUniversity.com has launched the quintessential introductory course on data science, aptly named Data Science 101.

The target audience for the course is the uninitiated cohort that is curious about data science and would like to take the first steps toward a career in data and analytics. Needless to say, the course is for absolute beginners.

To get a taste of the course, watch the following video, "What is Data Science?"


Here is the curriculum:

  • Module 1 - Defining Data Science
    • What is data science?
    • There are many paths to data science
    • Any advice for a new data scientist?
    • What is the cloud?
    • "Data Science: The Sexiest Job in the 21st Century"
  • Module 2 - What do data science people do?
    • A day in the life of a data science person
    • R versus Python?
    • Data science tools and technology
    • "Regression"
  • Module 3 - Data Science in Business
    • How should companies get started in data science?
    • R versus Python
    • Tips for recruiting data science people
    • "The Final Deliverable"
  • Module 4 - Use Cases for Data Science
    • Applications for data science
    • "The Report Structure"
  • Module 5 - Data Science People
    • Things data science people say
    • "What Makes Someone a Data Scientist?"
Want to learn more about IBM's Big Data University? Click HERE.

Thursday, September 1, 2016

The X-Factors: Where 0 means 1

Hadley Wickham, in a recent blog post, mentioned that "Factors have a bad rap in R because they often turn up when you don’t want them." I believe factors are an even bigger concern. They not only turn up where you don't want them, but they also turn things around when you don't want them to.

Consider the following example, where I present a data set with two variables: x and y. I record age in years as 'y' and gender as a binary (0/1) variable 'x', where 1 represents males.

I compute the means for the two variables as follows:
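
A minimal sketch of this step, assuming a small hypothetical data frame named df (a stand-in for illustration, not the data set used in this post, so its means differ from the ones reported here):

```r
# Hypothetical toy data; NOT the data set used in the post.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),       # gender: 1 = male, 0 = female
  y = c(30, 40, 50, 35, 45)   # age in years
)

# The mean of a 0/1 variable is the share of ones in the sample.
means <- colMeans(df)
print(means)
```

With a 0/1 coding, the mean of x reads directly as the proportion of males.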


The average age is 43.6 years, and the mean of x, 0.454, indicates that 45.4% of the sample is male. So far so good.

Now let's see what happens when I convert x into a factor variable using the following syntax:
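
One way to write such a conversion is sketched below. The data frame df is a hypothetical stand-in, and the call to factor() with explicit levels and labels is an assumption about how the conversion could be done, not necessarily the exact syntax used in this post.

```r
# Hypothetical toy data; NOT the data set used in the post.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),       # gender: 1 = male, 0 = female
  y = c(30, 40, 50, 35, 45)   # age in years
)

# Map level 0 to the label "female" and level 1 to the label "male".
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

print(levels(df$male))
print(table(df$male))
```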


The above code adds a new variable male to the data set, and assigns labels female and male to the categories 0 and 1 respectively.

I compute the average age for males and females as follows:
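
A group-wise mean like this can be sketched with tapply(). The data frame below is hypothetical, and tapply() is only one of several ways to do it (aggregate() would work as well).

```r
# Hypothetical toy data; NOT the data set used in the post.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),       # gender: 1 = male, 0 = female
  y = c(30, 40, 50, 35, 45)   # age in years
)
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# Mean age within each level of the factor.
age_by_gender <- tapply(df$y, df$male, mean)
print(age_by_gender)
```

Here the factor is doing what it is designed for: defining groups.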


See what happens when I try to compute the mean for the variable 'male'.
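
The behaviour can be reproduced on a hypothetical toy data frame. In this sketch the warning is captured in a variable (warn_msg, a name chosen here for illustration) so that it can be printed alongside the result.

```r
# Hypothetical toy data; NOT the data set used in the post.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),       # gender: 1 = male, 0 = female
  y = c(30, 40, 50, 35, 45)   # age in years
)
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# mean() on a factor does not stop with an error; it warns and returns NA.
warn_msg <- NULL
res <- withCallingHandlers(
  mean(df$male),
  warning = function(w) {
    warn_msg <<- conditionMessage(w)   # keep the warning text
    invokeRestart("muffleWarning")     # and carry on
  }
)
print(res)
print(warn_msg)
```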


Once you turn a variable into a factor, you can't compute statistics such as the mean or standard deviation on it directly. To do so, you need to convert the factor variable to numeric. I create a new variable, gender, that converts the male variable to a numeric one.

I recompute the means below. 
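
A sketch of the conversion and the recomputed means, again on a hypothetical toy data frame. It assumes the conversion is done with as.numeric(), which returns the factor's internal integer codes.

```r
# Hypothetical toy data; NOT the data set used in the post.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),       # gender: 1 = male, 0 = female
  y = c(30, 40, 50, 35, 45)   # age in years
)
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# as.numeric() on a factor returns its integer codes (1, 2, ...),
# not the original 0/1 values.
df$gender <- as.numeric(df$male)

means <- colMeans(df[, c("x", "y", "gender")])
print(means)
```

In this toy example the mean of gender comes out exactly one higher than the mean of x.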


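Note that the average for gender is 1.45 and not 0.45. Why? R stores a factor as integer codes that start at 1, so converting the factor to numeric returns those codes rather than the original values: the zeros became ones and the ones became twos. Let's look at the data set below: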
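
The shift is easy to see by cross-tabulating the original variable against the converted one. The sketch below uses a hypothetical toy data frame and also shows two possible ways to get the 0/1 coding back; both are suggestions, not steps taken in this post.

```r
# Hypothetical toy data; NOT the data set used in the post.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),       # gender: 1 = male, 0 = female
  y = c(30, 40, 50, 35, 45)   # age in years
)
df$male   <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))
df$gender <- as.numeric(df$male)

# Every 0 in x shows up as 1 in gender, and every 1 as 2.
print(table(x = df$x, gender = df$gender))

# Two possible ways to recover the 0/1 coding from the factor.
recovered_a <- as.numeric(df$male) - 1
recovered_b <- as.integer(df$male == "male")
print(cbind(df, recovered_a, recovered_b))
```

Subtracting 1 works only because the levels were declared in the order 0, 1. The comparison against the label "male" does not depend on that ordering.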


Several algorithms in R expect a binary variable to be coded as 0/1. If this condition is not satisfied, the command returns an error. For instance, when I try to estimate the logit model with gender as the dependent variable and y as the explanatory variable, R generates the following error:
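
A sketch that triggers this kind of failure on a hypothetical toy data frame, assuming the logit model is fitted with glm() and the binomial family. The error is caught with tryCatch() so the message can be inspected. It is not claimed to be the exact output from this post.

```r
# Hypothetical toy data; NOT the data set used in the post.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),       # gender: 1 = male, 0 = female
  y = c(30, 40, 50, 35, 45)   # age in years
)
df$male   <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))
df$gender <- as.numeric(df$male)   # coded 1/2 instead of 0/1

# A numeric response outside [0, 1] is rejected by the binomial family.
err <- tryCatch(
  glm(gender ~ y, data = df, family = binomial(link = "logit")),
  error = function(e) conditionMessage(e)
)
print(err)
```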

Factor or no factor, I would prefer my zeros to stay as zeros!