Thursday, September 1, 2016

The X-Factors: Where 0 means 1

Hadley Wickham mentioned in a recent blog post that "Factors have a bad rap in R because they often turn up when you don’t want them." I believe factors are an even bigger concern. They not only turn up where you don't want them, but they also turn things around when you don't want them to.

Consider the following example where I present a data set with two variables: x and y. I represent age in years as 'y' and gender as a binary (0/1) variable as 'x', where 1 represents males.

I compute the means for the two variables as follows:
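A minimal sketch of this step, assuming a small made-up data frame `df` (illustrative values only, not the data behind the figures quoted here):

```r
# Hypothetical toy data, invented for illustration
df <- data.frame(
  x = c(1, 0, 1, 0, 1, 0, 0, 1),          # gender: 1 = male, 0 = female
  y = c(23, 35, 47, 51, 62, 29, 44, 61)   # age in years
)

# For a 0/1 variable, the mean is the share of ones
mean_x <- mean(df$x)
mean_y <- mean(df$y)
print(c(x = mean_x, y = mean_y))
```

Because the values are invented, the printed means differ from the ones discussed in the text.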

The average age is 43.6 years, and the mean of x, 0.454, indicates that 45.4% of the sample is male. So far so good.

Now let's see what happens when I convert x into a factor variable using the following syntax:

The above code adds a new variable male to the data set, and assigns labels female and male to the categories 0 and 1 respectively.
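One way such a conversion could be written, sketched here on a made-up data frame `df` (the values are hypothetical; only the variable names x, y and male come from the text):

```r
# Hypothetical toy data, invented for illustration
df <- data.frame(
  x = c(1, 0, 1, 0, 1, 0, 0, 1),          # gender: 1 = male, 0 = female
  y = c(23, 35, 47, 51, 62, 29, 44, 61)   # age in years
)

# Map the codes 0 and 1 to the labels "female" and "male"
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

print(levels(df$male))
print(table(df$x, df$male))   # cross-check codes against labels
```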

I compute the average age for males and females as follows:
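A group-wise mean like this can be sketched with `tapply()`, again assuming a made-up data frame rather than the data used in this post:

```r
# Hypothetical toy data, invented for illustration
df <- data.frame(
  x = c(1, 0, 1, 0, 1, 0, 0, 1),          # gender: 1 = male, 0 = female
  y = c(23, 35, 47, 51, 62, 29, 44, 61)   # age in years
)
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# Mean of y within each level of the factor
age_by_group <- tapply(df$y, df$male, mean)
print(age_by_group)
```

`aggregate(y ~ male, data = df, FUN = mean)` would be an equally reasonable choice; `tapply()` is used here only because it returns a compact named result.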

See what happens when I try to compute the mean for the variable 'male'.
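The behavior can be reproduced on any factor. A small sketch with made-up values:

```r
# Hypothetical toy data, invented for illustration
df <- data.frame(x = c(1, 0, 1, 0, 1, 0, 0, 1))
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# mean() does not work on a factor: it returns NA and issues a warning
# that the argument is not numeric or logical
m <- mean(df$male)
print(m)
```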

Once a variable is a factor, you can't compute statistics such as the mean or standard deviation on it directly. To do so, you need to convert the factor to numeric. I create a new variable, gender, by converting the male variable to numeric.
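A sketch of that conversion, assuming it is done with `as.numeric()` and using made-up values:

```r
# Hypothetical toy data, invented for illustration
df <- data.frame(x = c(1, 0, 1, 0, 1, 0, 0, 1))
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# as.numeric() on a factor returns its internal integer codes,
# not the values the factor was built from
df$gender <- as.numeric(df$male)
str(df$gender)
```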

I recompute the means below. 

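Note that the average for gender is 1.45 and not 0.45. Why? A factor stores its levels internally as integer codes starting at 1, so converting it to numeric returns 1 for the first level (female, originally 0) and 2 for the second level (male, originally 1). In effect, zeros turned into ones and ones into twos. Let's look at the data set below: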
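The shift is easy to see by placing the original and converted columns side by side. A sketch on made-up values (not the data from this post):

```r
# Hypothetical toy data, invented for illustration
df <- data.frame(x = c(1, 0, 1, 0, 1, 0, 0, 1))
df$male   <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))
df$gender <- as.numeric(df$male)   # internal codes: female = 1, male = 2

print(head(df))
print(table(original = df$x, converted = df$gender))

# The converted mean is exactly one higher than the original mean
mean_original  <- mean(df$x)
mean_converted <- mean(df$gender)
print(c(original = mean_original, converted = mean_converted))
```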

Several algorithms in R expect a binary variable to be coded 0/1. If this condition is not satisfied, the command returns an error. For instance, when I try to estimate the logit model with gender as the dependent variable and y as the explanatory variable, R generates the following error:
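A sketch of how this failure can be reproduced, assuming the logit model is fitted with `glm()` and the binomial family, with made-up data and age `y` as the explanatory variable:

```r
# Hypothetical toy data, invented for illustration
df <- data.frame(
  x = c(1, 0, 1, 0, 1, 0, 0, 1),          # gender: 1 = male, 0 = female
  y = c(23, 35, 47, 51, 62, 29, 44, 61)   # age in years
)
df$male   <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))
df$gender <- as.numeric(df$male)          # coded 1/2, no longer 0/1

# A numeric response outside [0, 1] is rejected by the binomial family;
# tryCatch() captures the error message instead of stopping the script
fit_error <- tryCatch(
  glm(gender ~ y, family = binomial, data = df),
  error = function(e) conditionMessage(e)
)
print(fit_error)

# The same model with the original 0/1 coding fits without error
fit_ok <- glm(x ~ y, family = binomial, data = df)
print(coef(fit_ok))
```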

Factor or no factor, I would prefer my zeros to stay as zeros!


  1. Hello.
    And how do you avoid this problem? How do you force R to convert females to 0 and males to 1?
    Maybe with levels?

    1. You need to do as.numeric(as.character(x)) instead of just as.numeric(x)

  2. Nice post. Obviously you can use as.numeric(as.character(x)) as pointed out above, but I think this behavior is particularly a problem when you use functions like summary, etc. Especially when you're trying to teach the language to beginners, it just drives them insane. At the same time it means that they have to identify and work on the categorical variables of interest first. That means getting acquainted with the data, which is always a plus. Most students with a decent understanding of statistics would also work out that 1.45 is meaningless. However, overall I do think an alternative base summary function could be created which gets around these problems (and maybe reports S.D. for numeric variables, as you mentioned in another post).

    What the post maybe should mention is that this complication comes with some benefits. R allows you to simply ask for mean(data$sex=="male"). I would call it sex, because if the variable were called male I would just keep a binary 0/1 variable, not a factor (but this is beside the point you're trying to make).

    The ability to use actual character values in a mean query, not to mention in a regression, is a real advantage compared to other software. It means I don't have to use meaningless codes such as 0,1,2 for categorical variables and instead I can keep the actual "human" meaning inside the variable.
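The suggestions in this thread can be compared in a short sketch. The factor `sex` and its values are made up for illustration. Note that as.numeric(as.character()) recovers the original numbers only when the factor's labels are themselves numeric strings such as "0" and "1"; with labels like "female" and "male" it would produce NA, so a comparison against the label is used instead.

```r
# Hypothetical toy values, invented for illustration
codes <- c(1, 0, 1, 0, 1, 0, 0, 1)
sex   <- factor(codes, levels = c(0, 1), labels = c("female", "male"))

# Share of males, computed directly from the labels
share_male <- mean(sex == "male")

# Two ways to get back a 0/1 variable from a two-level labelled factor
male01_a <- as.integer(sex == "male")
male01_b <- as.numeric(sex) - 1        # relies on "female" being the first level

# as.numeric(as.character()) works when the labels are numeric strings
f <- factor(codes)                     # levels are "0" and "1"
recovered <- as.numeric(as.character(f))

print(share_male)
print(male01_a)
print(recovered)
```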