Monday, October 8, 2018

A question and an answer about recoding several factors simultaneously in R

Data manipulation is a breeze with amazing packages like plyr and dplyr. Recoding factors, which can be a daunting task for variables with many categories, is easily accomplished with these packages. However, it is important for those learning data science to understand how base R works.

In this regard, I seek help from R specialists about recoding factors using base R. My question is why one notation for recoding factors works while the other doesn’t. I’m sure that for R enthusiasts the answer and solution are straightforward. So, here’s the question.

In the following code, I generate a vector with six categories (a through f) and 300 observations. I convert the vector to a factor and tabulate it.
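The listing itself is not reproduced here. A minimal sketch of what the paragraph describes might look like the following; the use of sample() and letters, the seed, and the explicit levels argument are assumptions added for illustration.

```r
set.seed(42)  # assumption: a seed, so the sketch is reproducible

# Draw 300 observations from the six categories a..f
x <- sample(letters[1:6], 300, replace = TRUE)

# Convert to a factor; the levels are listed explicitly so there are always six
x <- factor(x, levels = letters[1:6])

table(x)              # counts by level label
table(as.numeric(x))  # the same counts by internal integer code (1..6)
```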



Note that by using as.numeric(), I can see the internal integer codes behind the character labels. Let’s say I would like to recode categories a and f as missing. I can accomplish this with the following code.
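The listing is missing here as well. A sketch of a position-based recode to missing, under the same assumptions as before (the variable n_target is a hypothetical helper added only to check the result):

```r
set.seed(42)  # assumption: a seed, so the sketch is reproducible
x <- factor(sample(letters[1:6], 300, replace = TRUE), levels = letters[1:6])

# Number of observations whose internal code is 1 or 6
n_target <- sum(as.numeric(x) %in% c(1, 6))

# Index the factor itself (300 elements) with a 300-element logical vector
x[as.numeric(x) %in% c(1, 6)] <- NA

table(x, useNA = "ifany")  # a and f remain as levels, now with zero counts
```

Both the factor and the index have 300 elements, so each observation is tested and replaced individually.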



Here, 1 and 6 are the internal codes that correspond to a and f.

Note that I have used the position of the levels rather than the levels themselves to convert the values to missing.

So far so good.

Now let’s assume that I would like to convert categories a and f to grades. The following code, I thought, would work, but it didn’t. It returns varying and erroneous answers.
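The attempted listing is not shown. It presumably reused the index from the recode to missing, this time on levels(x); the sketch below is a reconstruction under that assumption, not the original code.

```r
set.seed(1)  # assumption: any seed; change it and the result changes too
x <- factor(sample(letters[1:6], 300, replace = TRUE), levels = letters[1:6])

idx <- as.numeric(x) %in% c(1, 6)  # TRUE wherever an observation is a or f
y <- x                             # work on a copy so x stays intact
levels(y)[idx] <- "grades"         # runs without error, but is not what I meant
levels(y)                          # which levels were renamed depends on the sample
table(y, useNA = "ifany")
```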
However, when I refer to levels explicitly, the script works as intended. See the script below.
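A sketch of the label-based version, again a reconstruction with an assumed seed and sample:

```r
set.seed(42)  # assumption: a seed, so the sketch is reproducible
x <- factor(sample(letters[1:6], 300, replace = TRUE), levels = letters[1:6])

# Both sides of the index now have length 6: one entry per level
levels(x)[levels(x) %in% c("a", "f")] <- "grades"

levels(x)  # a and f are merged into the single level "grades"
table(x)
```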
Hence the question: why does one method work while the other doesn’t? I look forward to responses from R experts.

The Answer


lebatsnok (https://stackoverflow.com/users/2787952/lebatsnok) answered the question on Stack Overflow. The solution is simple. The following code works:
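The answer's listing is missing here. One position-based form that is consistent with the explanation quoted below indexes levels(x) directly with the two positions; treat it as a sketch rather than the exact code from the answer.

```r
set.seed(42)  # assumption: a seed, so the sketch is reproducible
x <- factor(sample(letters[1:6], 300, replace = TRUE), levels = letters[1:6])

# Positions 1 and 6 of the six-element levels vector are a and f
levels(x)[c(1, 6)] <- "grades"

levels(x)
table(x)
```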



The problem with my approach, as explained by lebatsnok, is the following:

"levels(x) is a character vector with length 6, as.numeric(x) is a logical vector with length 300. So you're trying to index a short vector with a much longer logical vector. In such an indexing, the index vector acts like a "switch", TRUE indicating that you want to see an item in this position in the output, and FALSE indicating that you don't. So which elements of levels(x) are you asking for? (This will be random, you can make it reproducible with set.seed if that matters."