Saturday, March 10, 2018

R: simple for complex tasks, complex for simple tasks

When it comes to undertaking complex data science projects, R is the preferred choice for many. Why? Because handling complex tasks is simpler in R than other comparable platforms.

Regrettably, the same is not true for performing simpler tasks, which I would argue is rather complex in base R. Hence, the title -- R: simple for complex tasks, complex for simple tasks.

Consider a simple, yet mandatory, task of generating summary statistics for a research project involving tabular data. In most social science disciplines, summary statistics for continuous variables require generating mean, standard deviation, number of observations, and perhaps minimum and maximum. One would have hoped to see a function in base R to generate such table. But there isn’t one.

Of course, several user-written packages, such as psyche, can generate descriptive statistics in a tabular format. However, this requires one to have advanced knowledge of R and the capabilities hidden in specialized packages whose number now exceed 12,000 (as of March 2018). Keeping abreast of the functionality embedded in user-written packages is time-consuming.

Some would argue that the summary command in base R is an option. I humbly disagree.

First, the output from summary is not in a tabular format that one could just copy and paste into a document. It would require significant processing before a simple table with summary statistics for more than one continuous variable could be generated. Second, summary command does not report standard deviation.

I teach business analytics to undergraduate and MBA students. While business students need to know statistics, they are not studying to become statisticians. Their goal in life is to be informed and proficient consumers of statistical analysis.

So, imagine an undergraduate class with 150 students learning to generate a simple table that reports summary statistics for more than one continuous variable. The simple task requires knowledge of several R commands. By the time one teaches these commands to students, most have made up their mind to do the analysis in Microsoft Excel instead.

Had there been a simple command to generate descriptive statistics in base R, this would not be a challenge for instructors trying to bring thousands more into R’s fold.

In the following paragraphs, I will illustrate the challenge with an example and identify an R package that generates a simple table of descriptive statistics.

I use mtcars dataset, which is available with R. The following commands load the dataset and display the first few observations with all the variables.

data(mtcars)
head(mtcars)

As stated earlier, one can use summary command to produce descriptive statistics.

summary(mtcars)

Let’s say one would like to generate descriptive statistics including mean, standard deviation, and the number of observations for the following continuous variables: mpg, disp, and hp. One can use the sapply command and generate the three statistics separately and combined them later using the cbind command.

The following command will create a vector of means.

mean.cars = with(mtcars, sapply(mtcars[c("mpg", "disp",  "hp")], mean))

Note that the above syntax requires someone learning R to know the following:

1.    Either to attach the dataset or to use with command so that sapply could recognize variables.
2.    Knowledge of subsetting variables in R
3.    Familiarity with c to combine variables
4.    Being aware of enclosing variable names in quotes

We can use similar syntax to determine standard deviation and the number of observations.

sd.cars = with(mtcars, sapply(mtcars[c("mpg", "disp",  "hp")], sd)); sd.cars
n.cars = with(mtcars, sapply(mtcars[c("mpg", "disp",  "hp")], length)); n.cars

Note that the user needs to know that the command for number of observations is length and for standard deviation is sd.

Once we have the three vectors, we can combine them using cbind that generates the following table.

cbind(n.cars, mean.cars, sd.cars)

     n.cars mean.cars    sd.cars
mpg      32  20.09062   6.026948
disp     32 230.72188 123.938694
hp       32 146.68750  68.562868

Again, one needs to know the round command to restrict the output to a specific number of decimals. See below the output with two decimal points.

round(cbind(n.cars, mean.cars, sd.cars),2)

     n.cars mean.cars sd.cars
mpg      32     20.09    6.03
disp     32    230.72  123.94
hp       32    146.69   68.56

One can indeed use a custom function to generate the same with one command. See below.

round(with(mtcars, t(sapply(mtcars[c("mpg", "disp",  "hp")],
                    function(x) c(n=length(x), avg=mean(x),
                    stdev=sd(x))))), 2)

      n    avg  stdev
mpg  32  20.09   6.03
disp 32 230.72 123.94
hp   32 146.69  68.56

But the question I have for my fellow instructors is the following. How likely is an undergraduate student taking an introductory course in statistical analysis to be enthused about R if the simplest of the tasks need multiple lines of codes? A simple function in base R could keep students focussed on interpreting data rather than worrying about missing a comma or a parenthesis.

stargazer* is an R package that simplifies this task. Here is the output from stargazer.

library(stargazer)
stargazer(mtcars[c("mpg", "disp",  "hp")], type="text")

============================================
Statistic N   Mean   St. Dev.  Min     Max  
--------------------------------------------
mpg       32 20.091   6.027   10.400 33.900 
disp      32 230.722 123.939  71.100 472.000
hp        32 146.688  68.563    52     335  
--------------------------------------------

A simple task, I argue, should be accomplished simply. My plea will be to include in base R a simple command that may generate the above table with a command as simple as the one below:

descriptives(mpg, disp, hp)


*  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.  R package version 5.2. http://CRAN.R-project.org/package=stargazer



6 comments:

  1. I agree, for a quite large part. I think, however, tidyverse makes the simple things at least a little simpler. The task performed above then becomes
    mtcars %>%
    select(mpg, disp, hp) %>%
    gather("Variable", "Value", mpg, disp, hp) %>%
    group_by(Variable) %>%
    summarise_all(funs(length, mean, sd, min, max))

    It still requires the student to learn quite a few things, so perhaps even more useful is to just tell the students that with R, a little bit of googling is part of the work flow. Also, tell the students that https://www.statmethods.net/index.html is a very good resource (which in this case quickly points you to Hmisc::describe, pastecs::stat.desc as well as psych::describe that you mention yourself).

    ReplyDelete
  2. You might want to take a look at Jmovi (https://www.jamovi.org/).

    It's point-and-click but reproducible, and based on R.

    We're starting to use it for researchers who want to do some of their own analysis. I'm thinking it would work for undergrad/MBA students where your goal isn't to teach them R, but for them to understand the analysis/stats concepts.

    ReplyDelete
  3. library(tidyverse)
    mtcars %>%
    select(mpg, disp, hp) %>%
    sapply(function(x) c(n = length(x), mu = mean(x), sigma = sd(x))) %>%
    t %>%
    round(2)

    n mu sigma
    mpg 32 20.09 6.03
    disp 32 230.72 123.94
    hp 32 146.69 68.56

    #################### now without the tidyverse

    mat <- sapply(subset(mtcars, select = c(mpg, disp, hp)),
    function(x) c(n = length(x), mu = mean(x), sigma = sd(x)))
    mat <- round(t(mat), 2)
    mat
    n mu sigma
    mpg 32 20.09 6.03
    disp 32 230.72 123.94
    hp 32 146.69 68.56

    ############ where's the problem?

    ReplyDelete
    Replies
    1. Greetings,

      Thank you for your suggestions. Also, I noted that I used the "with" command unnecessarily in the text. The following syntax works without the "with" option.

      mean.cars <- sapply(mtcars[c("mpg", "disp", "hp")], mean)
      sd.cars <- sapply(mtcars[c("mpg", "disp", "hp")], sd)
      n.cars <- sapply(mtcars[c("mpg", "disp", "hp")], length)

      cbind(n.cars, mean.cars, sd.cars)
      round(cbind(n.cars, mean.cars, sd.cars),2)

      round(t(sapply(mtcars[c("mpg", "disp", "hp")],
      function(x) c(n=length(x), avg=mean(x), stdev=sd(x)))), 2)

      Sincerely, Murtaza

      Delete
  4. Consider the following approach and whether you agree that my description of teaching it is within the ability of undergraduates to grasp.

    library(dplyr) # part of the tidyverse package/ecosystem to streamline processing
    data(mtcars) # even built-in data still has to be loaded explicitly
    fs <- c("mean", "sd") # create a vector of functions to be applied

    # select from mtcars the three columns of interest, pipe them to be summarized (collapsed into
    # a single row) consisting of the mean and standard deviation of each column then
    # pipe to be rounded to three places and finally add a new column for the number of observations
    # ASSUMED to be equal to the number of rows in the data frame; in the real world you have to
    # deal with missing data. With only 32 rows, we can confirm the lack of NAs by visual inspection

    select(mtcars, mpg, disp, hp) %>% summarize_all(funs_(fs)) %>% mutate_all(round, 3) %>% mutate(n = nrow(mtcars))

    I appreciate that in the context of an undergraduate course in business analysis, you shouldn't have to devote a substantial amount of time teaching R. (However, a prospective statistics major who can't master R should consider an alternative career.)

    What I'd suggest is to orient students to R by describing it a tool to solve alegbraic equations. They had to have grasped f(x) = y to make it to your classroom in the first place.

    In this example,

    1. They are going to be using functions provided by a contributed R package.
    2. They need to know how to install the package if they get a not-found error
    3. They need to know ls() to begin developing a notion of namespace, so if they are trying to do something with mtcars and it's a not found object, they can see for themselves that it's not there and they need to do something, which is to use the data function. Eventually, they will need to learn how to read in their own data.
    4. They need to know that <- and = are assignment operators that give a name some property
    5. c means combine into a vector
    6. A vector is a list of objects, analogous to either a row or a column of a spreadsheet
    7. The quotation marks within c are to use the *names* of the functions, not the functions themselves. Otherwise, there's no way of knowing the mean of what?
    8. select is a function from dplyr; it's arguments are the name of the object, the mtcars data frame, the columns or variables to be pulled out.
    9. %>% is a pipe that passes on the results from the left to the right
    10. summarize_all takes the result, a trimmed down data frame, and applies a function
    11. The function it applies is funs_, another dyplr feature that applies the functions in the vector fs to the trimmed down data frame.
    12. Another %>% sends the result downstream to be rounded
    13. Last is to add a column for n

    With that as an example, it's easy to drill.

    Redefine fs to obtain the minimum, maximum and median of mtcars, for mpg, disp and hp

    fs <- c("min", "max", "median")

    Do you need to make any changes to

    select(mtcars, mpg, disp, hp) %>% summarize_all(funs_(fs)) %>% mutate_all(round, 3) %>% mutate(n = nrow(mtcars))

    to get the new result?

    What would you do to get quantiles and IQR, the interquantile range?

    For statistics majors, get them out of cut and paste and into RMarkdown as soon as possible.

    ReplyDelete
  5. So there have been some comments about GUI-based options, and I'd plug one for radiant: http://vnijs.github.io/radiant/. I would also say tableone takes a shot at this as well: https://cran.r-project.org/web/packages/tableone/vignettes/introduction.html

    ReplyDelete