Saturday, March 10, 2018

R: simple for complex tasks, complex for simple tasks

When it comes to undertaking complex data science projects, R is the preferred choice for many. Why? Because handling complex tasks is simpler in R than other comparable platforms.

Regrettably, the same is not true for performing simpler tasks, which I would argue is rather complex in base R. Hence, the title -- R: simple for complex tasks, complex for simple tasks.

Consider a simple, yet mandatory, task of generating summary statistics for a research project involving tabular data. In most social science disciplines, summary statistics for continuous variables require generating mean, standard deviation, number of observations, and perhaps minimum and maximum. One would have hoped to see a function in base R to generate such table. But there isn’t one.

Of course, several user-written packages, such as psyche, can generate descriptive statistics in a tabular format. However, this requires one to have advanced knowledge of R and the capabilities hidden in specialized packages whose number now exceed 12,000 (as of March 2018). Keeping abreast of the functionality embedded in user-written packages is time-consuming.

Some would argue that the summary command in base R is an option. I humbly disagree.

First, the output from summary is not in a tabular format that one could just copy and paste into a document. It would require significant processing before a simple table with summary statistics for more than one continuous variable could be generated. Second, summary command does not report standard deviation.

I teach business analytics to undergraduate and MBA students. While business students need to know statistics, they are not studying to become statisticians. Their goal in life is to be informed and proficient consumers of statistical analysis.

So, imagine an undergraduate class with 150 students learning to generate a simple table that reports summary statistics for more than one continuous variable. The simple task requires knowledge of several R commands. By the time one teaches these commands to students, most have made up their mind to do the analysis in Microsoft Excel instead.

Had there been a simple command to generate descriptive statistics in base R, this would not be a challenge for instructors trying to bring thousands more into R’s fold.

In the following paragraphs, I will illustrate the challenge with an example and identify an R package that generates a simple table of descriptive statistics.

I use mtcars dataset, which is available with R. The following commands load the dataset and display the first few observations with all the variables.


As stated earlier, one can use summary command to produce descriptive statistics.


Let’s say one would like to generate descriptive statistics including mean, standard deviation, and the number of observations for the following continuous variables: mpg, disp, and hp. One can use the sapply command and generate the three statistics separately and combined them later using the cbind command.

The following command will create a vector of means. = with(mtcars, sapply(mtcars[c("mpg", "disp",  "hp")], mean))

Note that the above syntax requires someone learning R to know the following:

1.    Either to attach the dataset or to use with command so that sapply could recognize variables.
2.    Knowledge of subsetting variables in R
3.    Familiarity with c to combine variables
4.    Being aware of enclosing variable names in quotes

We can use similar syntax to determine standard deviation and the number of observations. = with(mtcars, sapply(mtcars[c("mpg", "disp",  "hp")], sd)); = with(mtcars, sapply(mtcars[c("mpg", "disp",  "hp")], length));

Note that the user needs to know that the command for number of observations is length and for standard deviation is sd.

Once we have the three vectors, we can combine them using cbind that generates the following table.

mpg      32  20.09062   6.026948
disp     32 230.72188 123.938694
hp       32 146.68750  68.562868

Again, one needs to know the round command to restrict the output to a specific number of decimals. See below the output with two decimal points.

mpg      32     20.09    6.03
disp     32    230.72  123.94
hp       32    146.69   68.56

One can indeed use a custom function to generate the same with one command. See below.

round(with(mtcars, t(sapply(mtcars[c("mpg", "disp",  "hp")],
                    function(x) c(n=length(x), avg=mean(x),
                    stdev=sd(x))))), 2)

      n    avg  stdev
mpg  32  20.09   6.03
disp 32 230.72 123.94
hp   32 146.69  68.56

But the question I have for my fellow instructors is the following. How likely is an undergraduate student taking an introductory course in statistical analysis to be enthused about R if the simplest of the tasks need multiple lines of codes? A simple function in base R could keep students focussed on interpreting data rather than worrying about missing a comma or a parenthesis.

stargazer* is an R package that simplifies this task. Here is the output from stargazer.

stargazer(mtcars[c("mpg", "disp",  "hp")], type="text")

Statistic N   Mean   St. Dev.  Min     Max  
mpg       32 20.091   6.027   10.400 33.900 
disp      32 230.722 123.939  71.100 472.000
hp        32 146.688  68.563    52     335  

A simple task, I argue, should be accomplished simply. My plea will be to include in base R a simple command that may generate the above table with a command as simple as the one below:

descriptives(mpg, disp, hp)

*  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.  R package version 5.2.

Wednesday, March 7, 2018

Is it time to ditch the Comparison of Means (T) Test?

For over a century, academics have been teaching the Student’s T-test and practitioners have been running it to determine if the mean values of a variable for two groups were statistically different.

It is time to ditch the Comparison of Means (T) Test and rely instead on the ordinary least squares (OLS) Regression.

My motivation for this suggestion is to reduce the learning burden on non-statisticians whose goal is to find a reliable answer to their research question. The current practice is to devote a considerable amount of teaching and learning effort on statistical tests that are redundant in the presence of disaggregate data sets and readily available tools to estimate Regression models.

Before I proceed any further, I must confess that I remain a huge fan of William Sealy Gosset who introduced the T-statistic in 1908. He excelled in intellect and academic generosity. Mr. Gosset published the very paper that introduced the t-statistic under a pseudonym, the Student. To this day, the T-test is known as the Student’s T-test.   

My plea is to replace the Comparison of Means (T-test) with OLS Regression, which of course relies on the T-test. So, I am not necessarily asking for ditching the T-test but instead asking to replace the Comparison of Means Test with OLS Regression.

The following are my reasons:

1.       Pedagogy related reasons:
a.       Teaching Regression instead of other intermediary tests will save instructors considerable time that could be used to illustrate the same concepts with examples using Regression.
b.       Given that there are fewer than 39 hours of lecture time available in a single semester introductory course on applied statistics, much of the course is consumed in covering statistical tests that would be redundant if one were to introduce Regression models sooner in the course.
c.        Academic textbooks for undergraduate students in business, geography, psychology dedicate huge sections to a battery of tests that are redundant in the presence of Regression models.
                                                               i.      Consider the widely used textbook Business Statistics by Ken Black that requires students and instructors to leaf through 500 pages before OLS Regression is introduced.
                                                             ii.      The learning requirements of undergraduate and graduate students not enrolled in economics, mathematics, or statistics programs are quite different. Yet most textbooks and courses attempt to turn all students into professional statisticians.
2.       Applied Analytics reasons
a.       OLS Regression model with a continuous dependent variable and a dichotomous explanatory variable produces the exact same output as the standard Comparison of Means Test.
b.       Extending the comparison to more than two groups is a straightforward extension in Regression where the dependent variable will comprise more than two groups.
                                                               i.      In the tradition statistics teaching approach, one advises students that the T-test is not valid to compare the means for more than two groups and that we must switch to learning a new method, ANOVA.
                                                             ii.      You might have caught on my drift that I am also proposing to replace teaching ANOVA in introductory statistics courses with OLS Regression.
c.        A Comparison of Means Test illustrated as a Regression model is much easier to explain than explaining the output from a conventional T-test.

After introducing Normal and T-distributions, I would, therefore, argue that instructors should jump straight to Regression models.

 Is Regression a valid substitute for T-tests?

In the following lines, I will illustrate that the output generated by OLS Regression models and a Comparison of Means test are identical. I will illustrate examples using Stata and R.


I will use Professor Daniel Hamermesh’s data on teaching ratings to illustrate the concepts. In a popular paper, Professor Hamermesh and Amy Parker explore whether good looking professors receive higher teaching evaluations. The dataset comprises teaching evaluation score, beauty score, and instructor/course related metrics for 463 courses and is available for download in R, Stata, and Excel formats at: 

Hypothetically Speaking

Let us test the hypothesis that the average beauty score for male and female instructors is statistically different. The average (normalized) beauty score for male instructors was -0.084 for male instructors and 0.116 for female instructors. The following Box Plot illustrates the difference in beauty scores. The question we would like to answer is whether this apparent difference is statistically significant.

I first illustrate the Comparison of Means Test and OLS Regression assuming Equal Variances in Stata.

 Assuming Equal Variances (Stata)

Download data
       use ""
encode gender, gen(sex) // To convert a character variable into a numeric variable.

The T-test is conducted using:

       ttest beauty, by(sex)

The above generates the following output:

As per the above estimates, the average beauty score of female instructors is 0.2 points higher and the t-test value is 2.7209. We can generate the same output by running an OLS regression model using the following command:

reg beauty

The regression model output is presented below.

Note that the average beauty score for male instructors is -0.2 points lower than that of females and the associated standard errors and t-values (highlighted in yellow) are identical to the ones reported in the Comparison of Means test. 

Unequal Variances

But what about unequal variances? Let us first conduct the t-test using the following syntax:

ttest beauty, by(sex) unequal

The output is presented below:

Note the slight change in standard error and the associated t-test.

To replicate the same results with a Regression model, we need to run a different Stata command that estimates a variance weighted least squares regression. Using Stata’s vwls command:

vwls beauty

Note that the last two outputs are identical. 

Repeating the same analysis in R

To download data in R, use the following syntax:

url = ""

For equal variances, the following syntax is used for T-test and the OLS regression model.
t.test(beauty ~ gender, var.equal=T)
summary(lm(beauty ~ gender))

The above generates the following identical output as Stata.

For unequal variances, we need to install and load the nlme package to run a gls version of the variance weighted least square Regression model.

t.test(beauty ~ gender)
summary(gls(beauty ~ gender,  weights=varIdent(form = ~ 1 | gender)))

The above generates the following output:

So there we have it, OLS Regression is an excellent substitute for the Comparison of Means test.