eKonometrics

Handling Large Investment Datasets with R: A Powerful Solution to Data Management Challenges

2024-10-06T12:53:00.001-04:00

Dealing with large datasets—where observations run into the millions and file sizes reach gigabytes or more—can be daunting for many data practitioners. However, there is no shortage of specialized tools, many of which are open source, that offer efficient solutions for such challenges.

I illustrate the challenges of handling large datasets, and one solution, with a recent example. Statistics Canada recently released a comprehensive dataset on investor-owned condominiums, which I intend to analyze in depth in the near future. Before diving into that analysis, I want to illustrate the data-handling capabilities of R, specifically the data.table package, and how it enables rapid and efficient data manipulation.

The Dataset: Size and Complexity

To set the context, the dataset in its compressed (zipped) form was around 244 MB. At first glance, this may seem like a small file; most modern software can easily handle this size. However, once uncompressed, the CSV file expands to a considerable 5.86 GB, comprising 24,227,546 rows. Data at this scale can be quite demanding for analysts working on standard laptops or desktops with limited RAM and processing power. Handling and processing such a large dataset requires a robust toolset, and this is where R shines.

The Solution: Efficient Data Handling with data.table in R

R has long been a staple in the toolkit of statisticians and data scientists, and its data.table package is particularly powerful when working with large datasets. After downloading the "entire table" comprising the investor condominium data from Statistics Canada's website, my initial task was to import and inspect the dataset's contents. Using data.table, I was able to load the full 5.86 GB dataset within seconds, navigate efficiently through the rows and columns, remove unnecessary columns, and prepare the dataset for future analysis.

The magic lies in how data.table handles memory and its optimized use of indexing, which greatly reduces the time needed to import and manipulate data. While R’s base functions are capable, data.table goes a step further by providing faster processing speeds and a syntax tailored for concise and efficient operations.
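A minimal sketch of this import-and-trim workflow follows. It assumes the data.table package is installed. The small temporary CSV and its column names (REF_DATE, GEO, VALUE, SYMBOL) are hypothetical stand-ins, not the actual Statistics Canada file.

```r
library(data.table)

# Hypothetical stand-in for the large CSV: a tiny file written to a temp path
csv_path <- tempfile(fileext = ".csv")
fwrite(
  data.table(
    REF_DATE = c(2020, 2021, 2022),
    GEO      = c("Ontario", "Ontario", "British Columbia"),
    VALUE    = c(10.5, 11.2, 9.8),
    SYMBOL   = NA_character_        # an empty column we do not need
  ),
  csv_path
)

dt <- fread(csv_path)               # fast CSV reader from data.table
dt[, SYMBOL := NULL]                # drop a column by reference, without copying the table

# Filter and summarise in one concise expression
ont <- dt[GEO == "Ontario", .(mean_value = mean(VALUE))]
print(ont)
```

On a real multi-gigabyte file, the select argument of fread can also be used to read only the columns of interest, which reduces memory use further.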

Compression and Storage Efficiency

One surprising outcome of this exercise was the significant reduction in file size after processing. The original zipped file, at 244 MB, ballooned to 5.86 GB when unzipped as a CSV. However, after importing the data into R, performing some light cleaning, and saving it in R’s native format (.rda or .RData), the file size was reduced to a mere 18.6 MB on the hard drive. This is a remarkable reduction from the original CSV and far smaller than the original zipped format.
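The size comparison can be reproduced on a small scale with base R alone. The sketch below uses made-up, highly repetitive data (an assumption; the real dataset's structure may differ) and the xz compression option of save(), which is one of several choices and not necessarily the one used for the figures above.

```r
# Hypothetical data: long columns with few distinct values compress very well
df <- data.frame(
  geo   = rep(c("Ontario", "Quebec"), each = 50000),
  value = rep(1:10, times = 10000)
)

csv_path <- tempfile(fileext = ".csv")
rda_path <- tempfile(fileext = ".rda")

write.csv(df, csv_path, row.names = FALSE)     # plain-text CSV
save(df, file = rda_path, compress = "xz")     # compressed native R format

sizes <- c(csv = file.size(csv_path), rda = file.size(rda_path))
print(sizes)                                   # sizes in bytes

# Check that the saved object round-trips unchanged
restored <- new.env()
load(rda_path, envir = restored)
stopifnot(identical(restored$df, df))
```

Repetitive categorical text, which is common in statistical agency tables, is one plausible reason a native R file can end up much smaller than the CSV it came from.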

Read the rest of the blog at https://www.linkedin.com/pulse/handling-large-investment-datasets-r-powerful-solution-murtaza-haider-3j2tc/.

How to append two tables in R Markdown?

2022-09-27T14:16:00.006-04:00

Here is the task: how to append two tables using R Markdown? The need arose because I was demonstrating to graduate students in a research methods course how to prepare Table 1, which often covers descriptive statistics in an empirical paper.

I used the tableone package in R to compute the summary statistics. The task was to replicate the first table from Prof. Daniel Hamermesh’s paper that explored whether instructors’ appearance and looks influenced the teaching evaluation score assigned by the students. Since Prof. Hamermesh computed some summary statistics using weighted data, such as weighted means and weighted standard deviations, and others using non-weighted data with regular means and standard deviations, I relied on two different commands in tableone to compute summary statistics.

The challenge was to combine the output from the two tables into one table. Once I generated the two tables separately, I used kables() and list() to generate the appended table. I needed the knitr and kableExtra packages to format the table. Here is how the appended table looks.

Here are the steps involved.

Assume that you have two tables generated by either svyCreateTableOne or CreateTableOne commands. Let’s store the results in objects tab1 and tab2.
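For readers who want something concrete to start from, here is a minimal sketch of how tab1 and tab2 might be created. It assumes the tableone and survey packages are installed, uses the built-in mtcars data in place of the teaching ratings data, and invents a weight column w purely for illustration.

```r
library(tableone)
library(survey)

dat   <- mtcars
dat$w <- rep(c(1, 2), length.out = nrow(dat))   # hypothetical sampling weights
vars  <- c("mpg", "hp", "wt")                   # variables to summarise

# Weighted summary: svyCreateTableOne() needs a survey design object
des  <- svydesign(ids = ~1, weights = ~w, data = dat)
tab1 <- svyCreateTableOne(vars = vars, data = des)

# Unweighted summary of the same variables
tab2 <- CreateTableOne(vars = vars, data = dat)
```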

In R Markdown using RStudio, print the tables to objects arbitrarily named p and p2. See the code below. The results='hide' option is needed if you do not want the tables printed as text in the draft.

```{r, echo=FALSE, results='hide'}
# print() returns the formatted table as a matrix; keep it for kable()
p  <- print(tab1)
p2 <- print(tab2)
```

The amalgamated table used the following script. Note some important considerations.

I used bottomrule=NULL to suppress the horizontal line for the table on the top.
I used column_spec(1, width = '1.75in') for both tables so that the second and subsequent columns line up vertically. Otherwise, they will appear staggered.
I used col.names = NULL to suppress column names for the bottom table because the column names are the same for both tables.
I used column_spec(5, width = '.7in') to ensure that the horizontal lines drawn for the bottom table match the width of the horizontal line on top of the first table.
I used kable_styling(latex_options = "HOLD_position") to ensure that the table appears at the correct place in the text.

I wish there was an easy command to fix the table width, but I didn’t find one. Still, I am quite pleased with the final output. I look forward to seeing ideas on improving the layout of appended tables.

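```{r, echo=FALSE}
library(knitr)       # kable() and kables()
library(kableExtra)  # column_spec(), kable_styling(), and the %>% pipe

kables(
  list(
    # Top table: suppress the bottom rule so the two tables read as one
    kable(p, booktabs = TRUE, format = "latex", valign = "t", bottomrule = NULL) %>%
      column_spec(1, width = "1.75in"),
    # Bottom table: drop the repeated column names
    kable(p2, booktabs = TRUE, format = "latex", valign = "t", col.names = NULL) %>%
      column_spec(1, width = "1.75in") %>%
      column_spec(5, width = ".7in")
  ),
  caption = "Weighted and unweighted data"
) %>%
  kable_styling(latex_options = "HOLD_position")
```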

Modern Data Science with R: A review

2019-05-20T23:29:00.000-04:00

Some say data is the new oil. Others equate its worth to water. And then there are those who believe that data scientists will be (in fact, they already are) one of the most sought-after workers in knowledge economies.

Millions of data-centric jobs require millions of trained data scientists. However, the installed capacity of graduate and undergraduate programs in data science is nowhere near meeting this demand over the next many years.

So how do we produce data scientists?

Given the enormous demand for data scientists and the fixed supply from higher education institutions, it is quite likely that one must look beyond colleges and universities to train a large number of data scientists desired by the marketplace.

Getting trained on the job is one possible route. This will require repurposing the existing workforce. To prepare the current workforce for data science, one needs training manuals. One such manual is Modern Data Science with R (MDSR).

Published by CRC Press (Taylor and Francis Group) and authored by three leading academics, Professors Baumer, Kaplan, and Horton, MDSR is the missing manual for data science with R. The book is equally relevant to data science programs in higher ed as it is to practitioners who would like to embark on a career in data science or to get a taste of an aspect of data science that they have not explored in the past.

As the book’s name suggests, the text is based on R, one of the most popular and versatile computing platforms. R is free software and is being developed by thousands of volunteers in real time. In addition to base R, which comes bundled with thousands of commands and functions, the user-written packages, whose number has exceeded 14,000 (as of May 2019), further expand the universe of features, making R perhaps the most diverse computing platform.

Despite the immense popularity of data science, only a handful of titles focus exclusively on the topic. Hadley Wickham’s R for Data Science and R Cookbook by Paul Teetor are two other worthy texts. MDSR is unique in the sense that it serves as an introduction to a whole host of analytic techniques that are seldom covered in one title.

In the following paragraphs, I’ll discuss the salient features of the textbook. I begin with my favourite attribute of the book that deals with its organization. Instead of muddling with theories and philosophies, the book gets straight to business and starts the conversation with data visualization. A graphic is worth a thousand words, and MDSR is proof of it.

And since Hadley Wickham’s influence on data science is ubiquitous, MDSR also embraces Wickham’s implementation of Grammar of Graphics in R with one of the most popular R packages, ggplot2.

Another avenue where Wickham’s influence is widely felt is data wrangling. A suite of R packages bundled under the broader rubric of Tidyverse is influencing how data scientists manipulate small and big data. Chapter 4 in MDSR is perhaps one of the best and most succinct introductions to data wrangling with R and Tidyverse. From the simplest to more advanced examples, MDSR equips the beginner with the basics and the advanced user with new ways to think about analyzing data.

A key feature of MDSR is that it’s not another book on statistics or econometrics with R. Yours truly is guilty of authoring one such book. Instead, MDSR is a book focused squarely on data manipulation. The treatment of statistical topics is not absent from the book; however, it’s not the book’s focus. It is for this reason that the discussion on Regression models is in the appendices.

But make no mistake, even when statistics is not the focus, MDSR offers sound advice on the practice of statistics. Section 7.7, The perils of p-values, warns the novice statisticians about not becoming the unsuspecting victims of hypothesis testing.

The book’s distinguishing feature remains the diversity of data science challenges it addresses. For instance, in addition to data visualization, the book offers an introduction to interactive data graphics and dynamic data visualization.

At the same time, the book covers other diverse topics, such as database management with SQL, working with spatial data, analyzing text-based (non-structured) data, and the analysis of networks. A discussion about ethics in data science is undoubtedly a welcome feature in the book.

The book is punctuated with hundreds of useful and hands-on data science examples and exercises, providing ample opportunities to put concepts into practice. The book’s accompanying website offers additional resources and code examples. At the time of this review, not all code was available for download.

Also, while I was able to reproduce the more straightforward examples, I ran into trouble with complex ones. For instance, I could not generate advanced spatial maps showing flight origins and destinations.

My recommendation to the authors would be to maintain an active supporting website because R packages are known to evolve, and some functionality may change or disappear over time. For instance, the mapping functions that are part of the ggmap package now require a Google Maps API key or else the maps will not display. This change likely occurred after the book was published.

In summary, for aspiring and experienced data scientists, Modern Data Science with R is a book deserving to be in their personal libraries.

Murtaza Haider lives in Toronto and teaches in the Department of Real Estate Management at Ryerson University. He is the author of Getting Started with Data Science: Making Sense of Data with Analytics, which was published by the IBM Press/Pearson.

A question and an answer about recoding several factors simultaneously in R

2018-10-08T12:56:00.001-04:00

Data manipulation is a breeze with amazing packages like plyr and dplyr. Recoding factors, which could prove to be a daunting task especially for variables that have many categories, can easily be accomplished with these packages. However, it is important for those learning data science to understand how base R works.

In this regard, I seek help from R specialists about recoding factors using base R. My question is about why one notation in recoding factors works while the other doesn’t. I’m sure for R enthusiasts, the answer and solution are straightforward. So, here’s the question.

In the following code, I generate a vector with six categories (a through f) and 300 observations. I convert the vector to a factor and tabulate it.
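The original listing is not reproduced here, so what follows is only a sketch of what such code might look like. Six letters, a to f, are assumed because the later discussion refers to levels 1 and 6; the seed is an addition for reproducibility.

```r
set.seed(123)                                   # assumption: added so results repeat
x <- sample(letters[1:6], 300, replace = TRUE)  # 300 draws from a, b, c, d, e, f
x <- factor(x)                                  # convert the character vector to a factor

table(x)                # tabulate the categories
levels(x)               # the character labels
head(as.numeric(x))     # the internal integer codes behind the labels
```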

Note that by using as.numeric, I could see the internal integer codes behind the respective character labels. Let’s say I would like to recode categories a and f as missing. I can accomplish this with the following code.
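A sketch of such a recode, under the same assumptions (a six-level factor of 300 simulated observations, with a seed added for reproducibility):

```r
set.seed(123)
x <- factor(sample(letters[1:6], 300, replace = TRUE))

# The index has one entry per observation, so it lines up with x itself
x[as.numeric(x) == 1 | as.numeric(x) == 6] <- NA

table(x, useNA = "ifany")   # a and f now have zero counts; NA holds the rest
```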

Where 1 and 6 correspond to a and f.

Note that I have used the position of the levels rather than the levels themselves to convert the values to missing.

So far so good.

Now let’s assume that I would like to convert categories a and f to grades. The following code, I thought, would work, but it didn’t. It returns varying and erroneous answers.
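The failing attempt presumably looked something like the sketch below, which reuses the observation-level index on levels(x). This is a reconstruction under the same assumptions as before, not the original listing. Comparing the two lengths shows where the mismatch comes from.

```r
set.seed(123)
x <- factor(sample(letters[1:6], 300, replace = TRUE))

idx <- as.numeric(x) == 1 | as.numeric(x) == 6   # one TRUE/FALSE per observation

length(levels(x))   # 6 levels
length(idx)         # 300 index entries

sel <- levels(x)[idx]   # what the index actually picks out of levels(x)
print(sel)              # positions beyond the sixth come back as NA

y <- x                        # work on a copy
levels(y)[idx] <- "grades"    # the problematic assignment
levels(y)                     # result depends on the order of the observations
```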

However, when I refer to levels explicitly, the script works as intended. See the script below.
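A sketch of the version that refers to the levels explicitly, again on simulated data with an assumed seed:

```r
set.seed(123)
x <- factor(sample(letters[1:6], 300, replace = TRUE))

# The index is built from levels(x), so it has exactly one entry per level
levels(x)[levels(x) == "a" | levels(x) == "f"] <- "grades"

levels(x)
table(x)
```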

Hence the question: Why does one method work while the other doesn’t? Looking forward to responses from R experts.

The Answer

lebatsnok (https://stackoverflow.com/users/2787952/lebatsnok) answered the question on Stack Overflow. The solution is simple. The following code works:
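The answer's listing is not reproduced here. One form that works and is consistent with the explanation quoted below indexes levels(x) by position; treat it as an assumption about the answer rather than a copy of it.

```r
set.seed(123)
x <- factor(sample(letters[1:6], 300, replace = TRUE))

# Index the six levels directly by position: 1 is "a", 6 is "f"
levels(x)[c(1, 6)] <- "grades"

levels(x)
```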

The problem with my approach, as explained by lebatsnok, is the following:

"levels(x) is a character vector with length 6, as.numeric(x) is a logical vector with length 300. So you're trying to index a short vector with a much longer logical vector. In such an indexing, the index vector acts like a "switch", TRUE indicating that you want to see an item in this position in the output, and FALSE indicating that you don't. So which elements of levels(x) are you asking for? (This will be random, you can make it reproducible with set.seed if that matters."

Edward Tufte’s Slopegraphs and political fortunes in Ontario

2018-05-21T10:21:00.002-04:00

With fewer than three weeks left before the June 7 provincial elections in Ontario, Canada’s most populous province with 14.2 million persons, the expected outcome is far from certain.

The weekly opinion polls reflect the volatility in public opinion. The Progressive Conservatives (PC), one of the main opposition parties, are in the lead with the support of roughly 40 percent of the electorate. The incumbent Ontario Liberals are trailing, with their support hovering in the low 20s.

The real story in these elections is the unexpected rise in the fortunes of the New Democratic Party (NDP), which has seen a sustained increase in its popularity from less than 20 percent a few weeks ago to the mid-30s.

As a data scientist/journalist, I have been concerned with how best to represent this information. A scatter plot of sorts would do. However, I would like to demonstrate the change in political fortunes over time with the x-axis representing time. Hence, a time series chart would be more appropriate.

Ideally, I would like to plot what Edward Tufte called a Slopegraph. Tufte, in his 1983 book The Visual Display of Quantitative Information, explained that “Slopegraphs compare changes usually over time for a list of nouns located on an ordinal or interval scale”.

But here’s the problem. No software offers a readymade solution to draw a Slopegraph.

Luckily, I found a way, in fact, two ways, around the challenge with help from colleagues at Stata and R (plotrix).

So, what follows in this blog is the story of the elections in Ontario described with data visualized as Slopegraphs. I tell the story first with Stata and then with the plotrix package in R.

My interest grew in Slopegraphs when I wanted to demonstrate the steep increase in highly leveraged mortgage loans in Canada from 2014 to 2016. I generated the chart in Excel and sent it to Stata requesting help to recreate it.

Stata assigned my request to Derek Wagner whose excellent programming skills resulted in the following chart.

Derek built the chart on the linkplot command built by the uber Stata guru, Professor Nicholas J. Cox. However, a straightforward application of linkplot still required a lot of tweaks that Derek very ably managed. For comparison, see the initial version of the chart generated by linkplot below.

We made the following modifications to the base linkplot:

1. Narrow the plot by reducing the space between the two time periods.

2. Label the entities and their respective values at the primary and secondary y-axes.

3. Add a title and footnotes (if necessary).

4. Label time periods with custom names.

5. Colour lines and symbols to match preferences.

Once we apply these tweaks, a Slopegraph with the latest poll data for Ontario’s election is drawn as follows.

Notice that in fewer than two weeks, the NDP has jumped from 29 percent to 34 percent, almost tying with the leading PC party, whose support has remained steady at 35 percent. The incumbent Ontario Liberals appear to be in free fall, from 29 percent to 24 percent.

I must admit that I have sort of cheated in the above chart. Note that both Liberals and NDP secured 29 percent of the support in the poll conducted on May 06. In the original chart drawn with Stata’s code, their labels overlapped resulting in unintelligible text. I fixed this manually by manipulating the image in PowerPoint.
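The overlapping labels can also be separated in code rather than in PowerPoint. The sketch below draws a bare-bones slopegraph with base R graphics using the poll numbers quoted above. It is not the linkplot code or any plotrix function; the column label for the later poll and the 0.8-point nudge are arbitrary choices made for illustration.

```r
# Poll shares quoted in the text: May 6 poll versus the later poll
polls <- data.frame(
  party  = c("PC", "NDP", "Liberal"),
  before = c(35, 29, 29),
  after  = c(35, 34, 24)
)

# Nudge tied left-hand labels apart so they do not print on top of each other
left_y <- polls$before
dup <- duplicated(left_y)
left_y[dup] <- left_y[dup] - 0.8

# Empty canvas, then one line segment per party
plot(NULL, xlim = c(0.5, 2.5),
     ylim = range(polls$before, polls$after) + c(-1, 1),
     axes = FALSE, xlab = "", ylab = "",
     main = "Ontario election polls (percent support)")
segments(1, polls$before, 2, polls$after, lwd = 2)

# Party and value labels on both sides
text(1, left_y, paste(polls$party, polls$before), pos = 2)
text(2, polls$after, paste(polls$after, polls$party), pos = 4)
axis(3, at = 1:2, labels = c("May 6", "Later poll"), tick = FALSE)
```

Only the label positions are nudged; the line endpoints still sit at the true values.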

I wanted to replicate the above chart in R. I tried a few packages, but nothing really worked until I landed on the plotrix package that carries the bumpchart command. In fact, Edward Tufte in Beautiful Evidence (2006) mentions that bumpcharts may be considered as slopegraphs.

A straightforward application of bumpchart from the plotrix package labelled the party names but not the respective percentages of support each party commanded.

Dr. Jim Lemon authored bumpchart. I turned to him for help. Jim was kind enough to write a custom function, bumpchart2, that I used to create a Slopegraph like the one I generated with Stata. For comparison, see the chart below.

As with the Slopegraph generated with Stata, I manually manipulated the labels to prevent NDP and Liberal labels from overlapping.

Data scientists must dig even deeper

The job of a data scientist, unlike a computer scientist or a statistician, is not done by estimating models and drawing figures. A data scientist must tell a story with all caveats that might apply. So, here’s the story about what can go wrong with polls.

The most important lesson about forecasting from Brexit and the last US Presidential elections is that one cannot rely on polls to determine the future electoral outcomes. Most polls in the UK predicted a NO vote for Brexit. In the US, most polls forecasted Hillary Clinton to be the winner. Both forecasts went horribly wrong.

When it comes to polls, one must determine who sponsored the poll, what methods were used, and how representative the sample is of the underlying population. Asking the wrong question of the right people or posing the right question to the wrong people (a non-representative sample) can deliver problematic results.

Polling is as much a science as it is an art. The late Warren Mitofsky, who pioneered exit polls and innovated political survey research, remains a legend in political polling. His painstakingly cautious approach to polling is why he remains a respected name in market research.

Today, advances in communication and information technologies have made survey research easier to conduct but harder to make precise. No longer can one rely on random digit dialling, a Mitofsky innovation, to reach a representative sample. Younger cohorts rarely subscribe to landlines. Attempts to catch them online pose the risk of fishing for opinions in echo chambers.

Add political polarization to technological challenges, and one realizes the true scope of the difficulties inherent in the task of taking the political pulse of an electorate where motivated pollsters may be after not the truth, but a convenient version of it.

Polls also differ by survey instrument, methodology, and sample size. The Abacus Data poll presented above is essentially an online poll of 2,326 respondents. In comparison, a poll by Mainstreet Research used an Interactive Voice Response (IVR) system with a sample size of 2,350 respondents. IVR uses automated, computerized prompts over the telephone to record responses.

Abacus Data and Mainstreet Research use quite different methods with similar sample sizes. Professor Dan Cassino of Fairleigh Dickinson University explained the challenges with polling techniques in a 2016 article in the Harvard Business Review. He favours live telephone interviewers who “are highly experienced and college educated and paying them is the main cost of political surveys.”

Professor Cassino believes that techniques like IVR make “polling faster and cheaper,” but these systems are hardly foolproof with lower response rates. They cannot legally reach cellphones. “IVR may work for populations of older, whiter voters with landlines, such as in some Republican primary races, but they’re not generally useful,” explained Professor Cassino.

Similarly, online polls are limited in the sense that in the US alone 16 percent of Americans don’t use the Internet.

With these caveats in mind, a plot of Mainstreet Research data reveals quite a different picture where the NDP doesn’t seem to pose an immediate and direct challenge to the PC party.

So, here’s the summary. A Slopegraph is a useful tool to summarize change over time between distinct entities. Ontario is likely to have a new government on June 7. It is, though, far from certain whether the PC Party or the NDP will assume office. Nevertheless, Slopegraphs generate visuals that expose the uncertainty in the forthcoming elections.

Note: To generate the charts in this blog, you can download data and code for Stata and Plotrix (R) by clicking HERE.

R: simple for complex tasks, complex for simple tasks

2018-03-10T20:33:00.000-05:00

When it comes to undertaking complex data science projects, R is the preferred choice for many. Why? Because handling complex tasks is simpler in R than other comparable platforms.

Regrettably, the same is not true for performing simpler tasks, which I would argue is rather complex in base R. Hence, the title -- R: simple for complex tasks, complex for simple tasks.

Consider a simple, yet mandatory, task of generating summary statistics for a research project involving tabular data. In most social science disciplines, summary statistics for continuous variables require generating the mean, standard deviation, number of observations, and perhaps the minimum and maximum. One would have hoped to see a function in base R to generate such a table. But there isn’t one.

Of course, several user-written packages, such as psych, can generate descriptive statistics in a tabular format. However, this requires one to have advanced knowledge of R and the capabilities hidden in specialized packages whose number now exceeds 12,000 (as of March 2018). Keeping abreast of the functionality embedded in user-written packages is time-consuming.

Some would argue that the summary command in base R is an option. I humbly disagree.

First, the output from summary is not in a tabular format that one could just copy and paste into a document. It would require significant processing before a simple table with summary statistics for more than one continuous variable could be generated. Second, the summary command does not report the standard deviation.

I teach business analytics to undergraduate and MBA students. While business students need to know statistics, they are not studying to become statisticians. Their goal in life is to be informed and proficient consumers of statistical analysis.

So, imagine an undergraduate class with 150 students learning to generate a simple table that reports summary statistics for more than one continuous variable. The simple task requires knowledge of several R commands. By the time one teaches these commands to students, most have made up their mind to do the analysis in Microsoft Excel instead.

Had there been a simple command to generate descriptive statistics in base R, this would not be a challenge for instructors trying to bring thousands more into R’s fold.

In the following paragraphs, I will illustrate the challenge with an example and identify an R package that generates a simple table of descriptive statistics.

I use mtcars dataset, which is available with R. The following commands load the dataset and display the first few observations with all the variables.

data(mtcars)

head(mtcars)

As stated earlier, one can use summary command to produce descriptive statistics.

summary(mtcars)

Let’s say one would like to generate descriptive statistics including mean, standard deviation, and the number of observations for the following continuous variables: mpg, disp, and hp. One can use the sapply command to generate the three statistics separately and combine them later using the cbind command.

The following command will create a vector of means.

mean.cars = with(mtcars, sapply(mtcars[c("mpg", "disp", "hp")], mean))

Note that the above syntax requires someone learning R to know the following:

1. Either to attach the dataset or to use with command so that sapply could recognize variables.

2. Knowledge of subsetting variables in R

3. Familiarity with c to combine variables

4. Being aware of enclosing variable names in quotes

We can use similar syntax to determine standard deviation and the number of observations.

sd.cars = with(mtcars, sapply(mtcars[c("mpg", "disp", "hp")], sd)); sd.cars

n.cars = with(mtcars, sapply(mtcars[c("mpg", "disp", "hp")], length)); n.cars

Note that the user needs to know that the command for number of observations is length and for standard deviation is sd.

Once we have the three vectors, we can combine them using cbind that generates the following table.

cbind(n.cars, mean.cars, sd.cars)

     n.cars mean.cars    sd.cars
mpg      32  20.09062   6.026948
disp     32 230.72188 123.938694
hp       32 146.68750  68.562868

Again, one needs to know the round command to restrict the output to a specific number of decimals. See below the output with two decimal points.

round(cbind(n.cars, mean.cars, sd.cars),2)

     n.cars mean.cars sd.cars
mpg      32     20.09    6.03
disp     32    230.72  123.94
hp       32    146.69   68.56

One can indeed use a custom function to generate the same with one command. See below.

round(with(mtcars, t(sapply(mtcars[c("mpg", "disp", "hp")],
                            function(x) c(n = length(x), avg = mean(x),
                                          stdev = sd(x))))), 2)

      n    avg  stdev
mpg  32  20.09   6.03
disp 32 230.72 123.94
hp   32 146.69  68.56

But the question I have for my fellow instructors is the following. How likely is an undergraduate student taking an introductory course in statistical analysis to be enthused about R if the simplest of tasks needs multiple lines of code? A simple function in base R could keep students focussed on interpreting data rather than worrying about missing a comma or a parenthesis.

stargazer* is an R package that simplifies this task. Here is the output from stargazer.

library(stargazer)

stargazer(mtcars[c("mpg", "disp",  "hp")], type="text")

============================================
Statistic N   Mean   St. Dev.  Min     Max  
--------------------------------------------
mpg       32 20.091   6.027   10.400 33.900 
disp      32 230.722 123.939  71.100 472.000
hp        32 146.688  68.563    52     335  
--------------------------------------------

A simple task, I argue, should be accomplished simply. My plea will be to include in base R a simple command that may generate the above table with a command as simple as the one below:

descriptives(mpg, disp, hp)
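No such function exists in base R. As a rough sketch of what one might look like, the hypothetical helper below takes a data frame and quoted variable names (a slightly different call signature from the wished-for one above) and returns the same five statistics that stargazer reports.

```r
# Hypothetical helper, not part of base R or of any package
descriptives <- function(data, ...) {
  vars <- c(...)                         # variable names passed as strings
  stats <- sapply(data[vars], function(x) {
    c(N          = sum(!is.na(x)),       # non-missing observations
      Mean       = mean(x, na.rm = TRUE),
      "St. Dev." = sd(x, na.rm = TRUE),
      Min        = min(x, na.rm = TRUE),
      Max        = max(x, na.rm = TRUE))
  })
  round(t(stats), 3)                     # one row per variable
}

res <- descriptives(mtcars, "mpg", "disp", "hp")
print(res)
```

Instructors could define a helper like this once at the start of a course so that students call a single function afterwards.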

* Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2. http://CRAN.R-project.org/package=stargazer

Is it time to ditch the Comparison of Means (T) Test?

2018-03-07T10:02:00.002-05:00

For over a century, academics have been teaching the Student’s T-test and practitioners have been running it to determine if the mean values of a variable for two groups were statistically different.

It is time to ditch the Comparison of Means (T) Test and rely instead on the ordinary least squares (OLS) Regression.

My motivation for this suggestion is to reduce the learning burden on non-statisticians whose goal is to find a reliable answer to their research question. The current practice is to devote a considerable amount of teaching and learning effort on statistical tests that are redundant in the presence of disaggregate data sets and readily available tools to estimate Regression models.

Before I proceed any further, I must confess that I remain a huge fan of William Sealy Gosset who introduced the T-statistic in 1908. He excelled in intellect and academic generosity. Mr. Gosset published the very paper that introduced the t-statistic under a pseudonym, the Student. To this day, the T-test is known as the Student’s T-test.

My plea is to replace the Comparison of Means (T-test) with OLS Regression, which of course relies on the T-test. So, I am not necessarily asking for ditching the T-test but instead asking to replace the Comparison of Means Test with OLS Regression.

The following are my reasons:

1. Pedagogy related reasons:

a. Teaching Regression instead of other intermediary tests will save instructors considerable time that could be used to illustrate the same concepts with examples using Regression.

b. Given that there are fewer than 39 hours of lecture time available in a single semester introductory course on applied statistics, much of the course is consumed in covering statistical tests that would be redundant if one were to introduce Regression models sooner in the course.

c. Academic textbooks for undergraduate students in business, geography, psychology dedicate huge sections to a battery of tests that are redundant in the presence of Regression models.

i. Consider the widely used textbook Business Statistics by Ken Black that requires students and instructors to leaf through 500 pages before OLS Regression is introduced.

ii. The learning requirements of undergraduate and graduate students not enrolled in economics, mathematics, or statistics programs are quite different. Yet most textbooks and courses attempt to turn all students into professional statisticians.

2. Applied Analytics reasons

a. OLS Regression model with a continuous dependent variable and a dichotomous explanatory variable produces the exact same output as the standard Comparison of Means Test.

b. Extending the comparison to more than two groups is straightforward in Regression: the categorical explanatory variable simply comprises more than two groups.

i. In the traditional statistics teaching approach, one advises students that the T-test is not valid to compare the means for more than two groups and that we must switch to learning a new method, ANOVA.

ii. You might have caught my drift that I am also proposing to replace teaching ANOVA in introductory statistics courses with OLS Regression.

c. A Comparison of Means Test illustrated as a Regression model is much easier to explain than explaining the output from a conventional T-test.
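The equivalence claimed in point 2a can be checked in a few lines of R. The sketch below uses the built-in mtcars data (mpg by transmission type, am) rather than the teaching ratings data, so the numbers are only illustrative.

```r
# Equal-variance comparison of means: mpg for automatic (0) vs manual (1) cars
tt <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# The same comparison as an OLS regression on a two-level factor
fit <- lm(mpg ~ factor(am), data = mtcars)
reg <- coef(summary(fit))["factor(am)1", ]

# The t-statistics differ only in sign (the direction of the difference)
c(ttest_t = unname(tt$statistic), reg_t = unname(reg["t value"]))

# The p-values agree
c(ttest_p = tt$p.value, reg_p = unname(reg["Pr(>|t|)"]))
```

The regression coefficient on factor(am)1 is the difference in group means, which is the quantity the t-test evaluates.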

After introducing Normal and T-distributions, I would, therefore, argue that instructors should jump straight to Regression models.

Is Regression a valid substitute for T-tests?

In the following lines, I will illustrate that the output generated by OLS Regression models and a Comparison of Means test are identical. I will illustrate examples using Stata and R.

Dataset

I will use Professor Daniel Hamermesh’s data on teaching ratings to illustrate the concepts. In a popular paper, Professor Hamermesh and Amy Parker explore whether good looking professors receive higher teaching evaluations. The dataset comprises teaching evaluation score, beauty score, and instructor/course related metrics for 463 courses and is available for download in R, Stata, and Excel formats at:

https://sites.google.com/site/statsr4us/intro/software/rcmdr-1

Hypothetically Speaking

Let us test the hypothesis that the average beauty score for male and female instructors is statistically different. The average (normalized) beauty score was -0.084 for male instructors and 0.116 for female instructors. The following Box Plot illustrates the difference in beauty scores. The question we would like to answer is whether this apparent difference is statistically significant.

I first illustrate the Comparison of Means Test and OLS Regression assuming Equal Variances in Stata.

Assuming Equal Variances (Stata)

Download data

use "https://sites.google.com/site/statsr4us/intro/software/rcmdr-1/teachingratings.dta"

encode gender, gen(sex) // To convert a character variable into a numeric variable.

The T-test is conducted using:

ttest beauty, by(sex)

The above generates the following output:

As per the above estimates, the average beauty score of female instructors is 0.2 points higher and the t-test value is 2.7209. We can generate the same output by running an OLS regression model using the following command:

reg beauty i.sex

The regression model output is presented below.

Note that the average beauty score for male instructors is 0.2 points lower than that of females and the associated standard errors and t-values (highlighted in yellow) are identical to the ones reported in the Comparison of Means test.

Unequal Variances

But what about unequal variances? Let us first conduct the t-test using the following syntax:

ttest beauty, by(sex) unequal

The output is presented below:

Note the slight change in the standard error and the associated t-statistic.

To replicate the same results with a Regression model, we need to run a different Stata command that estimates a variance weighted least squares regression. Using Stata’s vwls command:

vwls beauty i.sex

Note that the last two outputs are identical.

Repeating the same analysis in R

To download data in R, use the following syntax:

url = "https://sites.google.com/site/statsr4us/intro/software/rcmdr-1/TeachingRatings.rda"

download.file(url,"TeachingRatings.rda")

load("TeachingRatings.rda")

For equal variances, the following syntax is used for T-test and the OLS regression model.

t.test(beauty ~ gender, var.equal = TRUE)

summary(lm(beauty ~ gender))

The above generates output identical to Stata's.
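For readers without access to the download, the equivalence can be verified on simulated data. This is a minimal sketch: the variables below only borrow the names beauty and gender and are randomly generated, so the numbers will not match the teaching ratings data.

```r
# Simulated stand-in for the teaching ratings data (values are hypothetical).
set.seed(1)
n <- 200
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
beauty <- rnorm(n, mean = ifelse(gender == "female", 0.1, -0.1), sd = 0.8)

# Equal-variance t-test and the matching OLS regression.
tt <- t.test(beauty ~ gender, var.equal = TRUE)
ols <- summary(lm(beauty ~ gender))$coefficients

t_ttest <- unname(tt$statistic)
t_ols <- ols["gendermale", "t value"]

# t.test() reports female minus male, lm() reports male minus female,
# so the two t-values differ only in sign.
all.equal(t_ttest, -t_ols)
all.equal(tt$p.value, ols["gendermale", "Pr(>|t|)"])
```

Both comparisons return TRUE: the t-statistics agree up to sign and the p-values agree, because both procedures use the same pooled variance and the same n - 2 degrees of freedom.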

For unequal variances, we need to install and load the nlme package to run a gls version of the variance weighted least squares Regression model.

t.test(beauty ~ gender)

install.packages("nlme")

library(nlme)

summary(gls(beauty ~ gender, weights=varIdent(form = ~ 1 | gender)))

The above generates the following output:

So there we have it: OLS Regression is an excellent substitute for the Comparison of Means test.
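The idea behind the variance-weighted regression can also be sketched in base R, without nlme. This is an illustration on simulated data, not the internals of gls or of Stata's vwls: each observation is weighted by the inverse of its group's sample variance, and the group variances are then treated as known. The data and all variable names below are hypothetical.

```r
# Simulated data with unequal group variances (values are hypothetical).
set.seed(7)
gender <- factor(rep(c("female", "male"), times = c(80, 120)))
beauty <- c(rnorm(80, 0.1, 1.0), rnorm(120, -0.1, 0.6))

# Welch (unequal-variance) t-test.
welch <- t.test(beauty ~ gender)

# Weight each observation by the inverse of its group's sample variance.
group_var <- tapply(beauty, gender, var)
w <- unname(1 / group_var[as.character(gender)])
wls <- lm(beauty ~ gender, weights = w)

# Treating the group variances as known, the coefficient variance is
# (X'WX)^{-1}, which summary.lm() stores as cov.unscaled.
est <- unname(coef(wls)["gendermale"])
se <- sqrt(summary(wls)$cov.unscaled["gendermale", "gendermale"])

c(estimate = est, se = se, t = est / se)

# Same t-statistic as the Welch test, up to the sign convention.
all.equal(est / se, -unname(welch$statistic))
```

The estimate, standard error, and t-statistic match the Welch test. The p-values can differ slightly, because the Welch test uses Satterthwaite's approximate degrees of freedom while regression-based routines use other reference distributions.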

When it comes to Amazon's HQ2, you should be careful what you wish for

2018-01-28T12:30:00.000-05:00

By Murtaza Haider and Stephen Moranis
Note: This article originally appeared in the Financial Post on January 25, 2018

Amazon.com Inc. has turned the search for a home for its second headquarters (HQ2) into an episode of The Bachelorette, with cities across North America trying to woo the online retailer.

The Seattle-based tech giant has narrowed down the choice to 20 cities, with Toronto being the only Canadian location in the running.

While many in Toronto, including its mayor, are hoping to be the ideal suitor for Amazon HQ2, one must be mindful of the challenges such a union may pose.

Amazon announced in September last year that its second headquarters will employ 50,000 high-earners with an average salary of US$100,000. It will also require 8 million square feet (SFT) of office and commercial space.

A capacity-constrained city with a perennial shortage of affordable housing and limited transport capacity, Toronto may be courting trouble by pursuing thousands of additional highly-paid workers. If you think housing prices and rents are unaffordable now, wait until the Amazon code warriors land to fight you for housing or a seat on the subway.

The tech giants do command a much more favourable view in North America than they do in Europe. Still, their reception varies, especially in the cities where these firms are domiciled. Consider San Francisco, which is home to not one but many tech giants and ever mushrooming startups. The city attracts high-earning tech talent from across the globe to staff innovative labs and R&D departments.

These highly paid workers routinely outbid locals and other workers in housing and other markets. No longer can one ask for a conditional sale offer that is subject to financing because a 20-something whiz kid will readily pay cash to push other bidders aside.

We wonder whether Toronto’s residents, or those of whichever city ultimately wins Amazon’s heart, will face the same competition from Amazon employees as do the residents of Seattle. The answer lies in the relative affordability gap.

Amazon employees with an average income of US$100,000 will compete against Toronto residents whose individual median income in 2015 was just $30,089. It is quite likely that the bidding wars that high-earning tech workers have won hands down in other cities will end in their favour in the city chosen for Amazon HQ2.

While we are mindful of the challenges that Amazon HQ2 may pose for a capacity-constrained Toronto, we are also alive to the opportunities it will present. For starters, Toronto can use 50,000 high-paying jobs.

GIG ECONOMY

The emergence of the gig economy has had an adverse impact in the City of Toronto, where the employment growth has largely concentrated in the part-time category. Between 2006 and 2016, full-time jobs grew by a mere 8.7 per cent in Toronto, while the number of part-time jobs grew at four times that rate.

While Toronto is the largest employment hub in Canada, with an inventory of roughly 180 million square feet, an influx of 8 million square feet of first-rate office space will improve the overall quality of commercial real estate in the city. It could also be a boon for office construction and a significant source of new property tax revenue for the city.

But those hoping the city itself might make money should seriously consider the fate of cities lucky enough to host the Olympics, which more often than not end up costing cities billions more than they budgeted for.

Toronto may still pursue Amazon HQ2, but it should do so with the full knowledge of its strengths and vulnerabilities. At the very least, it should create contingency plans to address the resulting infrastructure deficit (not just public transit) and housing affordability issues before it throws open its doors for Amazon.

Murtaza Haider is an associate professor at Ryerson University. Stephen Moranis is a real estate industry veteran. They can be reached at info@hmbulletin.com.

Did the cold weather put a chill on Toronto’s housing market?

2017-06-16T12:00:00.000-04:00

Toronto’s housing market took a dive in May. After years of record highs in housing sales and prices, the hype seems to have evaporated. While some link the slowdown to the Ontario government’s legislation to tighten lending in housing markets, one should also factor in the unusually cold, dark, and wet weather in May that felt more like a ‘May-be.’

Housing sales in the greater Toronto area (GTA) were down 23% last month from a year earlier. However, the average sales price was 14.8% higher than the price in May 2016. On a month-over-month basis, housing prices in May were down 6% from April.

The declining numbers have alarmed homebuyers, sellers, brokerages, and governments. Many are questioning if the Ontario government’s intervention has a more adverse impact than was intended. Homebuyers, who have not yet closed on properties, are wondering whether they have paid too much, while sellers are rushing to list properties to benefit from high housing prices that appear to be past their peak.

While it is too early to determine the ‘causal impact’ of the legislative changes introduced in April, which included a 15% tax on foreign home buyers, one must also consider other mitigating factors that might have affected Toronto’s housing market. We must even consider the influence of the weather.

The unusually cold weather in May might have had a chilling effect on housing sales. Typically, housing markets start to heat up in April, in sync with rising temperatures. May 2017 was unusually wet. Toronto received a total of 157 mm of precipitation last month compared to 25 mm a year ago. The unusually high rainfall caused flooding all across Ontario. In downtown Toronto, Lake Ontario water rushed into lakefront condos. At the same time, May 2017 was cooler than May of last year. The average temperature last month was 12 degrees Celsius compared to 16 degrees in May 2016. May was also unusually dark with much less sunshine. This trend, however, had persisted since January 2017, when Toronto received a mere 50 hours of sunlight compared to the seasonal average of 85 hours.

So why should unusually cold, dark, and wet weather have any impact on housing markets? Research has shown that weather and atmosphere influence consumer behavior. Retail experts call this phenomenon ‘store atmospherics’ where a store’s environment is altered to enhance consumer behavior that may promote sales. It applies to housing markets as well. Researchers discovered that adverse weather has a significant, yet short-term, effect on economic activity. Writing in Real Estate Economics, John Goodman Jr. found a slight adverse impact of unseasonable weather on housing markets. In related work, researchers found that sale prices of homes with central air-conditioning and swimming pools are higher for sales recorded in summer months.

There are other factors to consider in assessing the market dip. The Ontario government’s regulations to tighten housing markets could have encouraged some homebuyers to advance their purchase to avoid uncertainty. The government’s plans to impose new restrictions on housing markets were known in advance of their announcement in April. Investors are risk and uncertainty averse. Hence some homebuyers could have advanced their purchase to March when the sales unexpectedly jumped by 50% over February 2017. As for those who could not advance their purchase to March, they may have decided to sit through this confusion and wait for calmer markets to prevail.

In earlier research, we documented a similar trend for housing sales in Toronto, when sales escalated in 2007 in advance of Toronto’s new land transfer tax, which was implemented in February 2008. The additional sales recorded in 2007 meant that fewer sales were realized in 2008. The sales activity returned to the long-term trends in a couple of years.

And if this was not enough, financial troubles at the alternative mortgage lender, Home Capital, spooked borrowers who were not deemed mortgage worthy by the mainstream Canadian banks. Many real estate professionals believe the cumulative effect of unseasonal weather, tightening of mortgage regulations, and troubles at alternative lenders was likely the reason behind the declining housing sales and prices.

The roof is not collapsing on Toronto’s housing market. The decline in sales and prices is a rational response by homebuyers and sellers who are reacting to the Ontario government’s initiatives to tighten lending in housing markets. The cold, dark, and wet weather certainly did not help either.

Data Science 101, now online

2016-09-14T21:09:00.000-04:00

We are delighted to note that IBM's BigDataUniversity.com has launched the quintessential introductory course on data science aptly named Data Science 101.

The target audience for the course is the uninitiated cohort that is curious about data science and would like to take the baby steps to a career in data and analytics. Needless to say, the course is for absolute beginners.

To get a taste of the course, watch the following video, "What is Data Science?"

Here is the curriculum:

Module 1 - Defining Data Science
- What is data science?
- There are many paths to data science
- Any advice for a new data scientist?
- What is the cloud?
- "Data Science: The Sexiest Job in the 21st Century"
Module 2 - What do data science people do?
- A day in the life of a data science person
- R versus Python?
- Data science tools and technology
- "Regression"
Module 3 - Data Science in Business
- How should companies get started in data science?
- R versus Python
- Tips for recruiting data science people
- "The Final Deliverable"
Module 4 - Use Cases for Data Science
- Applications for data science
- "The Report Structure"
Module 5 - Data Science People
- Things data science people say
- "What Makes Someone a Data Scientist?"

Want to learn more about IBM's Big Data University, Click HERE.

The X-Factors: Where 0 means 1

2016-09-01T16:30:00.006-04:00

Hadley Wickham in a recent blog post mentioned that "Factors have a bad rap in R because they often turn up when you don’t want them." I believe Factors are an even bigger concern. They not only turn up where you don't want them, but they also turn things around when you don't want them to.

Consider the following example where I present a data set with two variables: x and y. I represent age in years as 'y' and gender as a binary (0/1) variable as 'x' where 1 represents males.

I compute the means for the two variables as follows:

The average age is 43.6 years, and 0.454 suggests that 45.4% of the sample comprises males. So far so good.

Now let's see what happens when I convert x into a factor variable using the following syntax:

The above code adds a new variable male to the data set, and assigns labels female and male to the categories 0 and 1 respectively.

I compute the average age for males and females as follows:

See what happens when I try to compute the mean for the variable 'male'.

Once you factor a variable, you can't compute statistics such as mean or standard deviation. To do so, you need to declare the factor variable as numeric. I create a new variable gender that converts the male variable to a numeric one.

I recompute the means below.

Note that the average for males is 1.45 and not 0.45. Why? When we created the factor variable, it turned zeros into ones and ones into twos. Let's look at the data set below:

Several algorithms in R expect the factor variable to be of 0/1 form. If this condition is not satisfied, the command returns an error. For instance, when I try to estimate the logit model with gender as the dependent variable and y as the explanatory variable, R generates the following error:

Factor or no factor, I would prefer my zeros to stay as zeros!
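The behaviour described above can be reproduced with a minimal sketch. The ten-row data frame below is a hypothetical toy example, not the data set used in this post; the variable names x, y, male, and gender follow the text.

```r
# Hypothetical toy data: x is a 0/1 indicator (1 = male), y is age in years.
df <- data.frame(x = c(0, 1, 1, 0, 1, 0, 0, 1, 0, 0),
                 y = c(34, 51, 37, 29, 62, 40, 48, 45, 44, 36))

# Label the 0/1 categories as a factor.
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# as.numeric() on a factor returns the internal level codes (1, 2), not 0/1.
df$gender <- as.numeric(df$male)
mean(df$x)       # 0.4
mean(df$gender)  # 1.4

# A binomial glm rejects a numeric response coded 1/2; keep the error message.
err <- tryCatch(glm(gender ~ y, family = binomial, data = df),
                error = function(e) conditionMessage(e))
err

# Two ways to recover the 0/1 coding.
back_from_codes <- as.numeric(df$male) - 1
back_from_test <- as.integer(df$male == "male")
identical(back_from_codes, df$x)

# With the 0/1 coding restored, the logit model is estimated without error.
fit_ok <- glm(back_from_codes ~ y, family = binomial, data = df)
```

Subtracting 1 from the level codes works here only because the factor has exactly two levels in the order female, male. Comparing against the label, as in back_from_test, does not depend on the level order.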

Five Questions about Data Science

2016-08-22T01:57:00.000-04:00

From Safari Books Online (https://www.safaribooksonline.com/blog/2016/02/10/data-science-qa/)

---

Recently, we were able to ask five questions of Murtaza Haider, about the new book from IBM Press called “Getting Started with Data Science: Making Sense of Data with Analytics.” Below, the author talks about the benefits of data science in today’s professional world.

1. What are some examples of data science altering or impacting traditional professional roles already?

Only a few years ago there did not exist a job with the title Chief Data Scientist. But that was then. Small and large corporations, and increasingly government agencies, are putting together teams of data scientists and analysts under the leadership of Chief Data Scientists. Even the White House has a Chief Data Scientist position, currently held by Dr. DJ Patil.

The traditional role for those who analyzed data was that of a computer programmer or a statistician. In the past, firms collected large amounts of data to archive rather than to subject it to analytics to assist with smart decision-making. Companies did not see value in turning data into insights and instead relied on the gut feeling of managers and anecdotal evidence to make decisions.

Big data and analytics have alerted businesses and governments to the latent potential of turning bits and bytes into profits. To enable this transformation, hundreds of thousands of data scientists and analysts are needed. Recent reports suggest that the shortage of such professionals will be in the millions. No wonder we see hundreds of postings for data scientists on LinkedIn.

As businesses increasingly depend upon analytics-driven decision making, data scientists and analysts are simultaneously becoming front-office superstars, which is quite a change from them being the back office workers in the past.

2. What steps can a professional take today to learn how and why to implement data science into their current role?

Sooner rather than later, workers will find their managers asking them to assume additional responsibilities that would involve dealing with data, and either generating or consuming analytics. Smart professionals, who are uninitiated in data science, would therefore proactively address this shortcoming in their portfolio by acquiring skills in data science and analytics. Fortunately, in a world awash with data, the opportunities to acquire analytic skills are also ubiquitous.

For starters, professionals should consider enrolling in open online courses offered by the likes of Coursera and BigDataUniversity.com. These platforms offer a wide variety of training opportunities for beginners and advanced users of data and analytics. At the same time, most of these offerings are free.

For those professionals who would like to pursue a more structured approach, I suggest that they consider continuing education programs offered by the local universities focusing on data and analytics. While working full-time, the professionals can take part-time courses in data science to fill the gap in their learning and be ready to embrace impending change in their roles.

3. Do you need programming experience to get started in data? What kind of methods and techniques can you utilize in a program more commonly used, such as Excel?

Computer programming skills are a definite plus for data scientists, but they are certainly not a limiting factor that would prevent those trained in other disciplines from joining the world of data scientists. In my book, Getting Started with Data Science, I mentioned examples of individuals who took short courses in data science and programming after graduating from non-empirical disciplines, and subsequently were hired in data scientist roles that paid lucrative salaries.

The choice of analytics tools depends largely on the discipline and the type of organization you are currently working for or intend to work for in the future. If you intend to work for corporations that generate real big data, such as telecom and Internet-based establishments, you need to be proficient in big data tools, such as Spark and Hadoop. If you would like to be employed in the industry that tracks social media, you would require skills in natural language processing and proficiency in languages such as Python. If you happen to be interested in a traditional market research firm, you need proficiency in analytics software, such as SPSS and R.

If your focus is on small and medium size enterprises, proficiency in Excel could be a great asset, which would allow you to deploy its analytics capabilities, such as Pivot Tables, to work with small sized data.

A successful data scientist is one who knows some programming, has a basic understanding of statistical principles, possesses a curious mind, and is capable of telling great stories. I argue that without storytelling capabilities, a data scientist will be limited in his or her ability to become a leader in the field.

4. How do you see data science affecting education and training moving forward? What benefits will it bring to learning at all levels?

Schools, colleges, universities and others involved in education and learning are putting big data and analytics to good use. Universities are crunching large amounts of data to determine what gaps in learning at the high school level act as impediments to success in the future. Schools are improving not just curriculum, but also other strategies to improve learning outcomes. For instance, research in India using large amounts of data showed that when children in low-income communities were offered free meals at school, their dropout rates declined and their academic achievements improved.

Big data and analytics provide instructors and administrators the opportunity to test their hypotheses about what works and what doesn’t in learning, and replace anecdotes with hard evidence to improve pedagogy and learning. Learning has taken a new shape and form with open online courses in all disciplines. These transformative changes in learning have been enabled by advances in information and communication technologies, and the ability to store massive amounts of data.

5. Do you think that modern governments and societies are prepared for the changes that big data and data science might bring to the world?

Change is inevitable. Whatever modern governments and societies may prefer, they will have to embrace change. Fortunately, smart governments and societies have already embraced data-driven decision-making and evidence-based planning. Governments in developing countries are already using data and analytics to devise effective poverty-reducing strategies. Municipal governments in developed economies are using data and advanced analytics to find solutions to traffic congestion. Research in health and well-being is leveraging big data to discover new medicines and cures for illnesses that challenge us all.

As societies embrace data and analytics as tools to engineer prosperity and well-being, our collective abilities to achieve a better tomorrow will be further enhanced.

So you want to be a data scientist

2016-08-10T17:40:00.002-04:00

From HuffingtonPost

The New York Times made it look so easy. Take a few courses in data science and a web-based startup will readily pay top dollars for your newly acquired skills.

Since the McKinsey Global Institute reported on the impending shortage of data crunchers, wannabe data scientists have been searching for learning opportunities in big data analytics. Newspaper coverage suggests that even with limited previous exposure to empirics, one may enroll in MOOCs or join programming boot camps to establish one's bona fides.

In a recent blog on Forbes.com, Meta S. Brown, the author of Data Mining for Dummies, gave four reasons not to get an advanced degree in data science. I, on the other hand, believe that a structured learning environment is exactly what many need to enable the career change they have contemplated for years but have not acted on.

It all depends upon what kind of a learner you are. If you are a disciplined, self-motivated, self-actuated individual, you can pick up the skills by attending MOOCs or participating in coding boot camps.

But if you are like the rest of us, who once enrolled in a free online course, but didn't complete it, you need some structure. A degree or a certificate in data science or business analytics is exactly what you need to upgrade your skills and be part of the network that will help you reorient your career.

In my book, Getting Started with Data Science, I mentioned Paul Minton, who was making $20,000 serving tables in New York. However, a three-month programming course at the Zipfian Academy turned his life around. He earned over $100,000 in 2014 as a data scientist for a web startup in San Francisco. "Six figures, right off the bat ... To me, it was astonishing," he told The New York Times.

When aspiring data scientists think of a career in the 'glamorous' world of big data and analytics, they think of Mr. Minton. His story, though a bit Cinderella-ish, is true, but rare. He works for Change.org! However, not everyone should expect a similar outcome. In addition to good fortune, Mr. Minton had majored in math in his undergraduate training, and we all know that math helps. It would be unwise, however, to assume that with almost no empirical background, one can master the complex world of data and algorithms in a matter of a few weeks and be gainfully employed.

While speaking at meet-ups organized by IBM's BigDataUniversity, I encounter dozens of enthusiasts who are keen to start training in data science but do not know where to begin. I advise them to build on their core competencies and domain knowledge. For instance, if you studied journalism or creative writing as an undergraduate, you might want to learn how to analyze socioeconomic data instead of trying to set up Hadoop clusters, a big data task best left to computer scientists and engineers.

If you are a disciplined learner, you can explore data science training offered as MOOCs. Coursera, one of the largest MOOC platforms, listed several data science courses among the top 10 most popular courses in 2015. IBM's Big Data University (BDU) is another platform dedicated to promoting training in data science and analytics. Not only does BDU offer online learning resources similar to those of other platforms, it also offers cloud-based resources for hands-on training through the Data Scientist Workbench.

The Workbench provides the state-of-the-art computing solutions for regular-sized data. These include R, Python, and OpenRefine. To wrangle big data, the Workbench offers Hadoop and Spark-based solutions. Such coupling of computing infrastructure with online learning resources frees the new learners from the concerns about installing and maintaining software and clustering hardware.

Learners who would prefer a structured learning environment also have several options. They can register for courses or certificates offered by universities' continuing education faculties, enroll in an online graduate degree in data science, or take a more traditional approach of enrolling in a full- or part-time Master's program.

A good place to search for learning opportunities is the KDNuggets website that maintains detailed lists of post-graduate programs in data science including full-time, part-time, and online masters and other certifications.

Once you have earned some credentials, you still have to prove your worth to future employers. If you are making a switch from another career, your experience may not be of much use in your pursuits in the data-centric world. My advice to the novice data scientists lacking experience is to ask the potential employer not necessarily for a job, but instead for a data set and a puzzle. If you can solve a data-oriented problem for a firm as part of the vetting process, you can overcome the shortcomings in your résumé.

Those who are still on the fence about taking the plunge into the world of big data and analytics should know that the demand for data scientists far exceeds the capacity of the universities and colleges to produce them. This is unlikely to change soon. Act now and embrace data.

Book Review: Getting Started With Data Science

2016-07-27T14:56:00.000-04:00

I Programmer's Kay Ewbank reviews Getting Started with Data Science: Making Sense of Data with Analytics.

By Kay Ewbank

If you've enjoyed books such as Freakonomics or Outliers, you'll feel at home reading this book as it uses a similar approach; take an interesting question such as 'Does the higher price of cigarettes deter smoking?', and use that as the basis for some data analysis.

The aim is to teach you how to do your own analyses. Haider works through the examples in R, Stata, SPSS and SAS. Within the book the examples are worked mainly in R, and one of the other languages. The code for the other languages is available for download from the IBM Press website, along with details of how to use it.

The book opens with a chapter called 'the bazaar of storytellers' that discusses what data science is and gives the author's definition of a data scientist. The next chapter, data in the 24/7 connected world, identifies sources of data that you can analyse, and also introduces the concept of big data. Chapter three looks at how data becomes meaningful when it is used as the basis for 'stories'. Haider's view is that the strength of data science lies in the power of the narrative, and that is what underpins most of the book.

From a practical perspective, the book begins to get useful in chapter four, which looks at how you can generate summary tables, including multi-dimensional tables. Next is a chapter on graphics and how to generate them. If you're thinking that it seems a bit odd to concentrate on the 'end result' first, you have to remember that the author's view is that data analysis is only useful if your audience actually looks at the results and understands them.

The next chapter gets more into the workings of data analysis with an examination of hypothesis testing using techniques such as t-tests and correlation analysis. Regression analysis is looked at next, based on the notion of "why tall parents don't have even taller children". This is a fun chapter, with examples including consumer spending on food and alcohol, housing markets, and whether the appearance of teachers affects their evaluations by students.

A chapter on analysis of binary variables considers logit and probit models using data from New York transit use. Categorical data and multinomial variables are the topic of the next chapter, which expands on the ideas of logit models.

Spatial data analysis is covered next, taking us into the use of GIS systems and how these have expanded the options for data analysis. There's a good chapter on time series analysis looking at how regression models can be used with time series data, using the examples of forecasting housing markets.

The final chapter introduces the field of data mining. It's more of a taster discussing some of the techniques that can be used, but fun anyway.

Overall, this is a book that is accessible, interesting and still manages to introduce the statistical techniques you need to use for real data analytical work. A good way to get into data analysis.

The collaborative innovation landscape in data science

2016-07-24T19:41:00.000-04:00

Computing platforms should be like Lego. That is, they should provide the fundamental building blocks and enable the users' imagination to innovate. The latest issue of Stata Journal exemplifies how Stata and, by the same account, R provide the platform for the users to innovate beyond the innate capacity of the core group responsible for software development.

Earlier in July, I received in an email the table of contents for the Stata Journal’s latest issue. I was expecting to see one or maybe two articles of interest. What I found was quite surprising. I was intrigued by almost every article, which made me wonder if I had lost my academic focus so that almost anything is now of interest?

As I browsed through the journal, I noticed that the authors contributing to the journal were truly international. From academic colleagues in Germany and the United States to colleagues working for central banks in Europe, the diversity was hard to ignore. And that’s where I spotted the apparent similarity between R and Stata. Even though Stata is a proprietary computing platform, the innovation landscape is not restricted to the core team at Stata. This is similar to the R environment where literally thousands of packages (algorithms) for R are contributed by independent researchers.

For R, such a collaborative ecosystem comes naturally because R is free software. Stata, on the other hand, follows a more traditional market approach of charging for the use of the software. Yet, both Stata and R are able to attract leading data scientists (my preferred term for statisticians, econometricians, and others) who volunteer their expertise and readily share their innovations with the larger community.

Returning to the latest issue, I was first attracted to the article on assessing inequality using percentile shares. As the author, Ben Jann from the University of Bern, noted, income inequality has come to the forefront of academic and social discourse since the publication of Thomas Piketty’s Capital in the Twenty-First Century. I have been intrigued by the topic for years, primarily influenced by the incredible works of Joseph Stiglitz, Angus Deaton, and others. Piketty’s Capital, despite the criticism (watch Deirdre McCloskey’s careful, yet blunt, review of Capital), has made percentile shares a familiar tool for analyzing distributional inequalities.

Ben Jann has contributed pshare to Stata, a command that readily estimates percentile shares with the convenience of a single-line syntax. Using data from the 1988 US National Longitudinal Survey of Young Women, the command easily computes the income distribution, showing that the top 10 percent of women received 27% of the wages.

For R users, I would recommend the ineq package by Achim Zeileis and Christian Kleiber for generating inequality and poverty indices.
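To make the idea of a percentile share concrete, here is a minimal base R sketch. It uses simulated wages (hypothetical data, not the survey mentioned above), and the helper names top_share() and decile_shares() are my own inventions for illustration; they are not part of pshare or ineq, and this is not how either package is implemented.

```r
# Simulated wages (hypothetical data, NOT the survey used in the article).
set.seed(42)
wages <- rlnorm(2000, meanlog = 2, sdlog = 0.6)

# top_share() is an illustrative helper, not part of pshare or ineq.
# It returns the fraction of the total that accrues to the top `p` of earners.
top_share <- function(x, p = 0.10) {
  x <- sort(x[!is.na(x)], decreasing = TRUE)
  k <- ceiling(p * length(x))   # number of observations in the top group
  sum(x[1:k]) / sum(x)
}

# decile_shares() splits the sorted values into ten equal-sized groups
# and returns each group's share of the total.
decile_shares <- function(x) {
  x <- sort(x[!is.na(x)])
  grp <- ceiling(seq_along(x) * 10 / length(x))
  tapply(x, grp, sum) / sum(x)
}

top_share(wages, 0.10)            # share of all wages held by the top 10 percent
round(decile_shares(wages), 3)    # shares for deciles 1 (bottom) to 10 (top)
```

A top-decile share of 0.27 would correspond to the "top 10 percent received 27%" statement above. With perfectly equal wages every decile share would be 0.10.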

My primary area of interest lies at the intersection of real estate and transportation in urban settings. I am always struggling with how location impacts rents, growth, and other socio-economic outcomes. Determining the location or, for that matter, distances between entities is usually a struggle. Thanks to GIS software, such as QGIS, MapInfo, and Maptitude, the task of spatial computation has become a lot easier. Still, one has to become proficient on several computing platforms to obtain distances or travel times to and from locations. Stata offers two interesting solutions for these tasks. The more recent one is reported in the latest issue. Stephan Huber and Christoph Rust from the University of Regensburg have contributed a new command that computes network distances (not just the straight-line Euclidean distances) and network travel times for the shortest paths, relying on the Open Source Routing Machine and OpenStreetMap.

Earlier in 2011, Adam Ozimek and Daniel Miles contributed commands to geocode and compute travel times between origins and destinations for different modes of travel, including public transit.

R is equally equipped for similar tasks. Timothée Giraud, Robin Cura, and Matthieu Viry programmed an R package, osrm, to determine travel times and distances. Other R packages include gdistance and gmapsdistance, to name a few.

In summary, I remain delightedly optimistic about the future of both open source and proprietary computing platforms. Altruism is the name of the game: thousands of innovators are making their generous contributions available for the larger benefit of society, making it easier for applied data scientists to satisfy their curiosity by applying readily available algorithms to solve riddles.

Data Science Boot Camp completed at Ryerson University

2016-07-18T04:10:00.001-04:00

I am pleased to update you on the Data Science Boot Camp we ran at the Ted Rogers School of Management at Ryerson University in Toronto in collaboration with IBM’s www.BigDataUniversity.com. The 9-week long Boot Camp concluded on July 15.

We received a total of 1,137 registrations, and the attendance ranged between 100 and 150 participants each week.

I have made the resources (software codes, PowerPoints, etc.) available online at https://sites.google.com/site/statsr4us/workshops/datascience. We recorded 24 hours of video, which will be online soon.

I restricted the hands-on training to R, hence the Boot Camp serves as an introduction to analytics with R. You are welcome to share these resources.

A breakdown of the weekly schedule is provided in the following hyperlinked list:

Week 0: Introduction & logistics

Week 1: Head first into data

Week 2: Data, data, & data

Week 3: The Grandfather of All Analytics

Week 4: Rinse & Repeat

Week 5: The Economics of Infidelity

Week 6: Time Series is Money

Week 7: Housing Bubbles

Week 8: Choice Metrics

Data Science Boot Camp

2016-05-12T13:03:00.000-04:00

If you live in or near Toronto, are interested in learning about data science, and can spare Friday afternoons, then you are in luck. I am offering a Data Science Boot Camp at Ryerson University in collaboration with IBM's BigDataUniversity.com.

The Boot Camp is largely based on the contents of my recently published book, Getting Started with Data Science: Making Sense of Data with Analytics. You can read more about the book by clicking here.

Logistical details:

When: Fridays (2:00 - 5:00 pm)
Where: 55 Dundas Street West, Toronto, 9th floor, Room 3-109
Ted Rogers School of Management, Ryerson University
Cost: Free (Courtesy Ryerson University & BigDataUniversity)
Starting on: May 13 for introductions. Actual launch is on May 20.
Spaces: I'd like to cap enrollment at 15.
Registration: Email us or use Registration Form at BigDataUniversity.
Prerequisites: Curiosity, high-school math, prescribed book, a laptop computer, and willingness to learn R.

BigDataUniversity will live stream the sessions for those who are interested in the topic but unable to attend.

Tentative Schedule

May 13, 2016 - Introductions, software details, and logistical details.

Week 1 - Taking the first step

Detailed hands-on examples of analytics to understand what you will be able to accomplish by the end of the boot camp.

Week 2 - Data: Its shapes, sizes, and formats
Week 3 - Regression: The tool that fixes everything, or almost everything.

Applied analytics with teaching evaluations.
Do good-looking instructors get higher teaching evaluations?

Week 4 - Correlations, causations, and manufactured facts
Week 5 - Aerobics with data: Taming your data to meet your needs.
Week 6 - Time is money: Analytics with time series data.
Week 7 - Case study 1:

Are women who lack health insurance from their spouse’s employer more likely to work full-time?

Week 8 - Case Study 2:

Do higher taxes result in lower cigarette sales? Did Land Transfer Tax impact housing sales in Toronto?

Week 9 - Case Study 3:

To smoke or not to smoke: that is the question.

Week 10 - Case study 4:

Is space the new frontier? Map it to know it.

Getting Started with Data Science: Storytelling with Data

2016-01-13T14:01:00.001-05:00

Earlier this month, IBM Press and Pearson published my book titled Getting Started with Data Science: Making Sense of Data with Analytics. You can download sample pages, including a complete chapter. There are 104 pages in the sample. You can also watch a brief interview about the book recorded earlier at the IBM Insight2015 Conference.

The very purpose of authoring this book was to rethink the way we have been teaching statistics and analytics to students and practitioners. It is no secret that most students required to take the mandatory stats course dislike it. I believe this has more to do with the way we have been teaching the subject than with the aptitude of our students. Furthermore, I believe there is a greater opportunity to equip the students with the skills needed in a world awash with data where competing on analytics defines the real competitive advantage.

No wonder, the latest issue of the leading publication on the subject, The American Statistician, is dedicated to reimagining how statistics should be taught in the undergraduate curriculum. The editors noted:

“We hope that this collection of articles as well as the online discussion provide useful fodder for further review, assessment, and continuous improvement of the undergraduate statistics curriculum that will allow the next generation to take a leadership role by making decisions using data in the increasingly complex world that they will inhabit.”

I am confident that my book will do its small part in equipping the next generation of students with the kind of skills needed to succeed in a data-centric world. For one, I have taken a storytelling approach to statistics. This book reinforces the point that data science and analytics training should be applied rather than theoretical, and the ultimate purpose of producing or consuming statistical analysis is to tell fascinating stories from it. Therefore, the book opens with the chapter titled, The Bazaar of Storytellers.

Who is this book for?

While the world is awash with large volumes of data, inexpensive computing power, and vast amounts of digital storage, the skilled workforce capable of analyzing data and interpreting it is in short supply. A 2011 McKinsey Global Institute report suggests that “the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.”

Getting Started with Data Science (GSDS) is a purpose-written book targeted at those professionals who are tasked with analytics but do not yet have the comfort level needed to be proficient in data-driven analytics. GSDS appeals to those students who are frustrated with the impractical nature of the prescribed textbooks and are looking for an affordable text to serve as a long-term reference. GSDS embraces the 24-7 streaming of data and is structured for those users who have access to data and software of their choice, but do not know what methods to use, how to interpret the results, and most importantly how to communicate findings as reports and presentations in print or on-line.

GSDS is a resource for millions employed in knowledge-driven industries where workers are increasingly expected to facilitate smart decision-making using up-to-date information that sometimes takes the form of continuously updating data.

At the same time, the learning-by-doing approach in the book is equally suited for independent study by senior undergraduate and graduate students who are expected to conduct independent research for their coursework or dissertations.

Praise for the book

I am also pleased to share with you the praise for my book by Dr. Munir Sheikh, Canada’s former chief statistician:

“The power of data, evidence, and analytics in improving decision-making for individuals, businesses, and governments is well known and well documented. However, there is a huge gap in the availability of material for those who should use data, evidence, and analytics but do not know how. This fascinating book plugs this gap, and I highly recommend it to those who know this field and those who want to learn.”

— Munir A. Sheikh, Ph.D., Distinguished Fellow and Adjunct Professor at Queen’s University

Tom Davenport, author of the bestselling books Competing on Analytics and Big Data @ Work, has the following to say about my book:

“A coauthor and I once wrote that data scientists held ‘the sexiest job of the 21st century.’ This was not because of their inherent sex appeal, but because of their scarcity and value to organizations. This book may reduce the scarcity of data scientists, but it will certainly increase their value. It teaches many things, but most importantly it teaches how to tell a story with data.”

—Thomas H. Davenport, Distinguished Professor, Babson College; Research Fellow, MIT.

Dr. Patrick Surry, Chief Data Scientist at www.Hopper.com had the following to say:

“This book addresses the key challenge facing data science today, that of bridging the gap between analytics and business value. Too many writers dive immediately into the details of specific statistical methods or technologies, without focusing on this bigger picture. In contrast, Haider identifies the central role of narrative in delivering real value from big data.

“The successful data scientist has the ability to translate between business goals and statistical approaches, identify appropriate deliverables, and communicate them in a compelling and comprehensible way that drives meaningful action. To paraphrase Tukey, ‘Far better an approximate answer to the right question, than an exact answer to a wrong one.’ Haider’s book never loses sight of this central tenet and uses many real-world examples to guide the reader through the broad range of skills, techniques, and tools needed to succeed in practical data-science.

“Highly recommended to anyone looking to get started or broaden their skillset in this fast-growing field.”

And finally, Professor Atif Mian, author of the best-selling book: The House of Debt offered the following assessment:

“We have produced more data in the last two years than all of human history combined. Whether you are in business, government, academia, or journalism, the future belongs to those who can analyze these data intelligently. This book is a superb introduction to data analytics, a must-read for anyone contemplating how to integrate big data into their everyday decision making.”

— Professor Atif Mian, Theodore A. Wells ’29 Professor of Economics and Public Affairs,
Princeton University; Director of the Julis-Rabinowitz Center for Public Policy and Finance at the Woodrow Wilson School.

Not so sweet sixteen!

2015-12-06T17:19:00.000-05:00

In the world of big data and real-time analytics, Microsoft users are still living with the constraints of the bygone days of little data and basic numeracy.

If you happen to use Microsoft Excel for running regressions, you will soon realize your limits: The Windows version of Excel 2013 permits no more than 16 explanatory variables.

Excel has made great progress in expanding its capabilities in the recent past. Unlike the few thousand rows in the past, the current version permits about a million rows per Sheet (a single data set). But when it comes to regression, even though you may have several thousand observations in the data set, you are still limited by a hard constraint of sixteen explanatory variables.

Some would argue that for parsimony, we should be content with the restriction. True, but with categorical variables, each of which enters a regression as a set of dummy variables, the number of explanatory variables quickly stretches beyond the artificial constraint set by Microsoft Excel.
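A quick way to see how categorical variables inflate the count is to look at the design matrix a regression actually uses. The sketch below is in base R with simulated data; the variable names (income, province, industry) and the numbers of categories are hypothetical choices made only for illustration.

```r
# Hypothetical data: one numeric predictor and two categorical ones.
set.seed(1)
n <- 500
provs <- paste0("P", 1:10)   # 10 made-up categories
inds  <- paste0("I", 1:12)   # 12 made-up categories
d <- data.frame(
  income   = rnorm(n, mean = 50, sd = 10),
  province = factor(sample(provs, n, replace = TRUE), levels = provs),
  industry = factor(sample(inds,  n, replace = TRUE), levels = inds)
)
d$y <- 2 + 0.5 * d$income + rnorm(n)

# model.matrix() shows the columns a regression actually uses:
# each factor with k levels contributes k - 1 dummy variables.
X <- model.matrix(y ~ income + province + industry, data = d)
ncol(X) - 1   # 21 explanatory columns (1 + 9 + 11), excluding the intercept

# lm() estimates the model without any cap on the number of columns.
fit <- lm(y ~ income + province + industry, data = d)
length(coef(fit))   # 22 coefficients, including the intercept
```

Three variables on paper therefore become 21 explanatory columns, which is already past a sixteen-variable ceiling.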

Others might ask why anyone would do statistical analyses in Excel in the first place. Despite the inherent limitations in Microsoft Excel, business schools in particular and other social science undergraduate programs in general, are increasingly turning to Excel to teach courses in statistics. If you were to take a quick look at the curriculum of the undergraduate business and numerous MBA programs, you would realize how widespread the use of Excel is for courses in statistics and analytics.

At Ryerson University, I switched to R years ago for my MBA courses. Thanks to John Fox’s R Commander, the transition to R was without much hassle. The students were told in the very beginning that they were now part of the big league, and hiding behind spreadsheets was no longer an option.

I must mention that Microsoft Excel continues to be my platform of choice for a variety of tasks. I use Excel several times a day, but not for statistical analysis. I am not suggesting that Excel cannot do statistics; I am arguing that it can do a much better job of it.

As I see it, Microsoft has several options. The first is to do nothing. After all, Microsoft Excel has no real competition in the Windows environment. Second, it could turn to the team that has programmed the linest function in Excel and ask them to add some muscle to it. That would be the wrong approach.

Instead, Microsoft should explore ways to integrate R or other free software with Excel to add a complete analytics menu. Microsoft should learn from what the leaders in analytics are already doing. SPSS, an industry leader in the analytics category, has already integrated R, allowing SPSS users to merge the robust data management strengths of SPSS with the state-of-the-art analytics bundled with R. SAS, another big name in analytics, is about to do the same.

And since Microsoft has recently acquired Revolution R, it makes even more sense to build a bridge between Excel and Revolution R Open (RRO).

R Through Excel is one example of integrating R with Excel. If Microsoft were to put its weight behind the initiative, it could build a seamless coupling with R, expanding the analytic capabilities for hundreds of millions of Excel users.

As for SPSS, I recommend that its developers also consider another option. If Microsoft were to integrate RRO with Excel, SPSS could respond by acquiring an advanced analytics package and integrating it with SPSS. For this option, I would recommend Limdep, which I have found to be the most diverse software for statistical analysis and econometrics. Even though R is a collective effort of thousands of software developers, Limdep offers numerous routines and post-estimation options that are not available in the thousands of R packages. SPSS integrated with Limdep could become the most diversely capable commercial software in the market as it would bridge the gap with SAS and Stata.

As for the colleagues in business faculties pondering over what platform to adopt for the analytics/software courses, I would say know your limits, especially with Microsoft Excel while deciding upon the curriculum.

Curious about big data in Montreal?

2015-10-30T11:10:00.000-04:00

Are you in Montreal and curious about big data? Well here is your chance to attend a session about the same at Concordia University on Tuesday, Nov. 03 at 6:00 pm.

www.BigDataUniversity.com, which is an IBM-led initiative, is running meetups across North America to create awareness about, and training in, big data analytics.

BigDataUniversity runs MOOCs and, through its online data scientist workbench, provides access to Python, R, and even Spark. Also, you can learn about Watson Analytics and see how you can work with the state-of-the-art in computing.

Further details are available at:

Getting started with Data Science and Introduction to Watson Analytics

http://www.meetup.com/YUL-Social-Mobile-Analytics-Cloud-Meetup/

When: Tuesday, November 3rd at 6-9 PM

Where: H1269, 12th floor of the Hall Bldg

(1455, blvd. De Maisonneuve ouest - Metro Guy-Concordia)

Are Canadian newspapers painting false pictures with data?

2015-05-20T15:35:00.000-04:00

The Canadian newspaper, Globe and Mail, is a leader in diction and style, but it may need improvement in the ‘grammar of graphics’.

The Globe’s recent depiction of metropolitan economic growth in the series Off the Charts was way off the mark. The chart plotted the current and forecasted GDP growth rates for select cities in Canada. The red-coloured upward sloping lines depicted cities with increasing economic growth rates and the grey-coloured downward sloping lines highlighted those with slowing economic growth.

There is, however, a small problem. The chart erroneously showed some slowing economies as growing and vice versa. Furthermore, the trajectory of the sloping lines would mislead the readers to assume that cities with parallel lines enjoyed a similar increase in the growth rate, which, of course, is not true. The graphical faux pas was certainly avoidable had a bar chart been used.

Source: The Globe and Mail, Page B6, May 15.

Of course, the Globe and Mail is not alone in coming up with math that simply doesn’t add up. While covering the Scottish independence vote in September 2014, CNN reported that 110% of Scots had voted in the referendum, with 58% voting yes and another 52% voting no.

Source: Mail Online. September 19, 2014

The recent rise of data journalism has witnessed the emergence of data visualization where the editors increasingly reinforce narrative with creative info-graphics. While major news outlets such as The Economist, The New York Times, and the Wall Street Journal retained experts in data science and visualization, most newspapers have entrusted the task to the graphics departments that rely on tools that are not specifically designed for data visualization. At times, the outcome is math- and logic-defying graphics that present a false picture.

Even when charts correctly depict data, at times the visualizations are too complex for the ordinary newsreader to grasp. Powerful data visualization tools, such as D3 (a JavaScript library), are often abused to create graphics too rich in detail to comprehend. The use of Hierarchical Edge Bundling, for instance, is becoming increasingly popular in the news media, resulting in complex graphics that are visually impressive, but conceptually confusing.

Edward Tufte and Leland Wilkinson have spent a lifetime advising data enthusiasts on how to present data-driven information. Wilkinson is the author of The Grammar of Graphics, which sets out the fundamentals for presenting data. Wilkinson’s writings inspired Hadley Wickham to develop ggplot2, a graphing engine for R, which is increasingly becoming the tool of choice for data scientists.

Tufte inspired Dona M. Wong, who was the graphics director at the Wall Street Journal. Ms. Wong authored The Wall Street Journal Guide to Information Graphics. Her book is a quintessential guide for those who work with data and would like to present information as charts. She uses examples from the Journal to illustrate the dos and don’ts of presenting data as info-graphics.

Let us return to the forecasted metropolitan growth rates in Canada. I prefer the horizontal bar chart instead. The bar chart offers me several options to highlight the main argument in the story. If I were interested in highlighting cities with the highest gains in growth since 2014, I would sort the cities accordingly, as is illustrated in the graphic on the left (see below). If I were interested in highlighting cities with the highest forecasted growth rate, I would sort them accordingly to result in the graphic on the right.

Dona Wong insists on simplicity in rendering. She concludes her book with a simple message for data visualization: simplify, simplify, simplify. The two bar charts simplify the same information presented by the Globe. The results are obvious: I avoid misrepresenting data. One can readily see Halifax’s economy is forecasted to grow and Vancouver’s to shrink. The Globe’s rendering depicted exactly the opposite.
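For readers who would like to try the sorted horizontal bar chart themselves, here is a minimal base R sketch. The city labels and growth rates are placeholders invented for illustration; they are not the figures from the Globe's chart or from my graphics, and sorted_barh() is a hypothetical helper name.

```r
# Placeholder figures for illustration only; NOT the forecasts in the Globe chart.
growth <- data.frame(
  city  = c("City A", "City B", "City C", "City D", "City E"),
  g2014 = c(2.1, 3.0, 1.2, 2.6, 1.8),   # growth rate in 2014 (%)
  g2015 = c(2.5, 2.2, 1.9, 2.4, 2.9)    # forecasted growth rate (%)
)
growth$change <- growth$g2015 - growth$g2014

# sorted_barh() is a hypothetical helper: it draws a horizontal bar chart
# sorted by value and returns the labels in the order drawn (bottom to top).
sorted_barh <- function(values, labels, title, xlab) {
  ord <- order(values)   # ascending, so the largest bar is drawn at the top
  barplot(values[ord], names.arg = labels[ord], horiz = TRUE, las = 1,
          main = title, xlab = xlab)
  invisible(labels[ord])
}

op <- par(mfrow = c(1, 2), mar = c(4, 6, 3, 1))
left  <- sorted_barh(growth$change, growth$city,
                     "Change in growth rate", "Percentage points")
right <- sorted_barh(growth$g2015, growth$city,
                     "Forecasted growth rate", "Percent")
par(op)
```

Sorting by the quantity the story is about lets the reader see the ranking at a glance, and bars extending left of zero make a slowing economy unmistakable.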

UP Express in Toronto: A train less ridden

2015-04-23T13:24:00.001-04:00

What does a billion dollars' worth of transit investment get in Toronto? A piddly 5,000 daily riders. To put things in perspective, dozens of bus routes in Toronto carry more passengers every day than the trips forecasted for the Union-Pearson rail link (UP Express).

The rail link will connect Canada's two busiest transport hubs: Union Station and Pearson Airport. Despite the high-speed connector between the two hubs, transport authorities expect only 5,000 daily riders on the UP Express. The King Streetcar, in comparison, carries in excess of 65,000 daily riders.

The UP Express and the Sheppard subway extension are examples of transit money well wasted. A 2009 communiqué by Metrolinx estimated that the Georgetown expansion (including the UP Express) would cost over a billion dollars. The Globe and Mail reported that the Ontario government alone had invested $456 million in the UP Express. Instead of investing the scarce transit dollars in projects likely to deliver the highest increase in transit ridership, billions are being spent on projects that will have a marginal impact on addressing traffic congestion in the GTA.

Source: www.upexpress.com

With $29 billion in planned transport infrastructure investments, some of which will be publicised Thursday in the Ontario budget, the Province and the City need to have their priorities right. The very least would be to stop investing in projects that do not generate sufficient transit ridership.

One may argue that 5,000 fewer trips by automobile to and from the Airport should help in easing congestion in the GTA. However, with over 12 million daily trips in the GTA, 5,000 fewer trips are unlikely to make any meaningful difference in traffic congestion. At the same time, the taxpayers should focus on the cost-benefit trade-offs for transit investments. Notice the cost-benefit efficiency of the existing TTC bus service (192 Airport Rocket) to Pearson Airport, which carries over 4,000 daily passengers. A billion dollars later, the UP Express will move only about one thousand additional riders.
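The back-of-the-envelope arithmetic behind these claims can be written out in a few lines of R. The inputs are the round figures quoted in this post, treated as exact; the calculation also assumes, as the comparison above implicitly does, that the existing bus riders are counted within the UP Express forecast.

```r
# Round figures quoted in the post, treated as exact for the arithmetic.
capital_cost     <- 1e9     # roughly a billion dollars
up_daily_riders  <- 5000    # forecasted UP Express riders per day
bus_daily_riders <- 4000    # existing 192 Airport Rocket riders per day
gta_daily_trips  <- 12e6    # daily trips in the GTA

additional_riders         <- up_daily_riders - bus_daily_riders
cost_per_additional_rider <- capital_cost / additional_riders
share_of_gta_trips        <- up_daily_riders / gta_daily_trips

additional_riders                    # 1000 additional daily riders
cost_per_additional_rider            # 1e+06 dollars per additional daily rider
round(100 * share_of_gta_trips, 3)   # 0.042 percent of daily GTA trips
```

Under these assumptions, the capital cost works out to about a million dollars per additional daily rider, for a service that would carry roughly four hundredths of one percent of the region's daily trips.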

In North America, fewer than 10 airports are connected with local subway or regional rail transit. With the exception of Ronald Reagan Washington National Airport near Washington, DC, most other airports accessible by rail report approximately 5% transit trips to and from airports. The European experience, though, has been better. Almost 35% of the trips to and from Zurich airport were made on rail-based transit. Munich airport reported 40% of the trips by rail and bus.

Certain transit network attributes, which are missing for the UP Express, contribute to the strong transit ridership to and from airports. For instance, the rail-based service to high transit ridership airports does not terminate at the airport but instead continues further to serve the communities along the corridor. In addition, the airport lines at the successful airports are integrated with the rest of the rail-based transit system, instead of being a standalone line. The UP Express is a standalone rail line that connects to only one terminal at Pearson Airport. The prohibitive fare makes the ride uneconomical for travellers in groups of two or more, who would find a cab ride cheaper and more convenient from most parts of suburban Toronto.

Two other key factors limit the ridership potential of the UP Express. First, the Billy Bishop Airport near downtown Toronto caters to the short-haul business travel market. It has been argued in the past that business travellers originating in downtown Toronto would rather take the train than a cab to Pearson Airport. Given the frequency of service and choice of destinations served by the Billy Bishop Airport, business travellers increasingly favour the downtown airport, which eats into the UP Express’s potential market share.

In addition, the peak operations at Pearson Airport coincide with the morning and afternoon peak commuting times in Toronto. This implies that one would have to commute to Union Station in the morning and afternoon peak travel periods to ride the UP Express. The extra effort in time and money required to travel to downtown Toronto from the inner suburbs alone will deter riders from using the Union-Pearson rail link.

The UP Express is yet another monument dedicated to public transit misadventures while the region continues to suffer from gridlock. Getting the transit priorities right is necessary before Ontario doles out $29 billion.

Stata embraces Bayesian statistics

2015-04-09T13:50:00.000-04:00

Stata 14 has just been released. The new and big thing with version 14 is the introduction of Bayesian Statistics. A wide variety of new models can now be estimated with Stata by combining 10 likelihood models, 18 prior distributions, different types of outcomes, and multiple equation models. Stata has also made available a 255-page reference manual for free to illustrate Bayesian statistical analysis.

Of course, R already offers numerous options for Bayesian inference. It will be interesting to hear colleagues proficient in Bayesian statistics compare Stata’s newly added functionality with what has already been available from R.

Given the hype with big data and the newly generated demand for data mining and advanced analytics, it would have been timely for Stata to also add data mining and machine learning algorithms. My two cents: data mining algorithms are in greater demand than Bayesian statistics. Stata users will have to wait for a year or more to see such capabilities. In the meantime, R offers several options for data mining and machine learning algorithms.

R: What to do when help doesn't arrive

2014-11-03T13:56:00.000-05:00

R is great when it works. Not so much when it doesn’t. Specifically, this becomes a concern when the packages are not fully illustrated in the accompanying help documentation, and the authors of the package don’t respond (in time).

I am not suggesting that the package authors should respond to every email that they receive. My request is that the documentation should be complete enough so that the authors’ help is no longer required on a day-to-day basis.

Recently, a colleague in the US and I became interested in the mlogit package. We wanted to use the weights option in the package. Just like most other packages, mlogit does not illustrate how to use weights, but advises that the option is available. We assumed that the weights would work in a certain way (see page 26 of the hyperlinked document). However, when I estimated the model with weights, mlogit did not replicate the results from a popular textbook on econometrics. Here are the details.

We wanted to see whether the weights option could be used in an alternative-specific logit formulation when the sampled data do not conform to the market shares observed in the underlying population. For instance, in a travel choice model, one may be tempted to oversample train commuters and undersample car commuters because car commuters often far outnumber train commuters for inter-city travel in the underlying population. This is true for most of Canada and the US. In such circumstances, we would weight the data set so that the estimated model reproduces the population market shares rather than the sample shares.
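To illustrate what such weights look like, here is a minimal base R sketch of the usual choice-based weighting step, in which each observation is weighted by the population share of its chosen alternative divided by the sample share. The sample counts and population shares below are hypothetical, not the values in Greene's example, and the sketch covers only the construction of the weights. Whether mlogit's weights argument uses them in this way is exactly the open question.

```r
# Hypothetical counts and shares for illustration; not the values in Greene's example.
modes     <- c("air", "train", "bus", "car")
sample_n  <- c(air = 60, train = 60, bus = 30, car = 50)          # sampled choosers
pop_share <- c(air = 0.15, train = 0.10, bus = 0.10, car = 0.65)  # population shares

sample_share <- sample_n / sum(sample_n)

# Weight for everyone who chose mode j: population share / sample share.
w <- pop_share / sample_share

# Attach the weight to each sampled individual according to the chosen mode.
chosen     <- rep(modes, times = sample_n)
obs_weight <- w[chosen]

# Check: the weighted shares of the sample reproduce the population shares.
weighted_share <- tapply(obs_weight, chosen, sum) / sum(obs_weight)
round(weighted_share[modes], 3)
```

In this made-up sample, car commuters are undersampled (25% of the sample against 65% of the population), so each receives a weight of 2.6, while oversampled train commuters receive a weight of about 0.33.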

The commercially available software, NLogit/LimDep, can do this with ease. I wanted to replicate the results for choice-based weights for the conditional logit model in Professor Bill Greene's book, Econometric Analysis. This is illustrated on page 853 of the 6th edition of the text where Table 23.24 presents the parameter estimates for a conditional (McFadden) logit model for the un-weighted and the choice-based weighted versions. I replicated the results using NLogit with a simple addition of population market shares in the two-line syntax. However, the results generated by the mlogit package bear no resemblance to the ones listed in Econometric Analysis.

It turns out that Stata is also limited in the way it handles weights for the two estimation options: asclogit and clogit. I know this because colleagues at Stata were quite diligent in responding to my requests. It’s not the same with mlogit, which may or may not be able to handle the weights. We will only know when the author responds.

I am recommending that it should not be left to the individual authors to bear the sole responsibility for supporting the R packages. The individual could be ill, busy, or unavailable for a variety of reasons. This limitation could be proactively dealt with if the R community collectively generated help documentation with detailed worked-out examples of all available options (including weights), and not just the few frequently used ones.

Improving documentation will be key to helping R branch out to the everyday users of statistical analysis. The tech-savvy can iron out the kinks. They have the curiosity, patience, and time on their hands. The rest of the world is not that fortunate.

I propose that users of the packages, and not just the authors, should collaborate to generate help documentation as vignettes and YouTube videos. This will do more in popularizing R than another 6,000 new packages that few may know how to work with.

Summarize statistics by Groups in R & R Commander

2013-12-08T22:35:00.001-05:00

R is great at accomplishing complex tasks. Doing simple things with R though takes some effort. Consider the simple task of producing summary statistics for continuous variables over some factor variables. Using Stata, I’d write a brief one-liner to get the mean for one or more variables using another variable as a factor. For instance, tabstat Horsepower RPM, by(Type) in Stata produces the following:

The doBy package in R offers similar functionality and more. Of particular interest for those who teach R-based statistics courses in undergraduate programs is the doBy plugin for R Commander. The plugin was developed by Jonathan Lee, and it is a great tool for teaching and for quick data analysis. To get the same output as the one listed above, I’d click on the doBy plugin to get the following dialogue box:

The dialogue box results in the following simple syntax:

library(doBy)
summaryBy(Horsepower+RPM~Type, data=Cars93, FUN=c(mean))

You may first have to load the data set:
data(Cars93, package="MASS")

And the results are presented below:

Jonathan has also created GUIs for order by, sample by, and split by within the same plug-in. A must-use plug-in for data scientists.
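For readers who do not have doBy installed, a comparable table of group means can be sketched with base R's aggregate(). This is an alternative route to the same kind of summary, not what the plug-in generates, and the object name means_by_type is my own.

```r
# Cars93 ships with the MASS package (a recommended package bundled with R).
data(Cars93, package = "MASS")

# Mean Horsepower and RPM for each vehicle Type, using only base R.
means_by_type <- aggregate(cbind(Horsepower, RPM) ~ Type,
                           data = Cars93, FUN = mean)
means_by_type
```

The result is a data frame with one row per vehicle Type and one column for each averaged variable.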