Monday, May 20, 2019
Modern Data Science with R: A review
Saturday, March 10, 2018
R: simple for complex tasks, complex for simple tasks
library(stargazer)
stargazer(mtcars[c("mpg", "disp", "hp")], type="text")
============================================
Statistic N   Mean   St. Dev.  Min     Max
--------------------------------------------
mpg       32 20.091   6.027   10.400 33.900
disp      32 230.722 123.939  71.100 472.000
hp        32 146.688  68.563    52     335
--------------------------------------------
Wednesday, March 7, 2018
Is it time to ditch the Comparison of Means (T) Test?
Is Regression a valid substitute for T-tests?
Dataset
Hypothetically Speaking
Assuming Equal Variances (Stata)
Unequal Variances
Repeating the same analysis in R
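As a minimal sketch of the comparison in R, assuming the built-in mtcars data as a stand-in (the post's own data set is not reproduced here), the equal-variance t-test and a regression on a 0/1 dummy give the same t statistic up to sign. The Welch test for unequal variances, which is the default in t.test(), generally gives a different value.

```r
# Sketch only: mtcars stands in for the original data set (an assumption).
# am is a 0/1 dummy (0 = automatic, 1 = manual); mpg is the outcome.

# Two-sample t-test assuming equal variances (pooled variance)
tt_equal <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# Welch t-test (unequal variances), the default in R
tt_welch <- t.test(mpg ~ am, data = mtcars)

# OLS regression of the outcome on the dummy variable
fit <- lm(mpg ~ am, data = mtcars)
t_reg <- coef(summary(fit))["am", "t value"]

# The pooled and regression t statistics agree up to sign
print(c(pooled = unname(tt_equal$statistic),
        welch = unname(tt_welch$statistic),
        regression = t_reg))
```

The sign differs only because t.test() reports the first group minus the second, while the regression coefficient measures the second group relative to the first.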
Thursday, September 1, 2016
The X-Factors: Where 0 means 1
Consider the following example where I present a data set with two variables: x and y. I represent age in years as 'y' and gender as a binary (0/1) variable as 'x' where 1 represents males.
I compute the means for the two variables as follows:
The above code adds a new variable male to the data set, and assigns labels female and male to the categories 0 and 1 respectively.
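The original code is not shown here, so the following is only a sketch with made-up values; the data frame name df and its contents are assumptions. It illustrates the trap the title hints at: once a 0/1 variable becomes a labelled factor, R stores the categories internally as 1 and 2, so converting the factor back to numbers no longer returns the original coding.

```r
# Hypothetical data: y is age in years, x is gender coded 0/1 (1 = male)
df <- data.frame(
  x = c(0, 1, 1, 0, 1, 0),
  y = c(23, 35, 41, 29, 52, 38)
)

# Means of the two numeric variables; the mean of x is the share of males
print(colMeans(df[c("x", "y")]))

# Add a labelled factor: 0 becomes "female", 1 becomes "male"
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# as.numeric() on a factor returns the internal codes (1, 2), not 0/1
codes <- as.numeric(df$male)
print(mean(codes))

# One way to recover the 0/1 coding is to compare against the label
male01 <- as.numeric(df$male == "male")
print(mean(male01))
```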
Monday, August 22, 2016
Five Questions about Data Science
1. What are some examples of data science altering or impacting traditional professional roles already?
The traditional role for those who analyzed data was that of a computer programmer or a statistician. In the past, firms collected large amounts of data to archive rather than to subject it to analytics to assist with smart decision-making. Companies did not see value in turning data into insights and instead relied on the gut feeling of managers and anecdotal evidence to make decisions.
Big data and analytics have alerted businesses and governments to the latent potential of turning bits and bytes into profits. To enable this transformation, hundreds of thousands of data scientists and analysts are needed. Recent reports suggest that the shortage of such professionals will run into the millions. No wonder we see hundreds of postings for data scientists on LinkedIn.
As businesses increasingly depend upon analytics-driven decision making, data scientists and analysts are simultaneously becoming front-office superstars, which is quite a change from them being the back office workers in the past.
2. What steps can a professional take today to learn how and why to implement data science into their current role?
Sooner rather than later, workers will find their managers asking them to assume additional responsibilities that involve dealing with data, and either generating or consuming analytics. Smart professionals who are uninitiated in data science would therefore proactively address this shortcoming in their portfolio by acquiring skills in data science and analytics. Fortunately, in a world awash with data, the opportunities to acquire analytic skills are also ubiquitous.
For starters, professionals should consider enrolling in open online courses offered by the likes of Coursera and BigDataUniversity.com. These platforms offer a wide variety of training opportunities for beginners and advanced users of data and analytics. At the same time, most of these offerings are free.
For those professionals who would like to pursue a more structured approach, I suggest that they consider continuing education programs offered by the local universities focusing on data and analytics. While working full-time, the professionals can take part-time courses in data science to fill the gap in their learning and be ready to embrace impending change in their roles.
3. Do you need programming experience to get started in data? What kind of methods and techniques can you utilize in a program more commonly used, such as Excel?
Computer programming skills are a definite plus for data scientists, but they are certainly not a limiting factor that would prevent those trained in other disciplines from joining the world of data scientists. In my book, Getting Started with Data Science, I mentioned examples of individuals who took short courses in data science and programming after graduating from non-empirical disciplines, and subsequently were hired in data scientist roles that paid lucrative salaries.
The choice of analytics tools depends largely on the discipline and the type of organization you are currently working for or intend to work for in the future. If you intend to work for corporations that generate real big data, such as telecom and Internet-based establishments, you need to be proficient in big data tools, such as Spark and Hadoop. If you would like to be employed in the industry that tracks social media, you would require skills in natural language processing and proficiency in languages such as Python. If you happen to be interested in a traditional market research firm, you need proficiency in analytics software, such as SPSS and R.
If your focus is on small and medium size enterprises, proficiency in Excel could be a great asset, which would allow you to deploy its analytics capabilities, such as Pivot Tables, to work with small sized data.
A successful data scientist is one who knows some programming, has a basic understanding of statistical principles, possesses a curious mind, and is capable of telling great stories. I argue that without storytelling capabilities, a data scientist will be limited in his or her ability to become a leader in the field.
4. How do you see data science affecting education and training moving forward? What benefits will it bring to learning at all levels?
Schools, colleges, universities and others involved in education and learning are putting big data and analytics to good use. Universities are crunching large amounts of data to determine what gaps in learning at the high school level act as impediments to success in the future. Schools are improving not just curriculum, but also other strategies to improve learning outcomes. For instance, research in India using large amounts of data showed that when children in low-income communities were offered free meals at school, their dropout rates declined and their academic achievements improved.
Big data and analytics provide instructors and administrators the opportunity to test their hypotheses about what works and what doesn’t in learning, and replace anecdotes with hard evidence to improve pedagogy and learning. Learning has taken a new shape and form with open online courses in all disciplines. These transformative changes in learning have been enabled by advances in information and communication technologies, and the ability to store massive amounts of data.
5. Do you think that modern governments and societies are prepared for the changes that big data and data science might bring to the world?
Change is inevitable. Whatever modern governments and societies may prefer, they will have to embrace change. Fortunately, smart governments and societies have already embraced data-driven decision-making and evidence-based planning. Governments in developing countries are already using data and analytics to devise effective poverty-reducing strategies. Municipal governments in developed economies are using data and advanced analytics to find solutions to traffic congestion. Research in health and well-being is leveraging big data to discover new medicines and cures for illnesses that challenge us all.
As societies embrace data and analytics as tools to engineer prosperity and well-being, our collective abilities to achieve a better tomorrow will be further enhanced.
Wednesday, January 13, 2016
Getting Started with Data Science: Storytelling with Data
The very purpose of authoring this book was to rethink the way we have been teaching statistics and analytics to students and practitioners. It is no secret that most students required to take the mandatory stats course dislike it. I believe this has more to do with the way we have been teaching the subject than with the aptitude of our students. Furthermore, I believe there is a greater opportunity to equip students with the skills needed in a world awash with data, where competing on analytics defines the real competitive advantage.
No wonder, the latest issue of the leading publication on the subject, The American Statistician, is dedicated to reimagining how statistics should be taught in the undergraduate curriculum. The editors noted:
“We hope that this collection of articles as well as the online discussion provide useful fodder for further review, assessment, and continuous improvement of the undergraduate statistics curriculum that will allow the next generation to take a leadership role by making decisions using data in the increasingly complex world that they will inhabit.”
I am confident that my book will do its small part in equipping the next generation of students with the kind of skills needed to succeed in a data-centric world. For one, I have taken a storytelling approach to statistics. This book reinforces the point that data science and analytics training should be applied rather than theoretical, and the ultimate purpose of producing or consuming statistical analysis is to tell fascinating stories from it. Therefore, the book opens with the chapter titled The Bazaar of Storytellers.
Who is this book for?
While the world is awash with large volumes of data, inexpensive computing power, and vast amounts of digital storage, the skilled workforce capable of analyzing data and interpreting it is in short supply. A 2011 McKinsey Global Institute report suggests that “the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.”
Getting Started with Data Science (GSDS) is a purpose-written book targeted at those professionals who are tasked with analytics but do not have the comfort level needed to be proficient in data-driven analytics. GSDS appeals to those students who are frustrated with the impractical nature of the prescribed textbooks and are looking for an affordable text to serve as a long-term reference. GSDS embraces the 24-7 streaming of data and is structured for those users who have access to data and software of their choice, but do not know what methods to use, how to interpret the results, and most importantly how to communicate findings as reports and presentations in print or on-line.
GSDS is a resource for millions employed in knowledge-driven industries where workers are increasingly expected to facilitate smart decision-making using up-to-date information that sometimes takes the form of continuously updating data.
At the same time, the learning-by-doing approach in the book is equally suited for independent study by senior undergraduate and graduate students who are expected to conduct independent research for their coursework or dissertations.
Praise for the book
I am also pleased to share with you the praise for my book by Dr. Munir Sheikh, Canada’s former chief statistician:
“The power of data, evidence, and analytics in improving decision-making for individuals, businesses, and governments is well known and well documented. However, there is a huge gap in the availability of material for those who should use data, evidence, and analytics but do not know how. This fascinating book plugs this gap, and I highly recommend it to those who know this field and those who want to learn.”
- Munir A. Sheikh, Ph.D., Distinguished Fellow and Adjunct Professor at Queen’s University
Tom Davenport, author of the bestselling books Competing on Analytics and Big Data @ Work, has the following to say about my book:
“A coauthor and I once wrote that data scientists held ‘the sexiest job of the 21st century.’ This was not because of their inherent sex appeal, but because of their scarcity and value to organizations. This book may reduce the scarcity of data scientists, but it will certainly increase their value. It teaches many things, but most importantly it teaches how to tell a story with data.”
- Thomas H. Davenport, Distinguished Professor, Babson College; Research Fellow, MIT.
Dr. Patrick Surry, Chief Data Scientist at www.Hopper.com had the following to say:
“This book addresses the key challenge facing data science today, that of bridging the gap between analytics and business value. Too many writers dive immediately into the details of specific statistical methods or technologies, without focusing on this bigger picture. In contrast, Haider identifies the central role of narrative in delivering real value from big data.
“The successful data scientist has the ability to translate between business goals and statistical approaches, identify appropriate deliverables, and communicate them in a compelling and comprehensible way that drives meaningful action. To paraphrase Tukey, ‘Far better an approximate answer to the right question, than an exact answer to a wrong one.’ Haider’s book never loses sight of this central tenet and uses many real-world examples to guide the reader through the broad range of skills, techniques, and tools needed to succeed in practical data-science.
“Highly recommended to anyone looking to get started or broaden their skillset in this fast-growing field.”
And finally, Professor Atif Mian, author of the best-selling book House of Debt, offered the following assessment:
“We have produced more data in the last two years than all of human history combined. Whether you are in business, government, academia, or journalism, the future belongs to those who can analyze these data intelligently. This book is a superb introduction to data analytics, a must-read for anyone contemplating how to integrate big data into their everyday decision making.”
- Professor Atif Mian, Theodore A. Wells ’29 Professor of Economics and Public Affairs, Princeton University; Director of the Julis-Rabinowitz Center for Public Policy and Finance at the Woodrow Wilson School.
Sunday, December 6, 2015
Not so sweet sixteen!
If you happen to use Microsoft Excel for running regressions, you will soon realize your limits: the Windows version of Excel 2013 permits no more than 16 explanatory variables.
Excel has made great progress in expanding its capabilities in the recent past. Unlike the few thousand rows in the past, the current version permits about a million rows per sheet (a single data set). But when it comes to regression, even if you have several thousand observations in the data set, you are still limited by a hard constraint of sixteen explanatory variables.
Some would argue that for parsimony, we should be content with the restriction. True, but with categorical variables, each of which enters a regression as a set of dummy variables, the number of explanatory variables can easily stretch beyond the artificial constraint set by Microsoft Excel.
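To see how quickly categorical variables use up a budget of sixteen regressors, here is a small sketch in R. The factor named region and its 20 categories are made up for illustration. A regression needs one dummy column for every category except the reference one.

```r
# Hypothetical categorical variable with 20 categories (illustration only)
region <- factor(rep(paste0("region_", 1:20), times = 5))

# model.matrix() builds the design matrix a regression would use:
# an intercept plus one 0/1 dummy for each non-reference category
X <- model.matrix(~ region)

n_dummies <- ncol(X) - 1  # exclude the intercept column
print(n_dummies)
```

A single variable with 20 categories already requires 19 dummy columns, which is more than the sixteen explanatory variables Excel 2013 accepts.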
Others might ask why anyone would do statistical analyses in Excel in the first place. Despite the inherent limitations of Microsoft Excel, business schools in particular, and other social science undergraduate programs in general, are increasingly turning to Excel to teach courses in statistics. If you were to take a quick look at the curriculum of undergraduate business and numerous MBA programs, you would realize how widespread the use of Excel is for courses in statistics and analytics.
At Ryerson University, I switched to R years ago for my MBA courses. Thanks to John Fox’s R Commander, the transition to R was without much hassle. The students were told in the very beginning that they were now part of the big league, and hiding behind spreadsheets was no longer an option.
I must mention that Microsoft Excel continues to be my platform of choice for a variety of tasks. I use Excel several times a day, but not for statistical analysis. I am not suggesting that Excel cannot do statistics; I am arguing that it could do a much better job of it.
As I see it, Microsoft has several options. The first is to do nothing. After all, Microsoft Excel has no real competition in the Windows environment. Second, it could turn to the team that programmed the linest function in Excel and ask them to add some muscle to it. That would be the wrong approach.
Instead, Microsoft should explore ways to integrate R or another free tool with Excel to add a complete analytics menu. Microsoft should learn from what the leaders in analytics are already doing. SPSS, an industry leader in the analytics category, has already integrated R, allowing SPSS users to merge the robust data management strengths of SPSS with the state-of-the-art analytics bundled with R. SAS, another big name in analytics, is about to do the same.
And since Microsoft has recently acquired Revolution R, it makes even more sense to build a bridge between Excel and Revolution R Open (RRO).
R Through Excel is one example of integrating R with Excel. If Microsoft were to put its weight behind the initiative, it could build a seamless coupling with R, expanding the analytic capabilities for hundreds of millions of Excel users.
As for SPSS, I recommend that its makers also consider another option. If Microsoft were to integrate RRO with Excel, they could acquire an advanced analytics package and integrate it with SPSS. For this option, I would recommend Limdep, which I have found to be the most diverse software for statistical analysis and econometrics. Even though R is a collective effort of thousands of software developers, Limdep offers numerous routines and post-estimation options that are not available in the thousands of R packages. SPSS integrated with Limdep could become the most diversely capable commercial software in the market, as it would bridge the gap with SAS and Stata.
As for colleagues in business faculties pondering what platform to adopt for analytics and software courses, I would say: know your limits, especially with Microsoft Excel, when deciding upon the curriculum.
Friday, October 30, 2015
Curious about big data in Montreal?
www.BigDataUniversity.com, an IBM-led initiative, is running meetups across North America to create awareness about, and provide training in, big data analytics.
BigDataUniversity runs MOOCs and, through its online data scientist workbench, provides access to Python, R, and even Spark. You can also learn about Watson Analytics and see how you can work with the state of the art in computing.
Further details are available at:
Getting started with Data Science and Introduction to Watson Analytics
http://www.meetup.com/YUL-Social-Mobile-Analytics-Cloud-Meetup/
Wednesday, May 20, 2015
Are Canadian newspapers painting false pictures with data?
Thursday, April 9, 2015
Stata embraces Bayesian statistics
Of course R already offered numerous options for Bayesian Inference. It will be interesting to hear from colleagues proficient in Bayesian statistics to compare Stata’s newly added functionality with what has already been available from R.
Given the hype around big data and the newly generated demand for data mining and advanced analytics, it would have been timely for Stata to also add data mining and machine learning algorithms. My two cents: data mining algorithms are in greater demand than Bayesian statistics. Stata users will have to wait for a year or more to see such capabilities. In the meantime, R offers several options for data mining and machine learning algorithms.
Sunday, December 8, 2013
Summarize statistics by Groups in R & R Commander
R is great at accomplishing complex tasks. Doing simple things with R though takes some effort. Consider the simple task of producing summary statistics for continuous variables over some factor variables. Using Stata, I’d write a brief one-liner to get the mean for one or more variables using another variable as a factor. For instance, tabstat Horsepower RPM, by(Type) in Stata produces the following:
The doBy package in R offers similar functionality and more. Of particular interest to those who teach R-based statistics courses in undergraduate programs is the doBy plugin for R Commander. The plugin was developed by Jonathan Lee, and it is a great tool for teaching and for quick data analysis. To get the same output as the one listed above, I’d click on the doBy plugin to get the following dialogue box:
The dialogue box results in the following simple syntax:
summaryBy(Horsepower+RPM~Type, data=Cars93, FUN=c(mean))
You may first have to load the data set:
data(Cars93, package="MASS")
And the results are presented below:
Jonathan has also created GUIs for order by, sample by, and split by within the same plug-in. It is a must-use plug-in for data scientists.
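For readers who do not have the doBy package installed, a rough base R equivalent of the grouped means uses aggregate(). This is a sketch of an alternative, not the syntax the plug-in generates.

```r
# Cars93 ships with the MASS package
data(Cars93, package = "MASS")

# Mean Horsepower and RPM for each vehicle Type
by_type <- aggregate(cbind(Horsepower, RPM) ~ Type, data = Cars93, FUN = mean)
print(by_type)
```

The result is a data frame with one row per level of Type and one column of means for each variable.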









