Monday, May 20, 2019
Modern Data Science with R: A review
Monday, October 8, 2018
A question and an answer about recoding several factors simultaneously in R
The Answer
lebatsnok (https://stackoverflow.com/users/2787952/lebatsnok) answered the question on Stack Overflow. The solution is simple. The following code works:
"levels(x) is a character vector with length 6, as.numeric(x) is a logical vector with length 300. So you're trying to index a short vector with a much longer logical vector. In such an indexing, the index vector acts like a "switch", TRUE indicating that you want to see an item in this position in the output, and FALSE indicating that you don't. So which elements of levels(x) are you asking for? (This will be random, you can make it reproducible with set.seed if that matters."
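To make the "switch" behaviour described in the quoted answer concrete, here is a minimal R sketch. The data, the variable names (lev, keep, picked, labels), and the condition x == "a" are illustrative assumptions; this is not the code from the original question or answer.

```r
# Hypothetical data: a factor with 6 levels and 300 observations.
set.seed(1)
x <- factor(sample(letters[1:6], 300, replace = TRUE), levels = letters[1:6])

lev  <- levels(x)   # character vector of length 6
keep <- x == "a"    # logical vector of length 300 (assumed condition)

# Logical indexing acts as a switch: TRUE in positions 1 to 6 picks the
# corresponding level, and TRUE in any position beyond 6 yields NA.
picked <- lev[keep]

# Integer indexing with the factor codes maps every observation to its label.
labels <- lev[as.integer(x)]
```

In this sketch, picked has one element per TRUE in keep and is mostly NA, while labels reproduces as.character(x). Indexing the levels by the integer codes is usually what one wants when recoding.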
Wednesday, March 7, 2018
Is it time to ditch the Comparison of Means (T) Test?
Is Regression a valid substitute for T-tests?
Dataset
Hypothetically Speaking
Assuming Equal Variances (Stata)
Unequal Variances
Repeating the same analysis in R
Monday, August 22, 2016
Five Questions about Data Science
1. What are some examples of data science altering or impacting traditional professional roles already?
The traditional role for those who analyzed data was that of a computer programmer or a statistician. In the past, firms collected large amounts of data to archive rather than to subject it to analytics to assist with smart decision-making. Companies did not see value in turning data into insights and instead relied on the gut feeling of managers and anecdotal evidence to make decisions.
Big data and analytics have alerted businesses and governments to the latent potential of turning bits and bytes into profits. To enable this transformation, hundreds of thousands of data scientists and analysts are needed. Recent reports suggest that the shortage of such professionals will run into the millions. No wonder we see hundreds of postings for data scientists on LinkedIn.
As businesses increasingly depend upon analytics-driven decision making, data scientists and analysts are becoming front-office superstars, which is quite a change from their past as back-office workers.
2. What steps can a professional take today to learn how and why to implement data science into their current role?
Sooner rather than later, workers will find their managers asking them to assume additional responsibilities that involve dealing with data, and either generating or consuming analytics. Smart professionals who are uninitiated in data science would therefore proactively address this shortcoming in their portfolio by acquiring skills in data science and analytics. Fortunately, in a world awash with data, the opportunities to acquire analytic skills are also ubiquitous.
For starters, professionals should consider enrolling in open online courses offered by the likes of Coursera and BigDataUniversity.com. These platforms offer a wide variety of training opportunities for beginners and advanced users of data and analytics. Moreover, most of these offerings are free.
For those professionals who would like to pursue a more structured approach, I suggest that they consider continuing education programs offered by the local universities focusing on data and analytics. While working full-time, the professionals can take part-time courses in data science to fill the gap in their learning and be ready to embrace impending change in their roles.
3. Do you need programming experience to get started in data? What kind of methods and techniques can you utilize in a program more commonly used, such as Excel?
Computer programming skills are a definite plus for data scientists, but they are certainly not a limiting factor that would prevent those trained in other disciplines from joining the world of data scientists. In my book, Getting Started with Data Science, I mentioned examples of individuals who took short courses in data science and programming after graduating from non-empirical disciplines, and subsequently were hired in data scientist roles that paid lucrative salaries.
The choice of analytics tools depends largely on the discipline and the type of organization you are currently working for or intend to work for in the future. If you intend to work for corporations that generate truly big data, such as telecom and Internet-based establishments, you need to be proficient in big data tools, such as Spark and Hadoop. If you would like to be employed in the industry that tracks social media, you would require skills in natural language processing and proficiency in languages such as Python. If you happen to be interested in a traditional market research firm, you need proficiency in analytics software, such as SPSS and R.
If your focus is on small and medium-sized enterprises, proficiency in Excel could be a great asset, which would allow you to deploy its analytics capabilities, such as Pivot Tables, to work with small data sets.
A successful data scientist is one who knows some programming, has a basic understanding of statistical principles, possesses a curious mind, and is capable of telling great stories. I argue that without storytelling capabilities, a data scientist will be limited in his or her ability to become a leader in the field.
4. How do you see data science affecting education and training moving forward? What benefits will it bring to learning at all levels?
Schools, colleges, universities and others involved in education and learning are putting big data and analytics to good use. Universities are crunching large amounts of data to determine what gaps in learning at the high school level act as impediments to success in the future. Schools are improving not just curriculum, but also other strategies to improve learning outcomes. For instance, research in India using large amounts of data showed that when children in low-income communities were offered free meals at school, their dropout rates declined and their academic achievements improved.
Big data and analytics provide instructors and administrators the opportunity to test their hypotheses about what works and what doesn’t in learning, and replace anecdotes with hard evidence to improve pedagogy and learning. Learning has taken a new shape and form with open online courses in all disciplines. These transformative changes in learning have been enabled by advances in information and communication technologies, and the ability to store massive amounts of data.
5. Do you think that modern governments and societies are prepared for the changes that big data and data science might bring to the world?
Change is inevitable. Whatever modern governments and societies may prefer, they will have to embrace change. Fortunately, smart governments and societies have already embraced data-driven decision-making and evidence-based planning. Governments in developing countries are already using data and analytics to devise effective poverty-reducing strategies. Municipal governments in developed economies are using data and advanced analytics to find solutions to traffic congestion. Research in health and well-being is leveraging big data to discover new medicines and cures for illnesses that challenge us all.
As societies embrace data and analytics as tools to engineer prosperity and well-being, our collective abilities to achieve a better tomorrow will be further enhanced.
Monday, July 18, 2016
Data Science Boot Camp completed at Ryerson University
Friday, October 30, 2015
Curious about big data in Montreal?
www.BigDataUniversity.com, an IBM-led initiative, is running meetups across North America to create awareness about, and training in, big data analytics.
BigDataUniversity runs MOOCs and, through its online data scientist workbench, provides access to Python, R, and even Spark. You can also learn about Watson Analytics and see how you can work with the state of the art in computing.
Further details are available at:
Getting started with Data Science and Introduction to Watson Analytics
http://www.meetup.com/YUL-Social-Mobile-Analytics-Cloud-Meetup/
Monday, July 30, 2012
Big data, big analytics, big opportunity
Data, data, every where
Nor any byte to think
The world today is awash with data. Corporations, governments, and individuals are busy generating petabytes of data on culture, economy, environment, religion, and society. While data has become abundant and ubiquitous, data analysts needed to turn raw data into knowledge are in fact in short supply.
With big data comes big opportunity for the educated middle class in the developing world where an army of data scientists can be trained to support the offshoring of analytics from the western countries where such needs are unlikely to be filled from the locally available talent.
In a 2011 report, McKinsey Global Institute revealed that the United States alone faces a shortage of almost 200,000 data analysts. The American economy requires an additional 1.5 million managers proficient in decision-making based on insights gained from the analysis of large data sets. And even when Hal Varian, Google’s famed chief economist, profoundly proclaimed that “the real sexy job in 2010s is to be a statistician,” there were not many takers for the opportunity in the West where students pursuing degrees in statistics, engineering, and other empirical fields are small in number and are often visa students from abroad.
A recent report by Statistics Canada revealed that two-thirds of those who graduated with a PhD in engineering from a Canadian university in 2005 spoke neither English nor French as their mother tongue. Similarly, four out of 10 PhD graduates in computers, mathematics, and physical sciences did not speak a western language as their mother tongue. Also, more than 60 per cent of engineering graduates were visible minorities, suggesting that the supply chain of highly qualified professional talent in Canada, and to a large extent in North America, is already linked to the talent emigrating from China, Egypt, India, Iran, and Pakistan.
The abundance of data and the scarcity of analysts present a unique opportunity for developing countries, which have an abundant supply of highly numerate youth who could be trained and mobilized en masse to write a new chapter in offshoring. This would require a serious rethink for thought leaders in developing countries who have not taxed their imaginations beyond thinking of policies to create sweat shops where youth would undersell their skills and see their potential wilt away while creating undergarments for consumers in the west. The fate of the youth in developing countries need not be restricted to stitching underwear or making cold calls from offshored call-centers in order for them to be part of the global value chains. Instead, they can be trained as skilled number-crunchers who would add value to otherwise worthless data for businesses, big and small.
A multi-billion dollar industry
The past decade has witnessed a major change in the sectoral evolution of some very large manufacturing firms, known in the past mostly for hardware engineering and now evolving into firms delivering services, such as business analytics. Take IBM for example, which specialized as a computer hardware company producing servers, desktop computers, laptops, and other supporting infrastructure. That was IBM’s past. Today, IBM is focused on analytics. It is spending hundreds of millions of dollars in advertising, trying to rebrand itself as a leader in business analytics. In fact, it has divested from several hardware initiatives, such as manufacturing laptops, and has instead spent billions in acquisitions to build its analytic credentials. For instance, IBM acquired SPSS for over a billion dollars to capture the retail side of the business analytics market. For large commercial ventures, IBM acquired Cognos to offer full-service analytics.
In 2011 alone, the business analytics software market was worth over $30 billion. Oracle ($6.1 bn), SAP ($4.6 bn), IBM ($4.4 bn), and Microsoft and SAS (each with $3.3 bn in sales) led the market. It is estimated that the sale of business analytics software alone will hit $50 billion by 2016. Dan Vesset of IDC, a company specializing in watching industry trends, aptly noted that business analytics had “crossed the chasm into the mainstream mass market” and the “demand for business analytics solutions is exposing the previously minor issue of the shortage of highly skilled IT and analytics staff.”
In addition to the bundled software and service sales offered by the likes of Oracle and IBM, business analytics services in the consulting domain generated several billion dollars more worldwide. While the large firms command the lion’s share in the analytics market, the billions left at the bottom are still a large enough prize to take the analytics plunge.
Several billion reasons to hop on the analytics bandwagon
While the IBMs of the world are focused largely on large corporations, the analytics needs of small and medium-sized enterprises (SMEs) are unlikely to be met by IBM, Oracle, or other large players. Cost is the most important determinant. SMEs prefer to have analytics done on the cheap, while the overheads of the large analytics firms run into millions of dollars, thus pricing them out of the SME market. With offshoring comes access to affordable talent in developing countries who can bid for smaller contracts and beat the competition in the West on price, and over time on quality as well.
The trick, therefore, is to beat the IBMs of the world in the analytics game by not competing against them. Realizing that business analytics is not a single market, but an amalgamation of several types of markets focused on delivering value-added services involving data capture, data warehousing, data cleaning, data mining, and data analysis, developing countries can carve out a niche for themselves by focusing exclusively on contracts that large firms will not bid for because of their intrinsically large overheads.
Leaving the fight for top dollars in analytics to top dogs, a cottage industry in analytics could be developed in the developing countries that may strive to serve the analytics needs of SMEs. Take the example of the Toronto Transit Commission (TTC), Canada’s largest public transit agency with annual revenues exceeding a billion dollars. When TTC needed to have a large database of almost a half million commuter complaints analyzed, it turned to Ryerson University, rather than a large analytics firm. TTC’s decision to work with Ryerson University was motivated by two considerations. The first is cost: as a public sector university, Ryerson believes strongly in serving the community and thus offered the services gratis. The second is quality. Ryerson University, like most similar institutions of higher learning, excels in analytics, where several faculty members work at the cutting edge of analytics and are more than willing to apply their skills to real-life problems.
Why now?
The timing has never been better to undertake such an endeavor on a very large scale. Innovations in Information and Communication Technology (ICT) and the ready availability of the most advanced analytics software as freeware allow entrepreneurs in developing countries to compete worldwide. The Internet makes it possible to be part of global marketplaces with negligible costs. With cyber marketplaces such as Kijiji and Craigslist, individuals can become proprietors offering services worldwide.
Using the freely available Google Sites, one can have a business website online immediately at no cost. Google Docs, another free service from Google, gives one what amounts to a free web server for sharing documents with collaborators or the rest of the world. Other free services, such as Google Trends, allow individual researchers to generate data on business and social trends without needing subscriptions for services that cost millions. The graph below, generated using Google Trends, shows daily visits to the websites of leading analytics firms. Without free access to such services, access to the data used to generate the same graph would carry a huge price tag.
Similarly, another free service from Google allows one to determine, for instance, which cities registered the highest number of search requests for ‘business analytics’. It appears that four of the top six cities where analytics are most popular are located in India, which is evident from the following graph where search intensity is mapped on a normalized index of 0 to 100.
The other big development of recent times is freeware that is leveling the playing field between haves and have-nots. In analytics, one of the most sophisticated computing platforms is R, which is available for free. Developers worldwide are busy developing the R platform, which now offers over 3,000 packages for free for analyzing data. From econometrics to operations research, R is fast becoming the lingua franca for computing. R has evolved from being popular just amongst computing geeks to having its praises sung by the New York Times.
R has also made some new friends, especially Paul Butler, a Canadian student who became a worldwide sensation by mapping the geography of Facebook. While interning at Facebook, Paul analyzed gigabytes of data to plot how Facebook friends were linked globally. His map (see the image below) became an instant hit worldwide and has been reproduced in publications thousands of times. If you are wondering what software Paul used to generate the map, wonder no more: the answer is R.
R is fast becoming the preferred computing platform for data scientists worldwide. For decades the data analysis market was ruled by the likes of SAS, SPSS, Stata and other similar players. Of late, R has captured the imagination of data analysts, who are fast converging on R, especially given R’s ability to interact with Hadoop (another open source platform) for analyzing big data. In fact, most innovations in statistics are first coded in R so that the algorithms become available to all immediately and for free.
Source: http://r4stats.com/articles/popularity/
The fact that R is freely available should not be taken lightly. A commercial license of a similarly equipped version of SPSS may cost up to US$7,500. The other big advantage of using R is that thousands of training documents on the Internet and videos on YouTube have also been made available for free by volunteers.
Where to next
The private sector has to take the lead for business analytics to take root in developing countries. The governments could also have a small role in regulation. However, the analytics revolution has to take place not because of the public sector, but in spite of it. Even public sector universities in developing countries cannot be entrusted with the task, as senior university administrators do not warm up to innovative ideas unless they involve a junket in Europe or North America. At the same time, the faculty in public sector universities in developing countries is often unwilling to try new technologies.
The private sector in developing countries may want to first launch an industry group that takes on the task of certifying firms and individuals interested in analytics for quality, reliability, and ethical and professional competencies. This will help build confidence around national brands. Without such certification, foreign clients will be apprehensive about sharing their proprietary data with individuals hidden behind computer monitors thousands of miles away.
The private sector will also have to take the lead in training a professional workforce in analytics. Several companies train their employees in the latest technology and then market their skills to clients. The training houses would therefore also double as consulting practices where the best graduates may be retained as consultants.
Small virtual marketplaces could be set up in large cities where clients can post requests for proposals and pre-screened, qualified bidders can compete for the contract. The national self-regulating body will be responsible for screening qualified bidders from its vendor-of-record database, which it would make available to clients globally through the Internet.
The IBMs of the world expect the analytics market to hit hundreds of billions in revenue in the next decade. The abundant talent in developing countries can be polished into a skilled workforce to tap into the analytics market and channel some revenue to developing countries, while creating gainful employment opportunities for the educated youth who have been reduced to making cold calls from offshored call centers.
Thursday, January 6, 2011
In praise of the article
As a non-native speaker of the English language, I have always struggled with the elusive article, especially ‘the’. When ‘the’ should be used is not intuitive to me. Therefore, I rely on rules to determine when to use an article.
Over the years one should not expect any change in the frequency of use of articles in the English language. However, one could observe a significant decline in the use of the definite article (the) in American and British English. See the graph below, which shows that in American English the definite article ‘the’ represented 5.5% of the words used in the books published in English in the United States. These are the books scanned by Google as part of its initiative to digitize every published book. However, one sees a decline in the use of the article ‘the’ starting in the 1970s. I wonder why. Is the language referring more to proper nouns, and hence the decline in ‘the’? Also, ‘the’ has been used much more frequently than ‘a’ or ‘an’.
The books published in English in England and scanned by Google present a very similar trend, which is visible in the graph below.
Data through history
Analytics and data are becoming ubiquitous in finance, politics, and other spheres of life, such as friendships, where people now boast about how many friends they have on Facebook. It was, however, not very long ago that the word data was not even part of the everyday lexicon. See the graph below, which shows the evolution of the word data over the past 100 years in the books digitized by Google. The graph immediately below is that of the word data used in books published in English in the United States. The y-axis presents the share of the word data in a given year as a percentage of all words published in books in that particular year.
Data saw an early increase in its mention in the 1920s in American English. However, it was only in the 1960s that the use of data became more pronounced, and it remained so until the mid-1980s. It was the period when Robert McNamara, the most prominent of all quants, tried to win a war in Vietnam by improving the analytics. He failed. A decline in its mention is observed in the 1990s, followed by a quick reversal with a rapid increase from the late 1990s to the first few years of the new millennium. Since then, the mention of the word data has declined again.
The graph below shows the same for books in English that were published in England. The decline in its mention in the past decade seems to be levelling off in the UK.
Saturday, December 25, 2010
Using statistics to understand armed conflict
Drew Conway, a doctoral student in New York, uses statistical analysis to make sense of armed conflicts. Pasted below is the graphic he developed in July 2010 from analyzing Wikileaks data about Afghanistan. He used R to generate the graphic.
The gold standard in newsroom graphics: The New York Times
Watch Amanda Cox explain how The New York Times uses graphics in its print and online editions. The New York Times has been at the cutting edge of using data and graphics. The hour-long video is worth watching for anyone interested in using data to communicate.
Friday, December 24, 2010
Data journalism at its best
Watch how journalists at the Guardian used data to paint a hitherto hidden picture of the Afghan war.
http://www.guardian.co.uk/world/datablog/interactive/2010/jul/26/ied-afghanistan-war-logs
















