Data Smart, Ch3, Classifying Tweets using Naive Bayes

Executive Summary: In chapter 3 of the book Data Smart, by John Foreman, chief data scientist at Mailchimp, the author develops a Naive Bayes classifier in Excel to determine whether tweets containing the word ‘mandrill’ are related to Mailchimp’s Mandrill email-transaction app. Whereas the author used Excel, we choose to use …

Continue reading ‘Data Smart, Ch3, Classifying Tweets using Naive Bayes’ »

Data Smart, Ch8, Forecasting Seasonal Demand For Replica Swords

Executive Summary: In chapter 8 of John Foreman’s book, Data Smart, he turns to forecasting demand for a fictional replica sword manufacturing business. The author focuses on an exponential smoothing method that takes trend and seasonality into account (ETS), known as the Holt-Winters method. The code to generate the forecast in R is very concise …

Continue reading ‘Data Smart, Ch8, Forecasting Seasonal Demand For Replica Swords’ »

Data Smart, Ch7, Predicting Pregnancy with Ensemble Models – this time using R’s caret package

Executive Summary: In chapter 7 of John Foreman’s book, Data Smart, he again predicts the pregnancy status of Retail Mart’s customers based on their shopping habits. This time he uses ensemble techniques, specifically bagging and boosting, to build his predictive models. Since the author also provides the R code for a logistic regression and random …

Continue reading ‘Data Smart, Ch7, Predicting Pregnancy with Ensemble Models – this time using R’s caret package’ »

Data Smart, Ch6, Predicting Customer Pregnancies with Logistic Regression

Executive Summary: In chapter 6 of the book Data Smart, by John Foreman, chief data scientist at Mailchimp, the synthesized challenge is to predict which of a retailer’s customers are pregnant based on a dataset of their shopping records. A logistic regression model is used. The model is trained on the shopping records of 500 …

Continue reading ‘Data Smart, Ch6, Predicting Customer Pregnancies with Logistic Regression’ »

Data Smart, Ch2, Customer Segmentation With R Using K-Medians Clustering

Executive Summary: This is a walk-through of a customer segmentation process using R’s skmeans package to perform k-medians clustering. The dataset examined is the one used in chapter 2 of John Foreman’s book, Data Smart. The approach followed is the one outlined by the author. The major difference is that the author, as per his teaching objectives, built his solution …

Continue reading ‘Data Smart, Ch2, Customer Segmentation With R Using K-Medians Clustering’ »

Being ‘Data Smart’ with Predixion Insight

Figure 1: Gartner Magic Quadrant for Advanced Analytics Platforms, 2015 & Data Smart

Executive Summary: I signed up for a trial of Predixion’s predictive analytics software, Predixion Insight, and decided to test it on some of the data problems posed in John Foreman’s book, Data Smart. Specifically, I examined the application of Insight’s classification, segmentation and forecast models. What I …

Continue reading ‘Being ‘Data Smart’ with Predixion Insight’ »

Review of Data Smart (the book) by John Foreman – It’s Excellent

I highly recommend John Foreman’s book: ‘Data Smart – Using Data Science to Transform Data into Insight’. The author’s approach is unique – he teaches data science skills without teaching programming. His approach works because he limits the newness of each subject item to one dimension, that being the data science technique at hand. Each skill is introduced in the …

Continue reading ‘Review of Data Smart (the book) by John Foreman – It’s Excellent’ »

Zillow opens the kimono – reveals R, Python and Graphlab Create underneath

Meetup: ‘Data Science at Zillow – the Zestimate and Beyond’, at the Python Data Science Meetup, Seattle, Jan 27th, 2015. Slidedeck: http://slidesha.re/1ALRbvU

In brief: Zillow described their 20TB dataset and the technology they use to estimate house values for more than 110 million homes in the US. Zillow uses the statistical programming language R for both prototyping and production. The use of Python in Zillow is …

Continue reading ‘Zillow opens the kimono – reveals R, Python and Graphlab Create underneath’ »

Data Scientist Interview Questions

I was recently interviewed for a Data Scientist position at a post-IPO, cloud, Platform-as-a-Service company. My interviewer was knowledgeable and enthusiastic about data science and machine learning. His interview style was to first ask me to grade my familiarity and competency with a list of machine learning topics on a scale of 1 to 10. Based on …

Continue reading ‘Data Scientist Interview Questions’ »