Data Smart, Ch3, Classifying Tweets using Naive Bayes

Executive Summary: In chapter 3 of the book Data Smart, by John Foreman, chief data scientist at Mailchimp, the author develops a Naive Bayes classifier in Excel to determine whether tweets containing the word ‘mandrill’ are related to Mailchimp’s Mandrill email-transaction app or not. Whereas the author used Excel, we choose to use …

Data Smart, Ch7, Predicting Pregnancy with Ensemble Models – this time using R’s caret package

Executive Summary: In chapter 7 of John Foreman’s book, Data Smart, he again predicts the pregnancy status of Retail Mart’s customers based on their shopping habits. This time he uses ensemble techniques, specifically bagging and boosting, to build his predictive models. Since the author also provides the R code for a logistic regression and random …

Data Smart, Ch6, Predicting Customer Pregnancies with Logistic Regression

Executive Summary: In chapter 6 of the book Data Smart, by John Foreman, chief data scientist at Mailchimp, the synthesized challenge is to predict which of a retailer’s customers are pregnant, based on a dataset of their shopping records. A logistic regression model is used. The model is trained on the shopping records of 500 …

Data Smart, Ch5, Network Graphs and Community Detection

Executive Summary: Chapter 5 of John Foreman’s book Data Smart looks at data that can be arranged as a network graph of related data points. It uses a cluster analysis technique called Modularity Maximization to optimize cluster assignments for the graph data. We can implement the same process succinctly in R, making use of functions in the R igraph and lsa packages. …

Data Smart, Ch2, Customer Segmentation With R Using K-Medians Clustering

Executive Summary: This is a walk-through of a customer segmentation process using R’s skmeans package to perform k-medians clustering. The dataset examined is the one used in chapter 2 of John Foreman’s book, Data Smart. The approach followed is that outlined by the author. The major difference is that the author, as per his teaching objectives, built his solution …

Seattle UseR (R Programming) Meetup, June 4, 2014

This evening’s topic: ‘Extending the R language to the Enterprise with TERR and Spotfire’. To see the meetup agenda, click here. Location: Tibco Software offices, 1700 Westlake Ave., Seattle. Tibco and R – a shared history: Before there was R there was S, the statistical software language developed by John Chambers at Bell Labs in the 1970s. R is an open-source, free implementation of S. S+ is …
