Data Smart, Ch3, Classifying Tweets using Naive Bayes



Executive Summary

In chapter 3 of the book, Data Smart, by John Foreman, chief data scientist at Mailchimp, the author develops a Naive Bayes classifier in Excel to determine whether tweets containing the word ‘mandrill’ relate to Mailchimp’s Mandrill transactional-email app or to something else.

Whereas the author used Excel, we choose to use R’s text mining package, tm, in developing a solution, in order to take advantage of tm’s built-in automated text processing tools.

The book Machine Learning for Hackers, by Drew Conway and John Myles White, is also a useful resource. We utilize elements of that book’s approach to email spam classification here, since it also uses the tm package in its solution.

# load necessary packages
library(tm) # an R text mining package
library(dplyr) # used for easy manipulation of data frames

# Write a function to create a document corpus and a Term-Document Matrix. The preparation involves converting
# all characters to lower case, removing numbers, removing punctuation and removing stopwords.
createTDM <- function(doc){
  corpus <- Corpus(VectorSource(doc)) # creates a corpus from the source document
  control <- list(stopwords = TRUE, removePunctuation = TRUE, removeNumbers = TRUE) # list of text preparation steps to be applied
  tdm <- TermDocumentMatrix(corpus, control)
    # converts the terms to lowercase by default
    # tokenizes the document into words by default
    # removes stopwords. Type 'stopwords()' in the R console to view the list of stopwords
  return(tdm)
}

# Calculate the proportional frequency of occurrence of each word. Arrange the information in a data frame
genProb <- function(record){
  counts <- data.frame(record$dimnames$Terms, record$v) # bind together the data of interest from the tdm: the individual words and their counts
  names(counts) <- c("word", "count") # assign appropriate names to the columns
  counts <- mutate(counts, prob = count/sum(count)) # create a new column, 'prob', using the mutate command from the dplyr package
  return(counts)
}

# Function to return the score of each test tweet based on word frequencies in a training corpus.
# Note: relies on the global variables 'testTweets' and 'notInValue', defined below.
genTweetScores <- function(trainingCounts){
  testTweetScore <- as.numeric(vector())
  for (i in 1:length(testTweets)){
    score <- 0
    test <- createTDM(testTweets[i])
    wordsIn <- intersect(test$dimnames$Terms, trainingCounts$word) # test-tweet words present in the training corpus
    wordsNotIn <- setdiff(test$dimnames$Terms, wordsIn) # test-tweet words absent from the training corpus
    score <- score + log(notInValue)*length(wordsNotIn)
    score <- score + sum(log(trainingCounts$prob[match(wordsIn, trainingCounts$word)]))
    testTweetScore[i] <- score
  }
  return(testTweetScore)
}

# read in and concatenate sets of training tweets into a single character variable
mApp <- paste(readLines("MandrillApp.csv"), collapse = "\n")
other <- paste(readLines("Other.csv"), collapse = "\n")
testTweets <- readLines("testTweets.csv") # this body of individual tweets is not concatenated. Each is analyzed as an individual document

# Assign a proportional frequency value to words in test tweets which were not in the training corpus as per the 
# approach in ML for Hackers. Data Smart used an additive smoothing approach
notInValue = 0.00005 
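The additive smoothing alternative used in Data Smart can be sketched in a few lines. The function and toy counts below are illustrative stand-ins, not objects from the main script:

```r
# Additive (Laplace) smoothing: add 1 to every count so that unseen words
# receive a small, data-dependent probability instead of a fixed constant.
addSmooth <- function(counts, vocabSize){
  (counts + 1) / (sum(counts) + vocabSize)
}

counts <- c(mandrill = 3, email = 1, api = 0) # toy counts; 'api' is unseen
probs <- addSmooth(counts, vocabSize = length(counts))
# every word now has probability > 0, and the probabilities sum to 1
```

Either approach prevents log(0) from appearing in the tweet scores; the fixed-value approach simply decouples the penalty from the corpus size.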

# inspect the stopwords
stopwords()
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
# Generate the term-document-matrices for the training corpora using the createTDM function
mTDM <- createTDM(mApp)
oTDM <- createTDM(other)

# generate the counts and proportional frequencies for words in the training corpora
mCounts <- genProb(mTDM)
oCounts <- genProb(oTDM)

# arrange the rows in descending order of occurrence. Inspect the top 20 rows
head(arrange(mCounts, -prob), 20) 
##                   word count        prob
## 1             mandrill   100 0.065703022
## 2                email    28 0.018396846
## 3  httphelpmandrillcom    22 0.014454665
## 4                  can    20 0.013140604
## 5            mailchimp    20 0.013140604
## 6             sendgrid    18 0.011826544
## 7              request    16 0.010512484
## 8          mandrillapp    14 0.009198423
## 9              details    13 0.008541393
## 10              emails    13 0.008541393
## 11                send    13 0.008541393
## 12       transactional    12 0.007884363
## 13                just    11 0.007227332
## 14                mind    11 0.007227332
## 15         newsletters    11 0.007227332
## 16             service    11 0.007227332
## 17                 use    11 0.007227332
## 18               using    11 0.007227332
## 19                 via    11 0.007227332
## 20                 api    10 0.006570302
head(arrange(oCounts, -prob), 20) 
##                     word count        prob
## 1               mandrill   138 0.112837285
## 2                  spark    25 0.020441537
## 3                youtube    17 0.013900245
## 4                megaman    15 0.012264922
## 5               acapella    14 0.011447261
## 6                    get    12 0.009811938
## 7  httpyoutubehyxkwyjdia     9 0.007358953
## 8                    man     9 0.007358953
## 9         smoothmcgroove     9 0.007358953
## 10                 vídeo     9 0.007358953
## 11                gostei     7 0.005723630
## 12                  just     6 0.004905969
## 13                  mega     6 0.004905969
## 14                   can     5 0.004088307
## 15                  like     5 0.004088307
## 16                   new     5 0.004088307
## 17                   que     5 0.004088307
## 18                   via     5 0.004088307
## 19              ccpgames     4 0.003270646
## 20            freebooted     4 0.003270646
# Classify the tweets using their word content probabilities for each of the training corpora
classResults <- data.frame(cbind(genTweetScores(mCounts), genTweetScores(oCounts)))
names(classResults) <- c("mScores", "oScores")
classResults <- mutate(classResults, Classification = ifelse(mScores > oScores, "App", "Other"))
##       mScores    oScores Classification
## 1   -60.77438  -95.27575            App
## 2   -60.71495  -97.76066            App
## 3   -37.97522  -48.90482            App
## 4  -114.52122 -167.74667            App
## 5   -83.35133 -118.75262            App
## 6   -48.28697  -71.50622            App
## 7   -29.46878  -39.00133            App
## 8   -35.94576  -39.34240            App
## 9   -95.49885 -123.72886            App
## 10  -54.37375  -79.22790            App
## 11  -32.43307  -24.51166          Other
## 12  -62.14354  -47.63061          Other
## 13  -49.66434  -51.69925            App
## 14  -36.76512  -37.39190            App
## 15  -32.43307  -29.09785          Other
## 16  -52.24005  -24.94163          Other
## 17  -88.17968  -80.89603          Other
## 18  -86.55568  -60.70636          Other
## 19  -69.47132  -71.50622            App
## 20  -91.85400  -85.03120          Other
# The classifier classifies all Mandrill-App related tweets correctly (items 1-10 in the classResults list).
# Three of the unrelated tweets (items 11-20 in the classResults list) are misclassified as 'App', so we can see that there is room for improvement.

Data Smart, Ch 4, Linear Programming


Executive Summary

In Chapter 4 of the book Data Smart, by John Foreman, the author uses Excel’s linear programming tool, Solver, to solve an optimization problem: minimizing the raw-materials cost for a commercial orange juice blend. Consistent with this series, we use R to solve the same problem, invoking R’s lpSolve package.


The problem posed is to optimize the allocation of orange juice orders among a pool of suppliers worldwide, keeping costs to a minimum while still meeting taste, color, acidity, supply and contractual constraints.

There are 11 potential suppliers for each of the three months (Jan – March) for which orders are placed. Our objective function has 33 variables, each variable representing an order quantity placed with each vendor for each of the three months. We identify and express 41 constraints for the problem.
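The shape of an lpSolve call can be seen in a hypothetical two-variable toy problem (the numbers here are illustrative only, not from the book’s dataset):

```r
library(lpSolve)

# Toy problem: minimize 2x + 3y subject to x + y >= 10 and x <= 6
# (lpSolve assumes all variables are >= 0 by default)
toy <- lp(direction = "min",
          objective.in = c(2, 3),
          const.mat = matrix(c(1, 1,
                               1, 0), nrow = 2, byrow = TRUE),
          const.dir = c(">=", "<="),
          const.rhs = c(10, 6))
toy$objval   # minimum cost: 24
toy$solution # optimal order quantities: x = 6, y = 4
```

The full problem below follows exactly this pattern, only with a 33-element objective vector and a 41 x 33 constraint matrix.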

The source data and Excel solutions are available at the book’s website.

The R code

# Load the linear programming package
library(lpSolve)
# Read source data
objF <- read.csv("objF.csv") # contains the objective function
objF <- objF[["objF"]] # converts objF to a vector
supply <- read.csv("OJ.csv")
# Create constraint matrix
constraintMat <- as.matrix(supply[1:41,1:33]) # 41 rows, 33 variables
# Create constraintDir
constraintDir <- as.vector(supply[["Eqn"]]) # separates out the constraint function e.g. ">=", "<=", ...
# Create constraintRHS
constraintRHS <- as.vector(supply[["RHS"]]) # separates out the constraint values
# Run the linear programming solver. 
optBlend <- lp(direction = 'min', objective.in = objF, const.mat = constraintMat,
               const.dir = constraintDir, const.rhs = constraintRHS)
optBlend # returns the minimum of the objective function

Success: the objective function minimum is $1,227,560. This is the same value returned by Excel’s Solver.

optBlend$solution # returns the order quantities placed (1000 gals/month) with each of the 11 suppliers,
                  # for Jan (1-11), Feb (12-22) and Mar (23-33)

[1] 0.0 13.5 240.0 0.0 0.0 60.9 0.0 0.0 35.0 174.6 76.0 0.0 0.0 240.0 0.0
[16] 75.5 12.4 0.0 114.2 132.3 0.0 25.5 0.0 0.0 280.0 111.2 8.5 0.0 0.0 53.8
[31] 132.6 35.4 78.5

Summary & Conclusion

It is convenient to have the ability to run linear programming optimizations in R. lpSolve does not specify a limit on the number of variables which can be presented to the solver; rather, memory is cited as the limiting factor. By contrast, Excel’s Solver is reportedly limited to 200 or fewer variables for most installations.

Other tools are available and may be better suited for heavy duty optimization problems with high variable counts. I have used FICO XPRESS Optimization Suite in the past and found it to be excellent – powerful and easy to use. The author highlighted two good commercial solutions based on his experience. One is the cloud-based Gurobi and the other is the enterprise-focused CPLEX, from IBM.

Happy optimizing!

Data Smart, Ch8, Forecasting Seasonal Demand For Replica Swords


Executive Summary

In chapter 8 of John Foreman’s book, Data Smart, he turns to forecasting demand for a fictional replica sword manufacturing business. The author focuses on an exponential smoothing method which takes trend and seasonality into account (ETS), known as the Holt-Winters method.

The code to generate the forecast in R is very concise, and the author provides it in Chapter 10. We try to add value here by providing code for useful supporting and validation information, such as the generation of the auto-correlation function (ACF).

A fantastic online resource for time-series forecasting is the e-text “Forecasting: Principles and Practice” by Prof. Rob Hyndman, the author of the R forecast package, and George Athanasopoulos.

# make use of the forecast package
library(forecast)
# read in the historical sword sales data
sourceData <- read.csv("./SwordForecasting.csv")
swordData <- ts(sourceData[,2], frequency = 12) # converts data to a time-series object with monthly seasonality

# plot the historical sword sales data
plot(swordData, xlab = "Yearly Intervals", ylab = "Demand")

# generate a 12-month, Holt-Winters forecast (additive linear trend, multiplicative seasonality forecast)
hwForecast <- forecast(ets(swordData, model="ZAM", damped = FALSE), h=12)

par(col.sub = "Grey") # sets the sub-title color to grey
plot(hwForecast, xlab = "Yearly Intervals", ylab = "Demand",
     main = "Sword Demand Forecast For Next 12 Months ",
     sub = "based on Holt-Winters method, 80% & 95% prediction intervals shown", xlim=c(1,5))


# Calculate the auto-correlations and plot the correlogram
autoCF <- acf(hwForecast$residuals, lag.max = 12)



The ACF performance of the forecast residuals is acceptable. The R code required to generate the forecast and the ACF data, using the R ‘forecast’ package, is very concise. Prof. Hyndman’s ebook, referenced above, is a vital asset for understanding forecasting terms and learning how to forecast with R.

Data Smart, Ch7, Predicting Pregnancy with Ensemble Models – this time using R’s caret package


Executive Summary

In chapter 7 of John Foreman’s book, Data Smart, he again predicts the pregnancy status of Retail Mart’s customers based on their shopping habits. This time he uses ensemble techniques, specifically bagging and boosting, to build his predictive models.

Since the author also provides the R code for a logistic regression and random forest solution in a later chapter of his book (chapter 10), we take a slightly different approach here. We use the well-known and powerful R caret predictive modeling framework to implement the bagging, boosting and random forest models.

The boosting model proved to have a marginally better AUC score. More importantly, we will see dramatic differences in computation time: the boosting model was more than 6 times faster than the random forest method and more than 13 times faster than the bagging method.

The R code

# Load required packages
library(caret) # See the excellent documentation by the package developer, Max Kuhn
library(ROCR) # required for ROC curve
# Read in the training and test data
rm_train <- read.csv(".\\ensemble_training.csv", header = TRUE, check.names = FALSE) # the training data
rm_train$PREGNANT <- factor(rm_train$PREGNANT, labels = c("Yes", "No")) # using labels "1" and "0" can give rise to problems in caret
rm_test <- read.csv(".\\ensemble_test_orig.csv", header = TRUE, check.names = FALSE) # the test data
rm_test$PREGNANT <- factor(rm_test$PREGNANT, labels = c("Yes", "No"))
# Get ready to set up and use caret
# Set the control parameters for the training step
# 'classProbs' is required to return probability of outcome (pregnant and not pregnant in this case)
# summaryFunction is set to return outcome as a set of binary classification results
# ten-fold cross validation is used by default
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
# Train a random forest model using the rm_train data
# method is set to 'rf', the designator for the randomForest package
# metric is set to 'ROC', so that the output data is suitable for generating an ROC curve
timeStart <- proc.time() # measure computation time
rf_fit <- train(PREGNANT ~., data = rm_train, method = "rf", trControl = ctrl, metric = "ROC", type = "Classification")
proc.time() - timeStart 
# user time and system time were measured as 276 secs and 6.06 secs respectively 
plot(rf_fit) # plots the fitted results. The best AUC is achieved with 2 random predictors
rf_fit # prints the model information


Fig. 1 Area-Under-Curve scores versus number of variables selected (mtry). mtry = 2 had the best performance, AUC = 0.894.

The information returned by rf_fit was:

Random Forest: 1000 samples, 19 predictors, 2 classes: ‘Yes’, ‘No’

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 900, 900, 900, 900, 900, 900, …

Resampling results across tuning parameters:

mtry   ROC        Sens       Spec       ROC SD      Sens SD     Spec SD
  2    0.8939133  0.9033333  0.7360000  0.03381877  0.03679705  0.06589438
 10    0.8676067  0.8826667  0.7226667  0.03944552  0.04059330  0.06051123
 19    0.8621867  0.8806667  0.7106667  0.03725257  0.04050827  0.05982350

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

plot(varImp(rf_fit)) # returns variable importance


# Predict on the test set using the random forest model created with the training set
rf_class <- predict(rf_fit, newdata = rm_test) # classification outcome required for the confusion matrix
rf_pred <- predict(rf_fit, newdata = rm_test, type = "prob") # predicted data required for ROC curve
# Generate a confusion matrix for the random forest prediction results
rf_cm <- confusionMatrix(rf_class, rm_test$PREGNANT)

Confusion Matrix and Statistics
           Reference
Prediction  Yes   No
       Yes  843   13
       No    97   47
Accuracy : 0.89
95% CI : (0.8689, 0.9087)
No Information Rate : 0.94
P-Value [Acc > NIR] : 1
Kappa : 0.4109
Mcnemar's Test P-Value : 2.498e-15
Sensitivity : 0.8968
Specificity : 0.7833
Pos Pred Value : 0.9848
Neg Pred Value : 0.3264
Prevalence : 0.9400
Detection Rate : 0.8430
Detection Prevalence : 0.8560
Balanced Accuracy : 0.8401
'Positive' Class : Yes


We do the same thing again for the gbm boosting package:


# Train a boosted classification tree model using caret and the 'gbm' package
# measure the process time
timeStart <- proc.time()
gbm_fit <- train(PREGNANT ~., data = rm_train, method = "gbm", trControl = ctrl, metric = "ROC", verbose = FALSE) # using the same trControl settings
proc.time() - timeStart 
# process times were measured as 42 secs (user) and 0.17 secs (system)
plot(gbm_fit) # plots gbm_fit
gbm_fit # returns gbm_fit



Fig. 2 AUC variations by Max. Tree Depth and number of boosting iterations

Stochastic Gradient Boosting

1000 samples, 19 predictors, 2 classes: ‘Yes’, ‘No’

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 900, 900, 900, 900, 900, 900, …

Resampling results across tuning parameters:

interaction.depth  n.trees  ROC        Sens       Spec       ROC SD
        1            50     0.8866733  0.8980000  0.7120000  0.02918051
        1           100     0.8981333  0.8993333  0.7420000  0.02639375
        1           150     0.9013867  0.8960000  0.7566667  0.02464031
        2            50     0.8958800  0.8946667  0.7433333  0.02869702
        2           100     0.9023000  0.8966667  0.7586667  0.02630462
        2           150     0.9023867  0.8960000  0.7613333  0.02688160
        3            50     0.8999200  0.8946667  0.7546667  0.02664430
        3           100     0.9040067  0.8973333  0.7593333  0.02380261
        3           150     0.9024867  0.8926667  0.7640000  0.02470665

Tuning parameter ‘shrinkage’ was held constant at a value of 0.1
ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 100, interaction.depth = 3
and shrinkage = 0.1.

plot(varImp(gbm_fit)) # returns variable importance


# Predict on the test set using the gbm model
gbm_class <- predict(gbm_fit, newdata = rm_test) # returns classifications
gbm_pred <- predict(gbm_fit, newdata = rm_test, type = "prob") # returns probabilities
# Generate a confusion matrix for the gbm prediction results
gbm_cm <- confusionMatrix(gbm_class, rm_test$PREGNANT)

Confusion Matrix and Statistics for gbm:

           Reference
Prediction  Yes   No
       Yes  835   13
       No   105   47

Accuracy : 0.882
95% CI : (0.8604, 0.9013)
No Information Rate : 0.94
P-Value [Acc > NIR] : 1

Kappa : 0.391
Mcnemar’s Test P-Value : <2e-16

Sensitivity : 0.8883
Specificity : 0.7833
Pos Pred Value : 0.9847
Neg Pred Value : 0.3092
Prevalence : 0.9400
Detection Rate : 0.8350
Detection Prevalence : 0.8480
Balanced Accuracy : 0.8358

‘Positive’ Class : Yes


We go through the process one last time for the bagging method, ‘bag’:

# Train a bagging model using caret and the 'bag' package
# We need to include a bagControl() argument in the train function, which allows choice of tree types
timeStart <- proc.time()
bag_fit <- train(PREGNANT ~., data = rm_train, method = "bag", trControl = ctrl, metric = "ROC", verbose = FALSE,
                 bagControl= bagControl(fit = ctreeBag$fit, pred = ctreeBag$pred, aggregate = ctreeBag$aggregate ),
                 tuneGrid = data.frame(vars = seq(1, 15, by = 2)))
proc.time() - timeStart 
# processing times measured as 581 secs (user) and 1.66 secs (system) respectively
plot(bag_fit) # plots bag_fit
bag_fit # returns bag_fit



Fig. 3 AUC performance vs number of randomly selected predictors, with bag package

Bagged Model

1000 samples, 19 predictors, 2 classes: ‘Yes’, ‘No’

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 900, 900, 900, 900, 900, 900, …

Resampling results across tuning parameters:

vars   ROC        Sens       Spec       ROC SD      Sens SD     Spec SD
  1    0.6315533  0.8380000  0.3760000  0.21244297  0.22732932  0.31668106
  3    0.8099867  0.8486667  0.5486667  0.11276167  0.13490056  0.18093118
  5    0.8508800  0.8500000  0.6206667  0.03930400  0.15229282  0.15926411
  7    0.8652133  0.8766667  0.6633333  0.03007760  0.07720699  0.08355602
  9    0.8789867  0.9000000  0.6740000  0.03500573  0.04954970  0.08822385
 11    0.8747733  0.8760000  0.7040000  0.03370725  0.08211010  0.08556506
 13    0.8767667  0.8806667  0.7160000  0.03239221  0.05570447  0.07582193
 15    0.8822867  0.8846667  0.7100000  0.03643544  0.05243935  0.06982737

ROC was used to select the optimal model using the largest value.
The final value used for the model was vars = 15.

plot(varImp(bag_fit)) # returns variable importance


# Predict on the test set using the bagging model
bag_class <- predict(bag_fit, newdata = rm_test)
bag_pred <- predict(bag_fit, newdata = rm_test, type = "prob")
# Generate a confusion matrix for the bagging prediction results
bag_cm <- confusionMatrix(bag_class, rm_test$PREGNANT)

Confusion Matrix and Statistics for bag:

           Reference
Prediction  Yes   No
       Yes  800   12
       No   140   48

Accuracy : 0.848
95% CI : (0.8242, 0.8697)
No Information Rate : 0.94
P-Value [Acc > NIR] : 1

Kappa : 0.3258
Mcnemar’s Test P-Value : <2e-16

Sensitivity : 0.8511
Specificity : 0.8000
Pos Pred Value : 0.9852
Neg Pred Value : 0.2553
Prevalence : 0.9400
Detection Rate : 0.8000
Detection Prevalence : 0.8120
Balanced Accuracy : 0.8255

‘Positive’ Class : Yes

Here is the R code used to generate Fig. 4, the plot comparing the ROC curves of each of the models.

# Generate an ROC curve for the rf method
predRF <- prediction(rf_pred[,1], rm_test$PREGNANT)
perfRF <- performance(predRF, "tpr", "fpr")
plot(perfRF, main = "ROC curves for randomForest, gbm and bag models")
# Generate an ROC curve for the gbm method
pred_gbm <- prediction(gbm_pred[,1], rm_test$PREGNANT)
perf_gbm <- performance(pred_gbm, "tpr", "fpr")
plot(perf_gbm, add = TRUE, col = "blue")
#Generate an ROC curve for the 'bag' method
pred_bag <- prediction(bag_pred[,1], rm_test$PREGNANT)
perf_bag <- performance(pred_bag, "tpr", "fpr")
plot(perf_bag, add = TRUE, col = "red")
# Add legends to the plot
legend("right", legend = c("randomForest", "gbm", "bag"), bty = "n", cex = 1, lty = 1,
       col = c("black", "blue", "red"))


Fig. 4 ROC curve comparison for the random forest, bagging and boosting models

Data Smart, Ch6, Predicting Customer Pregnancies with Logistic Regression


Executive summary

In chapter 6 of the book, Data Smart, by John Foreman, Chief Data Scientist at Mailchimp, the synthesized challenge is to predict which of a retailer’s customers are pregnant, based on a dataset of their shopping records.

A logistic regression model is used. The model is trained on the shopping records of 500 pregnant customers and 500 non-pregnant customers. The model is then tested on a dataset of 1000 different customers, each of whose pregnancy status is known.
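The core mechanics can be sketched with simulated stand-in data (not the book’s dataset): fitting a logistic regression in R is a one-line glm call, and prediction returns probabilities.

```r
set.seed(42)
# Simulated stand-in data: one predictor and a binary outcome that depends on it
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(2 * x))

toyModel <- glm(y ~ x, family = binomial)            # fit the logistic regression
toyProbs <- predict(toyModel, type = "response")     # predicted probabilities of y = 1
range(toyProbs) # all predictions lie strictly between 0 and 1
```

The real model below is fitted the same way, with 19 shopping-record predictors in place of the single simulated one.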

The author develops a logistic regression solution using Excel. We are free to use R to solve the problem, and we do so with the code here. The dataset is available at the book’s website.

The R Code

# load the required libraries
library(ROCR) # used to create the ROC curve
library(caret) # used to generate a confusion matrix

# A function to classify pregnancy outcomes based on probabilities of being pregnant. The arguments are: the threshold
# probability for a positive pregnancy prediction, the predicted pregnancy probabilities of each customer and the known
# pregnancy status of the customers
manualClass <- function(linThreshold, predictedPreg, knownResult){
  pregPredClass <- predictedPreg # the predicted probability of pregnancy
  pregPredClass[pregPredClass >= linThreshold] <- 1 # classifies as 1 if at or above threshold
  pregPredClass[pregPredClass < linThreshold] <- 0 # classifies as 0 if below threshold
  linPregClass <- cbind(pregPredClass, knownResult) # stores predicted classification & known result
  return(linPregClass)
}

# Function to calculate the confusion matrix parameters for the assigned linThreshold value
confusMatrix <- function(pregPredClass, knownResult){
  pregPredClass <- as.factor(pregPredClass) # confusionMatrix() requires factor variables
  levels(pregPredClass) <- c("notPregnant", "Pregnant") # assigns level labels
  testResult <- as.factor(knownResult) # confusionMatrix() requires factor variables
  levels(testResult) <- c("notPregnant", "Pregnant")
  confusionMat <- confusionMatrix(pregPredClass, testResult, positive = "Pregnant")
  return(confusionMat)
}

# Read in the training data and test set data
trainData <- read.csv('./retailmart.csv')
testData <- read.csv('./testSet.csv')

# Isolate the test set result column (#20) from the predictor variables
testResult <- testData[, 20] 
testNoResult <- testData[, -20]

# Train a logistic regression model
pTrainLogit <- glm(PREGNANT ~., data = trainData, family = binomial)

# Predict pregnancies for the test set based on the trained model
newdata <- testData
pregPredLogit <- predict.glm(pTrainLogit, newdata, type = "response") # probability values of pregnancy

# Manually classify the logistic regression predictions. The threshold is set deliberately high because false positives are particularly undesirable
pregPredClass <- manualClass(0.96, pregPredLogit, testResult)

# Generate the Confusion Matrix for the prediction results for the threshold value of 0.96
confusionMat <- confusMatrix(pregPredClass[,1], testResult)
## Confusion Matrix and Statistics
##              Reference
## Prediction    notPregnant Pregnant
##   notPregnant         936       37
##   Pregnant              4       23
##                Accuracy : 0.959           
##                  95% CI : (0.9448, 0.9704)
##     No Information Rate : 0.94            
##     P-Value [Acc > NIR] : 0.004943        
##                   Kappa : 0.5105          
##  Mcnemar's Test P-Value : 5.806e-07       
##             Sensitivity : 0.3833          
##             Specificity : 0.9957          
##          Pos Pred Value : 0.8519          
##          Neg Pred Value : 0.9620          
##              Prevalence : 0.0600          
##          Detection Rate : 0.0230          
##    Detection Prevalence : 0.0270          
##       Balanced Accuracy : 0.6895          
##        'Positive' Class : Pregnant         

We can choose an alternative threshold (0.5) and show its corresponding confusion matrix for comparison purposes. At this threshold there are more true positives; however, there are also many more false positives, which are undesirable.

pregPredClassAlt <- manualClass(0.5, pregPredLogit, testResult)
confusionMatAlt <- confusMatrix(pregPredClassAlt[,1], testResult)
# Plot the ROC curve using the ROCR package
predLogit <- prediction(pregPredLogit, testResult)
perfLogit <- performance(predLogit, "tpr", "fpr")
plot(perfLogit)




The results agree with those generated by the logistic regression model developed by the author using Excel.

Data Smart, Ch5, Network Graphs and Community Detection


Executive Summary

Chapter 5 of John Foreman’s book Data Smart looks at data which can be arranged as a network graph of related data points. It uses a cluster analysis technique called Modularity Maximization to optimize cluster assignments for the graph data.

We can implement the same process succinctly in R, making use of functions in the R igraph and lsa packages. The modularity score returned, 0.549, is marginally better than the value of 0.546 calculated using the Excel-based process and is the same as that calculated in the book using Gephi, the well-known open source graph analysis tool.

Process Outline

The same wine promotion dataset, as used in Chapter 2 to explain and develop the k-means clustering algorithms, is used here.

The author implements the following steps to perform community detection (cluster analysis) on the data in Excel:

  1. Prepare the data for graph analysis
  2. Calculate the cosine similarity in wine promotion purchases for each customer pairing. The customers form the graph vertices and the cosine similarity defines the edge connections in the graph
  3. Prune the graph using the r-Neighborhood technique, keeping only edges with similarity scores at or above the 80th percentile
  4. Calculate new scores for the inter-customer relationships, taking into account the expectations of their being connected had the graph been connected randomly
  5. Calculate graph modularity arising from cluster assignments of the customers. Adjust cluster assignments to maximize graph modularity.
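Step 2, the cosine similarity between two customers’ purchase vectors, can be sketched with toy data (these vectors are illustrative, not drawn from the wine dataset):

```r
# Cosine similarity: the dot product of two purchase vectors divided by the
# product of their lengths. 1 = identical buying patterns, 0 = no overlap.
cosineSim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

custA <- c(1, 0, 1, 1) # toy purchase indicators across four promotions
custB <- c(1, 1, 0, 1)
cosineSim(custA, custB) # shared purchases in promotions 1 and 4 -> 2/3
```

The lsa package’s cosine() function, used in the code below, computes exactly this quantity for every pair of columns in a matrix at once.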

A note on graph modularity

Graph modularity works by assessing a probability-of-connection score for each pair of nodes in a network, taking into account the degree (number of connections) of each node in the graph. When clusters are formed, the graph modularity score is incremented for each pair within a cluster that has a connection between them; the modularity score is penalized for each pair within a cluster that has no connection between them. The process of modularity maximization adjusts the number of clusters and the cluster assignments to find the optimum graph modularity score.
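The scoring just described is what igraph’s modularity() function computes. A hypothetical “two-triangle” graph (not the wine data) shows good cluster assignments being rewarded:

```r
library(igraph)

# Two triangles (vertices 1-3 and 4-6) joined by a single bridge edge 3-4
g <- graph(edges = c(1,2, 1,3, 2,3, 4,5, 4,6, 5,6, 3,4), directed = FALSE)

# Assigning each triangle to its own cluster scores well...
modularity(g, membership = c(1, 1, 1, 2, 2, 2)) # ~0.357

# ...while lumping every vertex into one cluster scores zero
modularity(g, membership = c(1, 1, 1, 1, 1, 1)) # 0
```

Modularity maximization searches over such membership vectors for the assignment with the highest score.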

The R Code

# R script to implement Community Detection on the wine dataset from Chapter 5 of the book Data Smart, by John Foreman

# load the R packages
library(igraph) # R package for network analysis and visualization
library(lsa) # required for the cosine() function, used to generate a similarity matrix
# read in the wine data, convert the data to a matrix, calculate similarities between vertices
wineData <- read.csv(".\\WineNetwork.csv", header = TRUE, check.names = FALSE)
wineNW <- wineData[, -c(1,2,3,4,5,6,7)] # remove the metadata columns
wineNW[is.na(wineNW)] <- 0 # replace NA entries with 0
wineNW_mat <- as.matrix(wineNW) # convert the data.frame to a matrix
cosineSim_mat <- cosine(wineNW_mat) # calculate the cosine similarities of all the customers
diag(cosineSim_mat) <- 0 # assign self-similarities (the diagonal in the matrix) to be of value 0
# prune the data using the r-neighborhood approach, create the graph object, plot the graph
cosineSim_mat[(cosineSim_mat < 0.5)] <- 0 # prune the graph. Similarities of strength <0.5 are set to 0
cosineSim_mat[(cosineSim_mat >= 0.5)] <- 1
cosineSim_graph <- graph.adjacency(cosineSim_mat, mode = 'undirected', weighted = TRUE) # convert the matrix to an igraph object
plot(cosineSim_graph)


Fig. 1 A view of the pruned graph before modularity maximization is applied

# create a matrix of inter-vertex scores
membershipVector <- numeric(100) # creates a vector of length 100. Each element is set equal to zero. 
modMatrix <- mod.matrix(cosineSim_graph, membership = membershipVector) # generates a matrix of inter-vertex scores
# use the function in igraph to optimize the graph modularity 
wineMod_opt <- multilevel.community(cosineSim_graph) # selects optimum cluster assignments (the multi level algorithm, per the output below)
modularity(cosineSim_graph, membership = wineMod_opt$membership) # calculates final graph modularity
for (i in 1:max(wineMod_opt$membership)) {
  print(wineMod_opt$names[wineMod_opt$membership == i]) # print list of members belonging to cluster i
} # prints out members by cluster assignment

Graph community structure calculated with the multi level algorithm
Number of communities (best split): 6
Modularity (best split): 0.5489076

Community #1
[1] “Anderson” “Bell” “Campbell” “Cook” “Cox” “Flores” “Gray”
[8] “Jenkins” “Johnson” “Moore” “Morris” “Phillips” “Rodriguez” “Russell”
[15] “Smith”

Community #2
[1] “Adams” “Bailey” “Brown” “Carter” “Cruz” “Diaz” “Green”
[8] “Hernandez” “Hill” “Hughes” “James” “King” “Lewis” “Murphy”
[15] “Myers” “Perez” “Perry” “Rivera” “Robinson” “Ross” “Stewart”
[22] “Taylor” “Walker” “Watson”

Community #3
[1] “Allen” “Baker” “Barnes” “Clark” “Cooper” “Evans” “Foster”
[8] “Garcia” “Gomez” “Gonzalez” “Harris” “Kelly” “Lee” “Long”
[15] “Lopez” “Miller” “Morales” “Nguyen” “Powell” “Price” “Ramirez”
[22] “Reed” “Reyes” “Richardson” “Roberts” “Rogers” “Sanchez” “Sanders”
[29] “Scott” “Thomas” “Thompson” “Turner” “Ward” “White” “Williams”
[36] “Wood” “Wright” “Young”

Community #4
[1] “Parker”

Community #5
[1] “Bennett” “Brooks” “Edwards” “Gutierrez” “Jones” “Morgan” “Nelson”
[8] “Ortiz” “Sullivan” “Torres” “Wilson”

Community #6
[1] “Butler” “Collins” “Davis” “Fisher” “Hall” “Howard” “Jackson” “Martin”
[9] “Martinez” “Mitchell” “Peterson”

# evaluate number of orders by cluster, for each offer
sumOrdersByCluster <- (aggregate(t(wineNW),by = list(wineMod_opt$membership),sum))
wineConclusion <- cbind(t(sumOrdersByCluster[,-1]), wineData)
# print out the top features for each cluster
for (i in 1:max(wineMod_opt$membership)) {
  print(i)
  print(head(wineConclusion[order(wineConclusion[, i], decreasing = TRUE), 1:12]))
}

Cluster #1 Likes Pinot Noir
 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%) Origin
 V24 12 0 0 0 0 0 24 September Pinot Noir 6 34 Italy
 V26 11 0 3 0 0 1 26 October Pinot Noir 144 83 Australia
 V17 7 0 0 0 0 0 17 July Pinot Noir 12 47 Germany
 V2 5 0 0 0 0 5 2 January Pinot Noir 72 17 France
 V12 1 1 0 0 0 3 12 May Prosecco 72 83 Australia
 V16 1 0 3 1 0 0 16 June Merlot 72 88 California

Cluster #2 prefers low volume purchases
 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%) Origin
 V30 0 15 3 0 1 3 30 December Malbec 6 54 France
 V7 0 14 5 0 0 0 7 March Prosecco 6 40 Australia
 V29 0 14 0 1 2 0 29 November Pinot Grigio 6 87 France
 V18 0 11 1 0 2 0 18 July Espumante 6 50 Oregon
 V8 0 7 2 0 11 0 8 March Espumante 6 45 South Africa
 V13 0 5 0 0 1 0 13 May Merlot 6 43 Chile

Cluster # 3 appears to like sparkling wines
 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%) Origin
 V22 0 0 14 0 1 6 22 August Champagne 72 63 France
 V31 0 0 14 1 1 1 31 December Champagne 72 89 France
 V6 0 0 11 0 1 0 6 March Prosecco 144 86 Chile
 V4 0 0 10 0 1 1 4 February Champagne 72 48 France
 V9 0 0 10 0 0 0 9 April Chardonnay 144 57 Chile
 V14 0 0 9 0 0 0 14 June Merlot 72 64 Chile

Cluster #4 consists of a single customer, Parker, who is isolated in the graph
 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%) Origin
 V11 0 0 5 1 1 6 11 May Champagne 72 85 France
 V16 1 0 3 1 0 0 16 June Merlot 72 88 California
 V20 0 0 5 1 0 0 20 August Cabernet Sauvignon 72 82 Italy
 V29 0 14 0 1 2 0 29 November Pinot Grigio 6 87 France
 V31 0 0 14 1 1 1 31 December Champagne 72 89 France
 V1 0 0 5 0 0 5 1 January Malbec 72 56 France

Cluster #5 dominated by Espumante 
 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%) Origin
 V8 0 7 2 0 11 0 8 March Espumante 6 45 South Africa
 V3 0 0 4 0 2 0 3 February Espumante 144 32 Oregon
 V18 0 11 1 0 2 0 18 July Espumante 6 50 Oregon
 V29 0 14 0 1 2 0 29 November Pinot Grigio 6 87 France
 V4 0 0 10 0 1 1 4 February Champagne 72 48 France
 V6 0 0 11 0 1 0 6 March Prosecco 144 86 Chile

 Cluster #6 likes French wines
 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%) Origin
 V11 0 0 5 1 1 6 11 May Champagne 72 85 France
 V22 0 0 14 0 1 6 22 August Champagne 72 63 France
 V1 0 0 5 0 0 5 1 January Malbec 72 56 France
 V2 5 0 0 0 0 5 2 January Pinot Noir 72 17 France
 V28 0 1 1 0 0 4 28 November Cabernet Sauvignon 12 56 France
 V12 1 1 0 0 0 3 12 May Prosecco 72 83 Australia

The graph analysis package, igraph, is a powerful resource, well capable of implementing and scaling up the modularity maximization process developed by the author using Excel in Data Smart.



Data Smart, Ch2, Customer Segmentation With R Using K-Medians Clustering


Executive Summary

This is a walk-through of a customer segmentation process using R’s skmeans package to perform k-medians clustering. The dataset examined is that used in chapter 2 of John Foreman's book, Data Smart.

The approach followed is that outlined by the author. The major difference is that the author, as per his teaching objectives, built his solution manually in Excel. On the other hand, we are free to take advantage of R’s off-the-shelf packages to solve the problem and we do so here.

This walk-through includes R function calls to calculate silhouette values as measures of segmentation effectiveness.

A note on the silhouette

The silhouette calculation for an assigned data point requires three summary statistics:

  • A – the average distance to the data points in its nearest neighboring cluster
  • B – the average distance to the data points in its assigned cluster
  • C – the maximum of A and B

The silhouette for a data point is (A-B)/C. It is a measure of the robustness of the assignment. A value close to 1 indicates that a data point is very well suited to its cluster. A value close to (or less than) zero indicates ambiguity about the cluster assignment.

The overall silhouette value is the mean of the silhouette values of the individual data points.
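The calculation can be sketched for a single data point in base R (the cluster package's silhouette() function, used below, does this for every point at once; the distances here are hypothetical):

```r
# Silhouette of one data point, following the A/B/C definition above.
near <- c(3.0, 2.8, 3.3)  # distances to points in the nearest other cluster
own  <- c(1.0, 1.2, 0.9)  # distances to points in the assigned cluster

A <- mean(near)           # average distance to the nearest neighboring cluster
B <- mean(own)            # average distance within the assigned cluster
s <- (A - B) / max(A, B)  # silhouette: close to 1 means a robust assignment
s
```

Because this point is much closer to its own cluster than to its neighbor, its silhouette is well above zero.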

R Code

# Load R packages
library(skmeans) # used for k-medians algorithm
library(cluster) # required for calling the silhouette function
library(dplyr) # an excellent tool for summarizing, ordering and filtering data features
# Read in the sales promotion dataset. Remove meta columns, convert NA values to zeroes
kmcDF <- read.csv(".\\wineKMC_matrix.csv") #reads in data as a dataframe
wineDF <- t(kmcDF[,-c(1,2,3,4,5,6,7)]) # new variable, metadata columns removed, dataframe transposed
wineDF[is.na(wineDF)] <- 0 # replaces blank (NA) entries with zeros
wineMatrix <- as.matrix(wineDF) # converts the dataframe to type matrix
# Segment the customers into 5 clusters
partition <- skmeans(wineMatrix, 5) 
# Look at the segmentation outcome summary
partition # returns a summary statement for the process
partition$cluster # returns a vector showing cluster assignment for each customer
# Create a vector of customer names for each cluster
cluster_1 <- names(partition$cluster[partition$cluster == 1])
cluster_2 <- names(partition$cluster[partition$cluster == 2])
cluster_3 <- names(partition$cluster[partition$cluster == 3])
cluster_4 <- names(partition$cluster[partition$cluster == 4])
cluster_5 <- names(partition$cluster[partition$cluster == 5])
# Examine one of the clusters, as an example

The skmeans object, partition, informs us that the result of the segmentation is:
## a hard spherical k-means partition of 100 objects into 5 classes.
## Class sizes: 14, 18, 16, 29, 23

The customers assigned to cluster_1 are shown here. You may see variation in cluster assignments and sizes from segmentation run to segmentation run.
## [1] “Butler” “Clark” “Collins” “Cooper” “Davis” “Fisher”
## [7] “Garcia” “Gomez” “Hall” “Howard” “Jackson” “Lopez”
## [13] “Martin” “Martinez” “Parker” “Powell” “Reed” “Sanchez”
## [19] “Sanders” “Thomas” “Thompson” “Ward” “White”

# Examine characteristics of each cluster using the aggregation function to sum the number of purchases for each promotion by cluster
clusterCounts <- t(aggregate(wineDF, by=list(partition$cluster), sum)[,2:33]) # taken directly from Data Smart
clusterCounts <- cbind(kmcDF[,c(1:7)], clusterCounts) # add back the meta data columns
# The arrange function in the dplyr package is used to view the characteristics of the different clusters
View(arrange(clusterCounts, -clusterCounts$"1"))
View(arrange(clusterCounts, -clusterCounts$"2"))
View(arrange(clusterCounts, -clusterCounts$"3"))
View(arrange(clusterCounts, -clusterCounts$"4"))
View(arrange(clusterCounts, -clusterCounts$"5"))

Some sample results are shown:


Example 1: cluster_1 has a liking for French wines


Example 2: cluster_2 favors low-volume purchases


Example 3: cluster_5 members like to drink pinot noir

Computing the silhouette with R

# Compare performance of the k=5 clustering process with the k=4 clustering using the silhouette function. The closer the value is to 1, the better.
silhouette_k5 <- silhouette(partition)

## Silhouette of 100 units in 5 clusters from silhouette.skmeans(x = partition) :
## Cluster sizes and average silhouette widths:
## 23 21 17 23 16
## 0.08447426 0.26400515 0.46935604 0.14643094 0.30914071
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.01138 0.08694 0.17350 0.23780 0.37390 0.65650


partition_k4 <- skmeans(wineMatrix, 4)
silhouette_k4 <- silhouette(partition_k4)

## Silhouette of 100 units in 4 clusters from silhouette.skmeans(x = partition_k4) :
## Cluster sizes and average silhouette widths:
## 17 17 43 23
## 0.3077986 0.4817904 0.1001556 0.2274353
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.04095 0.09243 0.16950 0.22960 0.36070 0.66550



Typical silhouette values for repeated runs with k=5 and k=4 were 0.23 and 0.24 respectively and we can conclude that the segmentations were about equally effective. In this case, visual examination of the segment metadata is useful for selecting which of the k=4 and k=5 partitions better matches the requirements of the business.



Being ‘Data Smart’ with Predixion Insight


 Figure 1: Gartner Magic Quadrant for Advanced Analytics Platforms, 2015 & Data Smart

Executive Summary

I signed up for a trial of Predixion’s predictive analytics software, Predixion Insight, and decided to test it on some of the data problems posed in John Foreman's book, Data Smart. Specifically, I examined the application of Insight’s classification, segmentation and forecast models.

What I particularly liked about using Insight were the following:

  • the environment feels familiar, because it is – it’s Excel, so you feel close to the data
  • the Data Profile function provides an immediate and vivid summary of the data features. This is a feature I returned to frequently
  • the wizards used for setting up the models are intuitive and easy to use – there was no coding required
  • the cloud-based Insight Explorer site is powerful and useful for reporting

Predixion Insight can make it easier for analysts to undertake predictive analysis projects. However, an organization will still benefit from the support of a knowledgeable data scientist to oversee data science practices and to support the interpretation and validation of model results.



When comparing analytics platforms, it is easy for the discussion to become about the technology rather than the benefits delivered. An easy-to-use tool with easy-to-interpret results may be a better fit for many organizations than a difficult-to-use or difficult-to-interpret tool with marginally better models.

Predixion Software promises:

  1. ease-of-use – it is primarily used from within Excel, and
  2. speed-to-insight – it has a wide range of built-in statistical modeling and summary features to help users understand their data and interpret modeling results

These two benefits can constitute a competitive advantage if they help make predictive analysis easier for or more widely adopted in organizations.

Design of Experiment

I reviewed Data Smart in a previous blog post. The book was written by John Foreman, chief data scientist at Mailchimp, and serves as a great introduction to data science.

In Data Smart, Foreman teaches data science techniques using hands-on data-processing and model development in Excel. The reader is immersed in the data when solving the problems presented. Foreman deliberately eschewed teaching a programming language, in order to focus attention on learning and understanding the data science techniques at hand.

There are several similarities between the Predixion approach and the Data Smart approach. For both, Excel is the interface for data input and processing. For both, the user feels physically close to the data during modeling. For both, knowledge of a programming language is not critical for setting up and running the models.

While the approaches are similar, they are not the same. Predixion is a commercial software offering with enhanced functionality including automation, reporting and data exploration, among others. Although programming is not required for the built-in models, it is probable that you will want to extend Insight by building your own models and integrating them with the platform.

I created a public workspace in the Predixion cloud to share the data, modeling packages and detailed results for the data challenges considered here.

Data Challenge #1 – Customer Segmentation


A hypothetical wine retailer has a dataset of sales promotion data for its business. The data is composed of 32 records of promotional campaign results. Each record contains information on wine variety, discount offered, volume, country of origin etc. and on which customers made purchases as a result of the promotions. The challenge posed is to cluster the customers (there are 100 of them) into meaningful segments so that each can be sent more of the promotions they are interested in and fewer of those that they are not interested in.

Foreman uses:

  • k-means and k-medians clustering (Chapter 2) and
  • network graph analysis using modularity maximization (Chapter 5)

to build his clustering models. The k-means procedure requires that the number of clusters, k, is specified beforehand. Foreman uses a method called the silhouette to interpret the effectiveness of the clustering process for different values of k, in order to support the selection of an appropriate k value.

Insight offers four built-in segmentation algorithms, two of which are based on k-means. One of these algorithms originated with Microsoft, the other with the R community. Both of these algorithms offer the facility to automatically select the optimum number of clusters, k. I chose this option.



 Figure 2. Segment profile identifying five clusters from the available data

Figure 2 shows the results from the Insight R k-means algorithm. The results are consistent with those from the Data Smart analysis. Insight grouped the customers into five segments. Four of these segments have readily identifiable and dominant attributes, specifically, customers in these segments have preferences for:

  • high volume purchases of sparkling wine
  • low volume purchases
  • high volume purchases of French wines
  • Pinot Noir

The fifth segment has many different attributes and is less easy to categorize. Scrolling down the customer list in the Segment Profile, we can visually identify customers who have affinities with particular segments.

Figure 3 shows a more traditional network graph view of the clusters. I chose the option to accent the clusters by the degree of discount offered – the lower the discount, the darker the shading in this case. The graph indicates that the Pinot Noir segment is associated with the lowest discount levels and that the France/Champagne segment is associated with the highest levels.


 Figure 3 Cluster network graph, highlighting degree of discount by cluster 

It is possible to modify the approach to explicitly identify the customers in each cluster. In this case we start by looking at the purchases data only. See Figure 4.


Figure 4 Snippet of customer purchases data

Figure 5 shows us the network graph segmentation by customer grouping. In this case, the selected accent shows us which customer segment is most likely to purchase promotion offer #1.


Figure 5 Segments Network by customer

The full list of customers in each segment can be obtained by selecting the Segment Profile view in Insight Explorer and rendering the results to Excel.

Data Challenge #2 – Classification using Logistic Regression

In chapter 6 of Data Smart, the synthesized challenge is to predict which of a retailer's customers are pregnant based on a dataset of their shopping records. A logistic regression model is used. The model is trained on the shopping records of 500 pregnant customers and 500 non-pregnant customers. The model is then tested on a dataset of 1000 different customers, each of whose pregnancy status is known.

It is good practice to explore your data in order to get an understanding of its distribution and inter-relationships. Two particularly useful built-in operations in Insight are the Data Profile and Data Correlation functions.

Data Profile provides a visual display and a statistical summary of each of the features of a dataset. This one-click function provides a view of the distribution of each of the data features and helps identify potential outliers.


Figure 6 Data profile view  of the retail shopping history dataset

Clicking the Data Correlation tab generates a correlation matrix, which can help identify potential relationships among data features and with the target variable.


Figure 7: Correlation Matrix for the retail shopping history dataset

The modeling process is done in two steps:

  • First, a classification model is created by clicking on the ‘Classify’ icon. There are several classification algorithms to choose from, including Logistic Regression, Naive Bayes, Neural Networks and Decision Trees. I chose to train classification models for each of the algorithms since, in this case, the computation cost for selecting all the models is low.


Figure 8 Snippet of Predixion Insight ribbon in Excel

  • Second, the models’ performances are evaluated on the test set data by clicking on the Validate pull down menu and generating an Accuracy Chart


 Figure 9 ROC curves comparing classification model performances 

Figure 9 shows a comparison of model performances. The Logistic Regression, Neural Network and Naive Bayes models have approximately equal performance, with AUC values around 0.9. The Decision Tree has markedly inferior performance to the other models. The performance of the Logistic Regression model is comparable to the results from the manually created model in Excel using the Data Smart process.

Figure 9 does highlight an advantage of Insight in this application: once the data has been prepared and uploaded for the modeling step, there is essentially no additional setup cost to trying the other built-in classification models.

Data Challenge #3 – Forecasting demand

Foreman teaches forecasting techniques using a simple dataset of 36 months of historical sales data. He constructs a procedure in Excel to forecast a year of monthly sales, using simple exponential smoothing (E) at first and later incorporating adjustments for linear trend (T) and multiplicative seasonality (S). He also shows how to construct confidence intervals for the forecast using Monte Carlo techniques.
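The simple exponential smoothing step at the heart of the procedure can be sketched in a few lines of R (the demand figures below are hypothetical, not the book's dataset):

```r
# Simple exponential smoothing from first principles (a sketch, not the
# book's spreadsheet). alpha controls how quickly older observations fade:
# s_t = alpha * y_t + (1 - alpha) * s_{t-1}
ses <- function(y, alpha) {
  s <- numeric(length(y))
  s[1] <- y[1]                      # initialize the level with the first observation
  for (t in 2:length(y)) {
    s[t] <- alpha * y[t] + (1 - alpha) * s[t - 1]
  }
  s[length(s)]                      # the final level is the one-step-ahead forecast
}

demand <- c(120, 132, 127, 141, 139, 150)  # hypothetical monthly sales
ses(demand, alpha = 0.3)
```

Adding trend (T) and seasonality (S) components to this recursion yields the Holt-Winters/ETS family the author builds up to.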

Insight has five built-in time-series forecasting algorithms, based on ARIMA models, Auto-Regressive Tree (ARTXP) models, or a blend of the two.

Figure 10 shows a comparison of three forecasts:

  • The Data Smart forecast using the ETS method in Excel
  • An ETS forecast in R, using the fpp package developed by the time-series forecast expert and author of the free online text, ‘Forecasting: Principles and Practice’, Prof. Rob Hyndman
  • The BAAX model in Insight, a blend of ARIMA and ARTXP algorithms



 Figure 10 Comparison of one-year-ahead monthly demand forecasts

The R and Excel forecasts are in very close agreement, which is to be expected since they are both based on the same algorithm. R is particularly economical in this application, requiring just a couple of lines of code to create the forecast. The R forecast is also particularly information rich. For example, it automatically calculates and displays confidence intervals around the forecasts and provides several error measures to assess the fit of the forecast to the historical data (not shown here).
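For reference, even without the fpp package, base R's stats::HoltWinters can produce a comparable trend-plus-multiplicative-seasonality forecast. The series below is synthetic, since the book's sales figures are not reproduced in this post:

```r
# ETS-style forecast with base R's HoltWinters (level + trend + multiplicative
# seasonality). 'sales' is a synthetic 36-month series; substitute the book's
# demand data in practice.
set.seed(1)
sales <- ts(100 + (1:36) + 10 * sin(2 * pi * (1:36) / 12) + rnorm(36, sd = 2),
            frequency = 12)                            # trend + seasonality + noise
fit <- HoltWinters(sales, seasonal = "multiplicative") # fit the smoothing model
fc  <- predict(fit, n.ahead = 12,
               prediction.interval = TRUE)             # 12-month forecast with intervals
```

The fitted object exposes the estimated alpha, beta and gamma smoothing parameters, and the prediction intervals correspond to the confidence bounds Foreman builds by Monte Carlo simulation.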

The BAAX forecast is the best performing of the default models in Insight, although the algorithm is significantly different from the Holt-Winters ETS model. Some limited customization of the model is allowed under the advanced settings in Insight but, if an ETS-based forecast involving multiplicative seasonality is required, it is best to import your own R model into Insight to do the job.

Insight does provide information on the performance of its forecast against historical data using its visual display tools – see Figure 11. By dragging the forecast interval to the start of the time series, we can see the forecast and historical demand for each month. If we highlight a particular month, such as month 39 in Fig. 11, its predictor months are highlighted in the time series (months 27, 36, 37 and 38 in this example). The information at the top left of the display shows the predicted point value and range data for month 39. These data are 259 +/- 12.4, in this case.

Figure 11 – Performance of Insight ARIMA model against historical data & forecast showing dependencies


Predixion Insight can provide analysts with valuable data summaries of their data and help them to set up predictive analytics models. Some care and expertise may be needed to set up the models properly and to perform model validation afterwards.

I did not have the opportunity to look at the full scope of the tool this time. Insight has some built-in operations that look particularly attractive, such as the feature selection and 'analyze key influencers' operations. That is work for another day.




Review of Data Smart (the book) by John Foreman – It’s Excellent


I highly recommend John Foreman’s book: ‘Data Smart – Using Data Science to Transform Data into Insight’. The author’s approach is unique – he teaches data science skills without teaching programming. His approach works because he limits the newness of each subject item to one dimension, that being the data science technique at hand. Each skill is introduced in the familiar environment of Excel and its spreadsheets. If the author had introduced a programming language at the same time, the difficulty of staying focused on the technique being taught would have been increased greatly.

Each technique introduced is explained by first posing and then working through a business problem. Each problem has an accompanying Excel dataset. For each challenge, you are soon immersed in data – your Excel skills will be stretched and improved and you will need to work hard to keep abreast of each lesson. It is worth the effort for the reader because tangible learning occurs as the data is worked through and the synthesized business challenges are demonstrably solved.


The techniques you will learn include:

  • Cluster analysis using k-means and k-medians algorithms
  • Document classification using a Naive Bayes algorithm
  • Optimization modeling using linear programming methods
  • Community detection in network graphs using modularity maximization
  • Linear and logistic regression, statistical measures, ROC curves
  • Ensemble models based on bagging and boosting techniques
  • Forecasting based on exponential smoothing with trend and seasonality (ETS)
  • Outlier detection using k-nearest-neighbors and local outlier factors

In addition to teaching techniques, the author sprinkles the text with conclusions he has formed based on his years of data science practice and on a healthy skepticism for some of the overblown claims made for data science. His advice includes the following:


  • Beware the poorly posed problem. Data scientists must learn to communicate with stakeholders in order to help frame business problems properly.
  • You will generally get more benefit by spending time selecting good data and engineering better data features than searching for a marginally better predictive model. A great model cannot deliver good performance with poor data.
  • Remember, industry is not academia. In industry, analytics is a results-driven pursuit. The best model is the one that strikes a balance between performance and usability. If a model is never used, it is worthless.
  • Mailchimp, the author’s employer, has TB of data under management. However, it neither needs nor uses all this data for analysis and model training. Instead it aggregates its data to the level the business requires. This aggregate dataset is typically 10GB in size. Such aggregation enables Mailchimp to use R in production.
  • Data science is not the most important function in the organization. The purpose of the data science function is to serve the business and help it achieve its goals. The purpose of data science is not to build models for their own sake.



The author introduces the R programming language in the final chapter, as a gateway to doing further work in data science. R is very powerful but, for a newcomer, it can appear to be a black-box tool for solving data analysis problems. As the author explains, R has great performance but it is not great for explaining the inner workings of models to aspiring data scientists.

At the outset, the author explained that he preferred clarity over mathematical correctness when writing the book. He did an excellent job. The aspiring data scientist will learn by doing in the familiar and manageable environment of Excel.


The datasets used for each of the modeling exercises are available at the publisher’s website for the book.

The author’s blog is invariably insightful and provocative.

The author presented a workshop at the 2014 Strata Conference in Santa Clara. Videos of his presentation are viewable at




Predixion Delivers At ‘The Last Mile Of Analytics’

Agenda: ‘From Data Science to Business Impact’, with Jamie MacLennan, Co-founder and CTO of Predixion, at the Data Science Dojo meetup, Redmond, WA, Feb. 3, 2015.



In Brief

Predixion has developed an impressive, cloud-based and user-friendly predictive analytics framework which can be used on the data organizations have today. The session consisted of:

  1. an update on their distinctive and powerful prediction analysis platform, Predixion Insight
  2. a use case in the healthcare sector detailing the process of predicting and reducing hospital re-admissions. Re-admissions were reduced to 14% from 21%.
  3. a vision of a predictive analytics tool at device level on the Internet of Things
  4. an overview of other applications

What is distinctive about Predixion Insight?


What stands out about Predixion Insight is:

  1. it uses Excel as the client interface so that the environment has a familiar feel
  2. it has impressive visualization capability – e.g. I was impressed that (i) thumbnail visualizations and data sidebars are automatically generated showing the data distributions and summary statistics for each column; (ii) performance comparisons of different models, such as ROC graphs, can be very quickly generated and these were demonstrated during the presentation
  3. it features a patent-pending approach Predixion calls the Machine Learning Semantic Model (MLSM), which is purposed to intelligently categorize and prepare data for analysis; for example, it helps with data cleansing, outlier detection and re-normalization.
  4. the entire analytic workflow can be captured as a portable object, in what is called an MLSM package (see figure). This workflow can be reused or redeployed on other devices such as phones or devices on the Internet of Things (IOT) as a REST API or through Java or .Net libraries.
  5. Predixion focused their efforts on developing an analytics framework. For machine learning, they leveraged a variety of integrated Machine Learning packages such as R, SQL Server Analysis Services and Mahout rather than reprogramming algorithms from scratch. In the case of R, users can select from a range of integrated R scripts and packages or import and use their own user defined scripts.
  6. Predixion literature stresses the importance of ‘The Last Mile of Analytics’ – getting the predictive insight to the location or the person where the insight can be acted upon. An example of this in practice was making prediction insights available directly to nurses. See the CHS case study below.
  7. In conversation, Predixion emphasized the importance of marrying the technical, data-science centered capability in Predixion Insights with the domain expertise of customers to solve customer problems.

Achieving reductions in patient re-admissions from 21% to 14%


The Carolinas Healthcare System is one of the largest hospital systems in the country. Healthcare providers nationally strive to reduce re-admissions to get better outcomes for patients, to avoid penalties and, especially, to reduce the $25 billion that preventable hospital re-admissions cost annually.

CHS explains that the point of maximum benefit of predictive analytics is at the point of interface with the patient.

CHS worked with Predixion to develop a dynamic scoring and classification system to assess the risk of re-admission for patients. The classifications were: ‘very high’, ‘high’, ‘medium’ and ‘low’. The hospitals were able to establish action policies based on the classifications. Each patient’s classification status was available to nurses on their monitors, i.e. the point of interface with the patients so that they were able to take the appropriate actions established by the hospital policy.

CHS has rolled out the system to 13 hospitals. The accuracy of the model in predicting re-admissions is 80%. Accuracy of up to 86% has been reported when the system is used in conjunction with targeted intervention programs.

As a result of the improvements, CHS reported that hospital re-admissions fell from 21% to 14% in just 18 months.

Analytics on the Internet of Things (IoT)


This section was an unexpected lesson. We heard that many sensors in the IoT record as few as four data features such as date, time, sensorID and deviceID. We also heard that it is possible to put the MLSM on almost any device, where it can help with data preparation at the point of collection.

The IoT is an area of focus for Predixion. The potential number of connected devices, once the IoT becomes mainstream, is enormous.

Other applications

Two of the other use cases mentioned during the talk were:

  • predictive analytics for marketing, such as lead scoring and prediction of lead performance;
  • preventative maintenance – Predixion partners with one of its investors, GE, to predict equipment failures.


A free 30-day trial of Predixion Insight is available.

An excellent 11-minute video presentation explaining the architecture, features and capabilities of Predixion Insight is available.

As a comparator to the CHS case study, Manuel Amunategui, lead data scientist at Providence Health & Services, has an excellent 30-minute video presentation, including an R code walk-through, on the work Providence has done, and is doing, to reduce its hospital re-admissions to 8%, according to its measurement criteria. The presentation is a good mix of a description of the business problem, with informative graphs, and technical content.


Predixion appears to have a very good offering for predictive analytics using the data organizations have today.

On the technology meetup circuit in Seattle, I often hear about the wonderful, leading-edge technologies being developed for data management and analytics (e.g. live analytics on massive volumes of streaming, unstructured data). In many cases, though, these technologies are still competing to become the industry standard, and the industry has not yet made up its mind. Further, not all clients are ready to manage exotic data structures; in fact, probably the majority are running their businesses on RDBMSs and will be for the foreseeable future.

Predixion has chosen to leverage proven machine learning and analytics technologies and to focus on delivering a practical and easy-to-use analytics framework. They appear to be well positioned to help companies get valuable insight using the very large datasets they already have.

That Predixion has targeted IoT applications was a revelation. They appear to be close to this market through their work with GE and this activity is worth keeping an eye on for the future.

I will be test-driving Predixion in the coming weeks and, given time, will provide feedback.


Many thanks to Jamie for presenting and for the Q&A at the end and to Data Science Dojo for organizing.