**Executive Summary**

In chapter 7 of his book Data Smart, John Foreman again predicts the pregnancy status of Retail Mart's customers based on their shopping habits. This time he uses ensemble techniques, specifically bagging and boosting, to build his predictive models.

Since the author also provides R code for logistic regression and random forest solutions in a later chapter of his book (chapter 10), we take a slightly different approach here. We use the well-known and powerful caret predictive modeling framework for R to implement the bagging, boosting and random forest models.

The boosting model proved to have a marginally better AUC score. More importantly, we will see dramatic differences in computation time: the boosting model trained more than six times faster than the random forest model and more than thirteen times faster than the bagging model.

**The R code**

```r
# Load required packages
library(caret)    # see the excellent http://topepo.github.io/caret/index.html by the package developer, Max Kuhn
library(ggplot2)
library(e1071)
library(ROCR)     # required for ROC curves

# Read in the training and test data
rm_train <- read.csv(".\\ensemble_training.csv", header = TRUE, check.names = FALSE)   # the training data
rm_train$PREGNANT <- factor(rm_train$PREGNANT, labels = c("Yes", "No"))  # using labels "1" and "0" can give rise to problems in caret
rm_test <- read.csv(".\\ensemble_test_orig.csv", header = TRUE, check.names = FALSE)   # the test data
rm_test$PREGNANT <- factor(rm_test$PREGNANT, labels = c("Yes", "No"))
```

```r
# Get ready to set up and use caret
# Set the control parameters for the training step:
#  - repeated ten-fold cross-validation (3 repeats)
#  - 'classProbs = TRUE' is required to return class probabilities
#    (pregnant and not pregnant in this case)
#  - 'summaryFunction = twoClassSummary' returns binary classification
#    metrics (ROC, sensitivity, specificity)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Train a random forest model using the rm_train data
#  - method = 'rf' is the designator for the randomForest package
#  - metric = 'ROC' selects the tuning parameters by area under the ROC curve
set.seed(88)
timeStart <- proc.time()  # measure computation time
rf_fit <- train(PREGNANT ~ ., data = rm_train, method = "rf",
                trControl = ctrl, metric = "ROC")
proc.time() - timeStart   # user and system time measured as 276 secs and 6.06 secs respectively
```
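As an aside, train() records its own timing in the returned object, so bracketing the call with proc.time() is not strictly necessary; a minimal sketch, assuming a reasonably recent version of caret:

```r
# caret stores timing for the whole train() call in the fitted object
# (assumes a reasonably recent caret version)
rf_fit$times$everything  # same format as the proc.time() output above
```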

```r
plot(rf_fit)  # plots the fitted results; the best AUC is achieved with 2 random predictors
rf_fit        # prints the model information
```

**Fig. 1 Area-under-curve scores versus number of variables selected (mtry).** mtry = 2 had the best performance, AUC = 0.894.

The information returned by rf_fit was:

```
Random Forest

1000 samples
  19 predictors
   2 classes: 'Yes', 'No'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...
Resampling results across tuning parameters:

  mtry  ROC        Sens       Spec       ROC SD      Sens SD     Spec SD
   2    0.8939133  0.9033333  0.7360000  0.03381877  0.03679705  0.06589438
  10    0.8676067  0.8826667  0.7226667  0.03944552  0.04059330  0.06051123
  19    0.8621867  0.8806667  0.7106667  0.03725257  0.04050827  0.05982350

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
```

```r
plot(varImp(rf_fit))  # plots variable importance
```

```r
# Predict on the test set using the random forest model created with the training set
rf_class <- predict(rf_fit, newdata = rm_test)                  # classification outcome required for the confusion matrix
rf_pred  <- predict(rf_fit, newdata = rm_test, type = "prob")   # predicted probabilities required for the ROC curve

# Generate a confusion matrix for the random forest prediction results
rf_cm <- confusionMatrix(rf_class, rm_test$PREGNANT)
rf_cm
```

```
Confusion Matrix and Statistics

          Reference
Prediction Yes  No
       Yes 843  13
       No   97  47

               Accuracy : 0.89
                 95% CI : (0.8689, 0.9087)
    No Information Rate : 0.94
    P-Value [Acc > NIR] : 1
                  Kappa : 0.4109
 Mcnemar's Test P-Value : 2.498e-15

            Sensitivity : 0.8968
            Specificity : 0.7833
         Pos Pred Value : 0.9848
         Neg Pred Value : 0.3264
             Prevalence : 0.9400
         Detection Rate : 0.8430
   Detection Prevalence : 0.8560
      Balanced Accuracy : 0.8401

       'Positive' Class : Yes
```
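Individual statistics can also be pulled out of the confusionMatrix object rather than read off the printout, which is handy for comparing the three models programmatically:

```r
# The confusionMatrix object stores its statistics as named vectors
rf_cm$overall["Accuracy"]     # overall accuracy, 0.89 here
rf_cm$byClass["Sensitivity"]  # 0.8968
rf_cm$byClass["Specificity"]  # 0.7833
```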

We repeat the process for boosting, using the gbm package:

```r
# Train a boosted classification tree model using caret and the 'gbm' package,
# reusing the same trControl settings, and measure the process time
timeStart <- proc.time()
gbm_fit <- train(PREGNANT ~ ., data = rm_train, method = "gbm",
                 trControl = ctrl, metric = "ROC", verbose = FALSE)
proc.time() - timeStart  # process times measured as 42 secs (user) and 0.17 secs (system)

plot(gbm_fit)  # plots the fitted results
gbm_fit        # prints the model information
```

**Fig. 2 AUC variations by Max. Tree Depth and number of boosting iterations**

```
Stochastic Gradient Boosting

1000 samples
  19 predictors
   2 classes: 'Yes', 'No'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...
Resampling results across tuning parameters:

  interaction.depth  n.trees  ROC        Sens       Spec       ROC SD
  1                   50      0.8866733  0.8980000  0.7120000  0.02918051
  1                  100      0.8981333  0.8993333  0.7420000  0.02639375
  1                  150      0.9013867  0.8960000  0.7566667  0.02464031
  2                   50      0.8958800  0.8946667  0.7433333  0.02869702
  2                  100      0.9023000  0.8966667  0.7586667  0.02630462
  2                  150      0.9023867  0.8960000  0.7613333  0.02688160
  3                   50      0.8999200  0.8946667  0.7546667  0.02664430
  3                  100      0.9040067  0.8973333  0.7593333  0.02380261
  3                  150      0.9024867  0.8926667  0.7640000  0.02470665

Tuning parameter 'shrinkage' was held constant at a value of 0.1
ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 100, interaction.depth = 3
and shrinkage = 0.1.
```
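On a re-run, the grid search can be skipped by pinning the tuning parameters with a one-row tuneGrid. A sketch follows; note that newer versions of caret and gbm also tune n.minobsinnode (the value 10 below is an assumption), while older versions expect a grid with only the three parameters shown in the output above:

```r
# Refit gbm with the tuning parameters fixed at the values selected above
# (sketch; n.minobsinnode = 10 is an assumption, and older caret versions
# expect a grid without that column)
gbm_grid  <- expand.grid(n.trees = 100, interaction.depth = 3,
                         shrinkage = 0.1, n.minobsinnode = 10)
gbm_fixed <- train(PREGNANT ~ ., data = rm_train, method = "gbm",
                   trControl = ctrl, metric = "ROC", verbose = FALSE,
                   tuneGrid = gbm_grid)
```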

```r
plot(varImp(gbm_fit))  # plots variable importance
```

```r
# Predict on the test set using the gbm model
gbm_class <- predict(gbm_fit, newdata = rm_test)                  # returns classifications
gbm_pred  <- predict(gbm_fit, newdata = rm_test, type = "prob")   # returns probabilities

# Generate a confusion matrix for the gbm prediction results
gbm_cm <- confusionMatrix(gbm_class, rm_test$PREGNANT)
gbm_cm
```

The confusion matrix and statistics for gbm:

```
Confusion Matrix and Statistics

          Reference
Prediction Yes  No
       Yes 835  13
       No  105  47

               Accuracy : 0.882
                 95% CI : (0.8604, 0.9013)
    No Information Rate : 0.94
    P-Value [Acc > NIR] : 1
                  Kappa : 0.391
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.8883
            Specificity : 0.7833
         Pos Pred Value : 0.9847
         Neg Pred Value : 0.3092
             Prevalence : 0.9400
         Detection Rate : 0.8350
   Detection Prevalence : 0.8480
      Balanced Accuracy : 0.8358

       'Positive' Class : Yes
```

We go through the process one last time for bagging, using caret's 'bag' method:

```r
# Train a bagging model using caret's 'bag' method
# The train() call needs a bagControl() argument, which specifies the type of
# model being bagged (here, conditional inference trees via caret's ctreeBag)
timeStart <- proc.time()
bag_fit <- train(PREGNANT ~ ., data = rm_train, method = "bag",
                 trControl = ctrl, metric = "ROC", verbose = FALSE,
                 bagControl = bagControl(fit = ctreeBag$fit,
                                         pred = ctreeBag$pred,
                                         aggregate = ctreeBag$aggregate),
                 tuneGrid = data.frame(vars = seq(1, 15, by = 2)))
proc.time() - timeStart  # processing times measured as 581 secs (user) and 1.66 secs (system)
```
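As an aside, caret also offers a simpler bagged-tree route through the 'treebag' method (bagged CART via the ipred package), which needs no bagControl or tuning grid; a minimal sketch, not used for the results reported here:

```r
# Alternative: bagged CART trees via caret's 'treebag' method (ipred package)
treebag_fit <- train(PREGNANT ~ ., data = rm_train, method = "treebag",
                     trControl = ctrl, metric = "ROC")
```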

```r
plot(bag_fit)  # plots the fitted results
bag_fit        # prints the model information
```

**Fig. 3 AUC performance vs number of randomly selected predictors, with caret's 'bag' method**

```
Bagged Model

1000 samples
  19 predictors
   2 classes: 'Yes', 'No'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...
Resampling results across tuning parameters:

  vars  ROC        Sens       Spec       ROC SD      Sens SD     Spec SD
   1    0.6315533  0.8380000  0.3760000  0.21244297  0.22732932  0.31668106
   3    0.8099867  0.8486667  0.5486667  0.11276167  0.13490056  0.18093118
   5    0.8508800  0.8500000  0.6206667  0.03930400  0.15229282  0.15926411
   7    0.8652133  0.8766667  0.6633333  0.03007760  0.07720699  0.08355602
   9    0.8789867  0.9000000  0.6740000  0.03500573  0.04954970  0.08822385
  11    0.8747733  0.8760000  0.7040000  0.03370725  0.08211010  0.08556506
  13    0.8767667  0.8806667  0.7160000  0.03239221  0.05570447  0.07582193
  15    0.8822867  0.8846667  0.7100000  0.03643544  0.05243935  0.06982737

ROC was used to select the optimal model using the largest value.
The final value used for the model was vars = 15.
```

```r
plot(varImp(bag_fit))  # plots variable importance
```
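The prediction step for the bagged model mirrors the random forest and gbm steps above; it also creates the bag_pred object needed for the ROC comparison in Fig. 4:

```r
# Predict on the test set using the bagged model
bag_class <- predict(bag_fit, newdata = rm_test)                  # returns classifications
bag_pred  <- predict(bag_fit, newdata = rm_test, type = "prob")   # returns probabilities

# Generate a confusion matrix for the bagging prediction results
bag_cm <- confusionMatrix(bag_class, rm_test$PREGNANT)
bag_cm
```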

The confusion matrix and statistics for bag:

```
Confusion Matrix and Statistics

          Reference
Prediction Yes  No
       Yes 800  12
       No  140  48

               Accuracy : 0.848
                 95% CI : (0.8242, 0.8697)
    No Information Rate : 0.94
    P-Value [Acc > NIR] : 1
                  Kappa : 0.3258
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.8511
            Specificity : 0.8000
         Pos Pred Value : 0.9852
         Neg Pred Value : 0.2553
             Prevalence : 0.9400
         Detection Rate : 0.8000
   Detection Prevalence : 0.8120
      Balanced Accuracy : 0.8255

       'Positive' Class : Yes
```

Here is the R code used to generate Fig. 4, the plot comparing the ROC curves of each of the models.

```r
# Generate an ROC curve for the random forest model
predRF <- prediction(rf_pred[, 1], rm_test$PREGNANT)
perfRF <- performance(predRF, "tpr", "fpr")
plot(perfRF, main = "ROC curves for randomForest, gbm and bag models")

# Generate an ROC curve for the gbm model
pred_gbm <- prediction(gbm_pred[, 1], rm_test$PREGNANT)
perf_gbm <- performance(pred_gbm, "tpr", "fpr")
plot(perf_gbm, add = TRUE, col = "blue")

# Generate an ROC curve for the 'bag' model
pred_bag <- prediction(bag_pred[, 1], rm_test$PREGNANT)
perf_bag <- performance(pred_bag, "tpr", "fpr")
plot(perf_bag, add = TRUE, col = "red")

# Add a legend to the plot
legend("right", legend = c("randomForest", "gbm", "bag"),
       bty = "n", cex = 1, lty = 1, col = c("black", "blue", "red"))
```

**Fig. 4 ROC curve comparison for the random forest, bagging and boosting models**
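Finally, the AUC comparison mentioned in the executive summary can be read off the same ROCR objects; a minimal sketch (exact values depend on the data and the random seed):

```r
# Extract the test-set AUC for each model from the ROCR prediction objects
auc_rf  <- performance(predRF,   "auc")@y.values[[1]]
auc_gbm <- performance(pred_gbm, "auc")@y.values[[1]]
auc_bag <- performance(pred_bag, "auc")@y.values[[1]]
round(c(randomForest = auc_rf, gbm = auc_gbm, bag = auc_bag), 3)
```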