Data Smart, Ch6, Predicting Customer Pregnancies with Logistic Regression

baby_mobile_correlation

Executive summary

In chapter 6 of the book, Data Smart, by John Foreman, Chief Data Scientist at Mailchimp, the synthesized challenge is to predict which of a retailers’ customers are pregnant based on a dataset of their shopping records.

A logistic regression model is used. The model is trained on the shopping records of 500 pregnant customers and 500 non-pregnant customers. The model is then tested on a dataset of 1000 different customers, each of whose pregnancy status is known.

The author develops a logistic regression solution using Excel. We are free to user R to solve the problem and we do so using the code here. The dataset is available at the book’s website.

The R Code

# load the required libraries
library(ROCR) # used to create the ROC curve
library(caret) # used to generate a confusion matrix

# A function to classify pregnancy outcomes based on probabilities of being pregnant.The arguments are: the threshold probability for a positive pregnancy prediction, the predicted pregnancy probabilities of each customer and the known pregnancy status of the customers
manualClass <- function(linThreshold, predictedPreg, knownResult){
  pregPredClass <- predictedPreg # the predicted probability of pregnancy
  pregPredClass[pregPredClass >= linThreshold] <- 1 # classifies as 1 if above threshold
  pregPredClass[pregPredClass < linThreshold] <- 0 # classifies as zero if below threshold
  linPregClass <- cbind(pregPredClass, knownResult) # stores predicted classification & known result
}

# Function to calculate and plot the Confusion Matrix parameters for the assigned linThreshold value
confusMatrix <- function(pregPredClass, knownResult){
  pregPredClass <- as.factor(pregPredClass) # confusionMatrix() requires factor variables
  levels(pregPredClass) <- c("notPregnant", "Pregnant") # assigns level labels
  testResult <- as.factor(knownResult) # confusionMatrix() requires factor variables
  levels(testResult) <- c("notPregnant", "Pregnant")
  confusionMat <- confusionMatrix(pregPredClass, testResult, positive = "Pregnant")
}

# Read in the training data and test set data
trainData <- read.csv('./retailmart.csv')
testData <- read.csv('./testSet.csv')

# Isolate the test set result column (#20) from the predictor variables
testResult <- testData[, 20] 
testNoResult <- testData[, -20]

# Train a logistic regression model
pTrainLogit <- glm(PREGNANT ~., data = trainData, family = binomial)

# Predict pregnancies for the test set based on the trained model
newdata <- testData
pregPredLogit <- predict.glm(pTrainLogit, newdata, type = "response") # probability values of pregnancy

# Manually classify the logistic regression predictions. The threshold is set deliberately high because false positives are particularly undesirable
pregPredClass <- manualClass(0.96, pregPredLogit, testResult)

# Generate the Confusion Matrix for the prediction results for the threshold value of 0.96
confusionMat <- confusMatrix(pregPredClass[,1], testResult)
print(confusionMat)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    notPregnant Pregnant
##   notPregnant         936       37
##   Pregnant              4       23
##                                           
##                Accuracy : 0.959           
##                  95% CI : (0.9448, 0.9704)
##     No Information Rate : 0.94            
##     P-Value [Acc > NIR] : 0.004943        
##                                           
##                   Kappa : 0.5105          
##  Mcnemar's Test P-Value : 5.806e-07       
##                                           
##             Sensitivity : 0.3833          
##             Specificity : 0.9957          
##          Pos Pred Value : 0.8519          
##          Neg Pred Value : 0.9620          
##              Prevalence : 0.0600          
##          Detection Rate : 0.0230          
##    Detection Prevalence : 0.0270          
##       Balanced Accuracy : 0.6895          
##                                           
##        'Positive' Class : Pregnant         

We can choose to use an alternative threshold (0.5) and show its corresponding confusion matrix for comparison purposes. At this threshold, we see that there are more true positives, however, there are also many more false positives, which are undesirable.

pregPredClassAlt <- manualClass(0.5, pregPredLogit, testResult)
confusionMatAlt <- confusMatrix(pregPredClass[,1], testResult)
print(confusionMatAlt)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    notPregnant Pregnant
##   notPregnant         936       37
##   Pregnant              4       23
##                                           
##                Accuracy : 0.959           
##                  95% CI : (0.9448, 0.9704)
##     No Information Rate : 0.94            
##     P-Value [Acc > NIR] : 0.004943        
##                                           
##                   Kappa : 0.5105          
##  Mcnemar's Test P-Value : 5.806e-07       
##                                           
##             Sensitivity : 0.3833          
##             Specificity : 0.9957          
##          Pos Pred Value : 0.8519          
##          Neg Pred Value : 0.9620          
##              Prevalence : 0.0600          
##          Detection Rate : 0.0230          
##    Detection Prevalence : 0.0270          
##       Balanced Accuracy : 0.6895          
##                                           
##        'Positive' Class : Pregnant         
# Plot ROC curve using RORC package
predLogit <- prediction(pregPredLogit, testResult)
perfLogit <- performance(predLogit, "tpr", "fpr")
plot(perfLogit)

ROC

 

Conclusion

The results agree with those generated by the logistic regression model developed by the author using Excel.

Leave a Reply

Your email address will not be published.