Data Smart, Ch2, Customer Segmentation With R Using K-Medians Clustering


Executive Summary

This is a walk-through of a customer segmentation process using R’s skmeans package. The book calls the technique k-medians clustering; skmeans itself reports its result as a spherical k-means partition. The dataset examined is the one used in chapter 2 of John Foreman’s book, Data Smart.

The approach followed is the one outlined by the author. The major difference is that the author, in keeping with his teaching objectives, built his solution manually in Excel, whereas here we take advantage of R’s off-the-shelf packages to solve the problem.

This walk-through includes R function calls to calculate silhouette values as measures of segmentation effectiveness.

A note on the silhouette

The silhouette calculation for an assigned data point requires three summary statistics:

  • A – the average distance to the data points in its nearest neighboring cluster
  • B – the average distance to the data points in its assigned cluster
  • C – the maximum of A and B

The silhouette for a data point is (A-B)/C. It is a measure of the robustness of the assignment. A value close to 1 indicates that a data point is very well suited to its cluster. A value close to (or less than) zero indicates ambiguity about the cluster assignment.

The overall silhouette value is the mean of the silhouette values of the individual data points.
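
As a rough illustration of the calculation described above, the following base-R sketch computes A, B and C by hand for a toy one-dimensional dataset. The data values, the helper name silhouette_point, the singleton convention and the use of Euclidean distance are all assumptions made for the example; they are not taken from the book or from the skmeans package.

```r
# Toy data (illustrative values only): two well-separated groups on a line
pts <- matrix(c(0, 1, 2, 10, 11, 12), ncol = 1)
clusters <- c(1, 1, 1, 2, 2, 2)
d <- as.matrix(dist(pts))  # pairwise Euclidean distances (an assumption for this sketch)

# Hypothetical helper: silhouette of point i, using the A, B, C naming from the text
silhouette_point <- function(i, d, clusters) {
  own <- clusters[i]
  same <- setdiff(which(clusters == own), i)  # other members of the assigned cluster
  if (length(same) == 0) return(0)            # assumed convention for one-member clusters
  B <- mean(d[i, same])                       # average distance within the assigned cluster
  others <- setdiff(unique(clusters), own)
  # Average distance to every other cluster; the nearest neighbour gives the smallest
  A <- min(sapply(others, function(k) mean(d[i, clusters == k])))
  C <- max(A, B)
  (A - B) / C
}

sil <- sapply(seq_len(nrow(pts)), silhouette_point, d = d, clusters = clusters)
overall <- mean(sil)  # overall silhouette: mean of the individual values
sil
overall
```

Because the two toy groups are far apart, every individual value comes out well above zero, which is the "well suited to its cluster" case described above.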

R Code

# Load R packages
library(skmeans) # used for k-medians algorithm
library(cluster) # required for calling the silhouette function
library(dplyr) # an excellent tool for summarizing, ordering and filtering data features
# Read in the sales promotion dataset. Remove meta columns, convert NA values to zeroes
kmcDF <- read.csv(".\\wineKMC_matrix.csv") # reads in data as a dataframe
wineDF <- t(kmcDF[, -c(1:7)]) # new variable, metadata columns removed, data transposed so that each row is a customer
wineDF[is.na(wineDF)] <- 0 # replaces blank (NA) entries with zeros
wineMatrix <- as.matrix(wineDF) # ensures the data is of type matrix
# Segment the customers into 5 clusters
partition <- skmeans(wineMatrix, 5) 
# Look at the segmentation outcome summary
partition # returns a summary statement for the process
partition$cluster # returns a vector showing cluster assignment for each customer
# Create a vector of customer names for each cluster
cluster_1 <- names(partition$cluster[partition$cluster == 1])
cluster_2 <- names(partition$cluster[partition$cluster == 2])
cluster_3 <- names(partition$cluster[partition$cluster == 3])
cluster_4 <- names(partition$cluster[partition$cluster == 4])
cluster_5 <- names(partition$cluster[partition$cluster == 5])
# Examine one of the clusters, as an example
cluster_1

The skmeans object, partition, informs us that the result of the segmentation is:
## a hard spherical k-means partition of 100 objects into 5 classes.
## Class sizes: 14, 18, 16, 29, 23

The customers assigned to cluster_1 are shown here. Cluster assignments and sizes can vary from one segmentation run to the next, which is why this listing of 23 customers comes from a different run than the class sizes printed above.
##  [1] "Butler"   "Clark"    "Collins"  "Cooper"   "Davis"    "Fisher"
##  [7] "Garcia"   "Gomez"    "Hall"     "Howard"   "Jackson"  "Lopez"
## [13] "Martin"   "Martinez" "Parker"   "Powell"   "Reed"     "Sanchez"
## [19] "Sanders"  "Thomas"   "Thompson" "Ward"     "White"
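
The five near-identical assignments above could also be collapsed into a single call to base R's split(). The sketch below is only an illustration: the named vector assignments is a hypothetical stand-in for partition$cluster, with made-up names and labels.

```r
# Hypothetical stand-in for partition$cluster: a named vector of cluster labels
assignments <- c(Adams = 2L, Butler = 1L, Clark = 1L, Diaz = 3L, Evans = 2L)

# One list element per cluster, each holding the customer names in that cluster
members <- split(names(assignments), assignments)

members[["1"]]    # customers assigned to cluster 1
lengths(members)  # number of customers in each cluster
```

With the real data, the same pattern would be split(names(partition$cluster), partition$cluster), which keeps working if the number of clusters changes.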

# Examine characteristics of each cluster using the aggregation function to sum the number of purchases for each promotion by cluster
clusterCounts <- t(aggregate(wineDF, by=list(partition$cluster), sum)[,2:33]) # taken directly from Data Smart
clusterCounts <- cbind(kmcDF[,c(1:7)], clusterCounts) # add back the meta data columns
# The arrange function in the dplyr package is used to view the characteristics of the different clusters
View(arrange(clusterCounts, -clusterCounts$"1"))
View(arrange(clusterCounts, -clusterCounts$"2"))
View(arrange(clusterCounts, -clusterCounts$"3"))
View(arrange(clusterCounts, -clusterCounts$"4"))
View(arrange(clusterCounts, -clusterCounts$"5"))

Some sample results are shown:


Example 1: cluster_1 has a liking for French wines


Example 2: cluster_2 favors low-volume purchases


Example 3: cluster_5 members like to drink pinot noir
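
To make the aggregation step easier to follow, here is a small base-R sketch of the same idea on toy data. The purchase matrix, customer names, offer names and cluster labels are all invented for the illustration; only the aggregate-then-transpose pattern mirrors the code above.

```r
# Toy purchase matrix: rows are customers, columns are promotions (1 = took the offer)
purchases <- matrix(
  c(1, 0, 1,
    1, 0, 0,
    0, 1, 0,
    0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Adams", "Butler", "Clark", "Diaz"),
                  c("offer_1", "offer_2", "offer_3"))
)
cluster_id <- c(1, 1, 2, 2)  # hypothetical cluster assignment for each customer

# Sum purchases per promotion within each cluster; the first column holds the cluster label
agg <- aggregate(as.data.frame(purchases), by = list(cluster = cluster_id), FUN = sum)

# Drop the label column and transpose so that rows are promotions and columns are clusters
counts <- t(agg[, -1])
colnames(counts) <- agg$cluster
counts
```

Each column of counts then shows how many purchases a cluster made of each promotion, which is the table that the View(arrange(...)) calls sort by one cluster at a time.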

Computing the silhouette with R

# Compare performance of the k=5 clustering process with the k=4 clustering using the silhouette function. The closer the value is to 1, the better.
silhouette_k5 <- silhouette(partition)
summary(silhouette_k5) # prints cluster sizes and average silhouette widths

## Silhouette of 100 units in 5 clusters from silhouette.skmeans(x = partition) :
## Cluster sizes and average silhouette widths:
##         23         21         17         23         16
## 0.08447426 0.26400515 0.46935604 0.14643094 0.30914071
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -0.01138  0.08694  0.17350  0.23780  0.37390  0.65650


partition_k4 <- skmeans(wineMatrix, 4)
silhouette_k4 <- silhouette(partition_k4)
summary(silhouette_k4) # prints cluster sizes and average silhouette widths

## Silhouette of 100 units in 4 clusters from silhouette.skmeans(x = partition_k4) :
## Cluster sizes and average silhouette widths:
##        17        17        43        23
## 0.3077986 0.4817904 0.1001556 0.2274353
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -0.04095  0.09243  0.16950  0.22960  0.36070  0.66550



Typical silhouette values for repeated runs with k=5 and k=4 were 0.23 and 0.24 respectively (the single runs shown above gave means of 0.238 and 0.230), so we can conclude that the two segmentations were about equally effective. In this case, visual examination of the segment metadata is useful for deciding which of the k=4 and k=5 partitions better matches the requirements of the business.
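
The comparison above can be extended to a range of k values with a short loop. The sketch below is a hedged illustration rather than part of the original analysis: it uses a synthetic binary matrix (toy) in place of wineMatrix, an arbitrary seed, and an arbitrary range of k. Seeding R's random number generator is assumed to make repeated runs comparable.

```r
library(skmeans)  # provides skmeans() and the silhouette method used in this walk-through
library(cluster)  # provides the silhouette() generic

set.seed(2024)  # arbitrary seed, chosen only so that reruns start from the same random state

# Synthetic stand-in for wineMatrix: 60 customers by 12 promotions, binary purchases
toy <- matrix(rbinom(60 * 12, 1, 0.3), nrow = 60)
toy[rowSums(toy) == 0, 1] <- 1  # avoid all-zero rows, which have no direction for cosine distance

ks <- 2:6
mean_sil <- sapply(ks, function(k) {
  part <- skmeans(toy, k)
  mean(silhouette(part)[, "sil_width"])  # overall silhouette: mean of the individual widths
})
names(mean_sil) <- ks
mean_sil
```

With the real data, replacing toy with wineMatrix gives one overall silhouette value per k, which can then be weighed against how interpretable each set of segments is.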


