This is a walk-through of a customer segmentation process using R's skmeans package, which performs spherical k-means clustering. The dataset examined is that used in chapter 2 of John Foreman's book, Data Smart.

The approach followed is that outlined by the author. The major difference is that the author, as per his teaching objectives, built his solution manually in Excel. On the other hand, we are free to take advantage of R’s off-the-shelf packages to solve the problem and we do so here.

This walk-through includes R function calls to calculate silhouette values as measures of segmentation effectiveness.

**A note on the silhouette**

The silhouette calculation for an assigned data point requires three summary statistics:

- A – the average distance to the data points in its nearest neighboring cluster
- B – the average distance to the data points in its assigned cluster
- C – the maximum of A and B

The silhouette for a data point is (A-B)/C. It is a measure of the robustness of the assignment. A value close to 1 indicates that a data point is very well suited to its cluster. A value close to (or less than) zero indicates ambiguity about the cluster assignment.

The overall silhouette value is the mean of the silhouette values of the individual data points.
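To make the definition concrete, here is a small base-R sketch that computes the silhouette by hand for a made-up one-dimensional dataset (the points and cluster labels below are invented for illustration and have nothing to do with the wine data):

```r
# Toy data: two tight clusters on a line.
x  <- c(1, 2, 8, 9)
cl <- c(1, 1, 2, 2)

# Silhouette of point i: A = average distance to the nearest other
# cluster, B = average distance to the rest of its own cluster,
# silhouette = (A - B) / max(A, B), exactly as defined above.
sil_point <- function(i) {
  d <- abs(x - x[i])                                       # 1-D distances from point i
  B <- mean(d[cl == cl[i] & seq_along(x) != i])            # own cluster, excluding i
  A <- min(tapply(d[cl != cl[i]], cl[cl != cl[i]], mean))  # nearest other cluster
  (A - B) / max(A, B)
}

sils <- sapply(seq_along(x), sil_point)
round(mean(sils), 3)   # overall silhouette: the mean over all points
```

Both clusters are compact and well separated, so every point scores close to 1; moving a point toward the other cluster would drive its silhouette toward zero.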

#### R Code

```r
library(skmeans)

# Read in the sales promotion dataset, remove the metadata columns and
# convert NA values to zeroes
kmcDF <- read.csv(".\\wineKMC_matrix.csv")    # reads in data as a dataframe
wineDF <- t(kmcDF[, -c(1,2,3,4,5,6,7)])       # metadata columns removed, dataframe transposed
wineDF[is.na(wineDF)] <- 0                    # replaces blank entries with zeros
wineMatrix <- as.matrix(wineDF)               # converts the dataframe to type matrix

# Segment the customers into 5 clusters
partition <- skmeans(wineMatrix, 5)

# Look at the segmentation outcome summary
partition           # returns a summary statement for the process
partition$cluster   # returns a vector showing cluster assignment for each customer

# Create a vector of customer names for each cluster
cluster_1 <- names(partition$cluster[partition$cluster == 1])
cluster_2 <- names(partition$cluster[partition$cluster == 2])
cluster_3 <- names(partition$cluster[partition$cluster == 3])
cluster_4 <- names(partition$cluster[partition$cluster == 4])
cluster_5 <- names(partition$cluster[partition$cluster == 5])

# Examine one of the clusters, as an example
cluster_1
```

The skmeans object, partition, informs us that the result of the segmentation is:

```
## a hard spherical k-means partition of 100 objects into 5 classes.
## Class sizes: 14, 18, 16, 29, 23
```

The customers assigned to cluster_1 are shown here. You may see variation in cluster assignments and sizes from segmentation run to segmentation run.

```
## [1]  "Butler"   "Clark"    "Collins"  "Cooper"   "Davis"    "Fisher"
## [7]  "Garcia"   "Gomez"    "Hall"     "Howard"   "Jackson"  "Lopez"
## [13] "Martin"   "Martinez" "Parker"   "Powell"   "Reed"     "Sanchez"
## [19] "Sanders"  "Thomas"   "Thompson" "Ward"     "White"
```
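The run-to-run variation noted above comes from the random initialization that skmeans uses. If reproducible cluster labels are wanted (an addition of ours, not part of Foreman's recipe), R's random seed can be fixed before the call. The sketch below only demonstrates the base-R set.seed mechanism, with the corresponding skmeans call shown in a comment:

```r
# set.seed fixes R's random-number stream, so any randomized
# computation that follows it is reproducible across runs.
set.seed(42)
draw1 <- sample(1:100, 5)
set.seed(42)
draw2 <- sample(1:100, 5)
identical(draw1, draw2)   # TRUE: same seed, same draws

# Applied to the segmentation step in this walk-through:
#   set.seed(42)
#   partition <- skmeans(wineMatrix, 5)
```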

```r
library(dplyr)

# Examine the characteristics of each cluster, using the aggregate function
# to sum the number of purchases for each promotion by cluster
clusterCounts <- t(aggregate(wineDF, by = list(partition$cluster), sum)[, 2:33]) # taken directly from Data Smart
clusterCounts <- cbind(kmcDF[, c(1:7)], clusterCounts)  # add back the metadata columns

# The arrange function in the dplyr package is used to view the
# characteristics of the different clusters
View(arrange(clusterCounts, -clusterCounts$"1"))
View(arrange(clusterCounts, -clusterCounts$"2"))
View(arrange(clusterCounts, -clusterCounts$"3"))
View(arrange(clusterCounts, -clusterCounts$"4"))
View(arrange(clusterCounts, -clusterCounts$"5"))
```

Some sample results are shown:

Example 1: cluster_1 has a liking for French wines

Example 2: cluster_2 favors low-volume purchases

Example 3: cluster_5 members like to drink pinot noir

**Computing the silhouette with R**

```r
library(cluster)  # provides the silhouette() generic; skmeans supplies the method

# Compare performance of the k=5 clustering process with the k=4 clustering
# using the silhouette function. The closer the value is to 1, the better.
silhouette_k5 <- silhouette(partition)
summary(silhouette_k5)
plot(silhouette_k5)
```

```
## Silhouette of 100 units in 5 clusters from silhouette.skmeans(x = partition) :
## Cluster sizes and average silhouette widths:
##         23         21         17         23         16
## 0.08447426 0.26400515 0.46935604 0.14643094 0.30914071
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -0.01138  0.08694  0.17350  0.23780  0.37390  0.65650
```

```r
partition_k4 <- skmeans(wineMatrix, 4)
silhouette_k4 <- silhouette(partition_k4)
plot(silhouette_k4)
summary(silhouette_k4)
```

```
## Silhouette of 100 units in 4 clusters from silhouette.skmeans(x = partition_k4) :
## Cluster sizes and average silhouette widths:
##        17        17        43        23
## 0.3077986 0.4817904 0.1001556 0.2274353
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -0.04095  0.09243  0.16950  0.22960  0.36070  0.66550
```

**Conclusion**

Typical silhouette values for repeated runs with k=5 and k=4 were 0.23 and 0.24 respectively, so the two segmentations were about equally effective. In this case, visual examination of the segment metadata is useful for deciding which of the k=4 and k=5 partitions better matches the requirements of the business.
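The k=4 versus k=5 comparison generalizes to scanning several values of k and keeping the one with the highest mean silhouette. The base-R sketch below illustrates the idea on synthetic data (stats::kmeans with Euclidean distance stands in for skmeans with cosine distance, purely so the example is self-contained; the wine data and the numbers above are not used):

```r
set.seed(1)
# Synthetic data: two well-separated 2-D blobs of 20 points each.
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

# Mean silhouette of a labelling, using the (A - B) / max(A, B) definition.
mean_sil <- function(data, labels) {
  D <- as.matrix(dist(data))
  n <- nrow(data)
  mean(sapply(seq_len(n), function(i) {
    B <- mean(D[i, labels == labels[i] & seq_len(n) != i])
    A <- min(tapply(D[i, labels != labels[i]], labels[labels != labels[i]], mean))
    (A - B) / max(A, B)
  }))
}

# Scan candidate k and report the mean silhouette for each.
sapply(2:4, function(k) mean_sil(pts, kmeans(pts, k)$cluster))
```

With two genuine blobs, k = 2 should come out on top; on the wine data the analogous scan would use skmeans and the silhouette function, as in the walk-through above.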

Gerard

@PugData