Data Smart, Ch5, Network Graphs and Community Detection

Modularity_maximization

Executive Summary

Chapter 5 of John Foreman‘s book Data Smart looks at data which can be arranged as a network graph of related data points. It uses a cluster analysis technique called Modularity Maximization to optimize cluster assignments for the graph data.

We can implement the same process succinctly in R, making use of functions in the R igraph and lsa packages. The modularity score returned, 0.549, is marginally better than the value of 0.546 calculated using the Excel-based process and is the same as that calculated in the book using Gephi, the well-known open source graph analysis tool.

Process Outline

The same wine promotion dataset, as used in Chapter 2 to explain and develop the k-means clustering algorithms, is used here.

The author implements the following steps to perform community detection (cluster analysis) on the data in Excel:

  1. Prepare the data for graph analysis
  2. Calculate the cosine similarity in wine promotion purchases for each customer pairing. The customers form the graph vertices and the cosine similarity defines the edge connections in the graph
  3. Prune the graph using the r-Neighborhood technique, keeping only edges with similarity scores at or above the 80th percentile
  4. Calculate new scores for the inter-customer relationships, taking into account the expectations of their being connected had the graph been connected randomly
  5. Calculate graph modularity arising from cluster assignments of the customers. Adjust cluster assignments to maximize graph modularity.

A note on graph modularity

Graph modularity works by assessing a probability-of-connection score for each pair of nodes in a network, taking into account the degree (number of connections) of each node in the graph. When clusters are formed, the graph modularity score is incremented for each pair within a cluster that has a connection between them; the modularity score is penalized for each pair within a cluster that has no connection between them. The process of modularity maximization adjusts the number of clusters and the cluster assignments to find the optimum graph modularity score.

The R Code

# R script to implement Community Detection on the wine dataset from Chapter 5 of the book Data Smart, by John Foreman

# load the R packageslibrary(igraph) # R package for network analysis and visualizationlibrary(lsa) # required for the cosine() function, used to generate a similarity matrix
# read in the wine data, convert the data to a matrix, calculate similarities between vertices
wineData <- read.csv(".\\WineNetwork.csv", header = TRUE, check.names = FALSE)
wineNW <- wineData[, -c(1,2,3,4,5,6,7)] # remove the metadata columns
wineNW[is.na(wineNW)] <- 0 # replace NULL entries with 0
wineNW_mat <- as.matrix(wineNW)# convert the data.frame to a matrix
cosineSim_mat <- cosine(wineNW_mat) # calculate the cosine similarities of all the customersdiag(cosineSim_mat) <- 0 # assign self-similarities *the diagonal in the matrix) to be of value 0
# prune the data using the r-neighborhood approach, create the graph object, plot the graph
cosineSim_mat[(cosineSim_mat < 0.5)] <-0 # prune the graph. Similarities of strength <0.5 are set to 0
cosineSim_mat[(cosineSim_mat >= 0.5)] <-1
cosineSim_graph <- graph.adjacency(cosineSim_mat,mode = 'undirected', weighted = TRUE) # convert the matrix to an igraph objectplot(cosineSim_graph)

DS_Ch5_graph_plot

Fig. 1 A view of the pruned graph before modularity maximization is applied

# create a matrix of inter-vertex scores
membershipVector <- numeric(100) # creates a vector of length 100. Each element is set equal to zero. 
modMatrix <- mod.matrix(cosineSim_graph, membership = membershipVector) # generates a matrix of inter-vertex scores
 
# use the multilevel.community function in igraph to optimize the graph modularity 
wineMod_opt = multilevel.community(cosineSim_graph) # selects optimum cluster assignments 
wineMod_opt
modularity(cosineSim_graph, membership =wineMod_opt$membership) # calculates final graph modularity
for (i in 1:max(wineMod_opt$membership)){print(wineMod_opt$names[wineMod_opt$membership == i]) # print list of members belonging to cluster i
} # prints out members by cluster assignment

Graph community structure calculated with the multi level algorithm
Number of communities (best split): 6
Modularity (best split): 0.5489076

Community #1
[1] “Anderson” “Bell” “Campbell” “Cook” “Cox” “Flores” “Gray”
[8] “Jenkins” “Johnson” “Moore” “Morris” “Phillips” “Rodriguez” “Russell”
[15] “Smith”

Community #2
[1] “Adams” “Bailey” “Brown” “Carter” “Cruz” “Diaz” “Green”
[8] “Hernandez” “Hill” “Hughes” “James” “King” “Lewis” “Murphy”
[15] “Myers” “Perez” “Perry” “Rivera” “Robinson” “Ross” “Stewart”
[22] “Taylor” “Walker” “Watson”

Community #3
[1] “Allen” “Baker” “Barnes” “Clark” “Cooper” “Evans” “Foster”
[8] “Garcia” “Gomez” “Gonzalez” “Harris” “Kelly” “Lee” “Long”
[15] “Lopez” “Miller” “Morales” “Nguyen” “Powell” “Price” “Ramirez”
[22] “Reed” “Reyes” “Richardson” “Roberts” “Rogers” “Sanchez” “Sanders”
[29] “Scott” “Thomas” “Thompson” “Turner” “Ward” “White” “Williams”
[36] “Wood” “Wright” “Young”

Community #4
[1] “Parker”

Community #5
[1] “Bennett” “Brooks” “Edwards” “Gutierrez” “Jones” “Morgan” “Nelson”
[8] “Ortiz” “Sullivan” “Torres” “Wilson”

Community #6
[1] “Butler” “Collins” “Davis” “Fisher” “Hall” “Howard” “Jackson” “Martin”
[9] “Martinez” “Mitchell” “Peterson”

# evaluate number of orders by cluster, for each offer
sumOrdersByCluster <- (aggregate(t(wineNW),by = list(wineMod_opt$membership),sum))
wineConclusion <- cbind(t(sumOrdersByCluster[,-1]), wineData)
 
# print out the top features each cluster
for (i in 1:max(wineMod_opt$membership)){print(i)print(head(wineConclusion[order(wineConclusion[,i], decreasing =TRUE),1:12]))
}

Cluster #1 Likes Pinot Noir
 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%) Origin
 V24 12 0 0 0 0 0 24 September Pinot Noir 6 34 Italy
 V26 11 0 3 0 0 1 26 October Pinot Noir 144 83 Australia
 V17 7 0 0 0 0 0 17 July Pinot Noir 12 47 Germany
 V2 5 0 0 0 0 5 2 January Pinot Noir 72 17 France
 V12 1 1 0 0 0 3 12 May Prosecco 72 83 Australia
 V16 1 0 3 1 0 0 16 June Merlot 72 88 California

Cluster #2 prefers low volume purchases
 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%) Origin
 V30 0 15 3 0 1 3 30 December Malbec 6 54 France
 V7 0 14 5 0 0 0 7 March Prosecco 6 40 Australia
 V29 0 14 0 1 2 0 29 November Pinot Grigio 6 87 France
 V18 0 11 1 0 2 0 18 July Espumante 6 50 Oregon
 V8 0 7 2 0 11 0 8 March Espumante 6 45 South Africa
 V13 0 5 0 0 1 0 13 May Merlot 6 43 Chile

Cluster # 3 appears to like sparkling wines
 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%) Origin
 V22 0 0 14 0 1 6 22 August Champagne 72 63 France
 V31 0 0 14 1 1 1 31 December Champagne 72 89 France
 V6 0 0 11 0 1 0 6 March Prosecco 144 86 Chile
 V4 0 0 10 0 1 1 4 February Champagne 72 48 France
 V9 0 0 10 0 0 0 9 April Chardonnay 144 57 Chile
 V14 0 0 9 0 0 0 14 June Merlot 72 64 Chile

Cluster #4 consists of a single customer, Parker, who is isolated in the graph
 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%) Origin
 V11 0 0 5 1 1 6 11 May Champagne 72 85 France
 V16 1 0 3 1 0 0 16 June Merlot 72 88 California
 V20 0 0 5 1 0 0 20 August Cabernet Sauvignon 72 82 Italy
 V29 0 14 0 1 2 0 29 November Pinot Grigio 6 87 France
 V31 0 0 14 1 1 1 31 December Champagne 72 89 France
 V1 0 0 5 0 0 5 1 January Malbec 72 56 France

Cluster #5 dominated by Espumante 
 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%) Origin
 V8 0 7 2 0 11 0 8 March Espumante 6 45 South Africa
 V3 0 0 4 0 2 0 3 February Espumante 144 32 Oregon
 V18 0 11 1 0 2 0 18 July Espumante 6 50 Oregon
 V29 0 14 0 1 2 0 29 November Pinot Grigio 6 87 France
 V4 0 0 10 0 1 1 4 February Champagne 72 48 France
 V6 0 0 11 0 1 0 6 March Prosecco 144 86 Chile

 Cluster #6 likes French wines
 1 2 3 4 5 6 Offer # Campaign Varietal Minimum Qty (kg) Discount (%) Origin
 V11 0 0 5 1 1 6 11 May Champagne 72 85 France
 V22 0 0 14 0 1 6 22 August Champagne 72 63 France
 V1 0 0 5 0 0 5 1 January Malbec 72 56 France
 V2 5 0 0 0 0 5 2 January Pinot Noir 72 17 France
 V28 0 1 1 0 0 4 28 November Cabernet Sauvignon 12 56 France
 V12 1 1 0 0 0 3 12 May Prosecco 72 83 Australia

Conclusion
The graph analysis package, igraph, is a powerful resource, well capable of implementing and scaling up the modularity maximization process developed by the author using Excel in Data Smart.

Gerard

@PugData

Leave a Reply

Your email address will not be published.