## Executive Summary

Chapter 5 of John Foreman's book *Data Smart* looks at data that can be arranged as a network graph of related data points. It uses a cluster analysis technique called Modularity Maximization to optimize cluster assignments for the graph data.

We can implement the same process succinctly in R, making use of functions from the igraph and lsa packages. The modularity score returned, 0.549, is marginally better than the value of 0.546 calculated using the Excel-based process, and matches the score calculated in the book using Gephi, the well-known open-source graph analysis tool.

**Process Outline**

The same wine promotion dataset used in Chapter 2 to explain and develop the k-means clustering algorithm is used here.

The author implements the following steps to perform community detection (cluster analysis) on the data in Excel:

- Prepare the data for graph analysis
- Calculate the cosine similarity in wine promotion purchases for each customer pairing. The customers form the graph vertices and the cosine similarity defines the edge connections in the graph
- Prune the graph using the r-Neighborhood technique, keeping only edges with similarity scores at or above the 80th percentile
- Calculate new scores for the inter-customer relationships, taking into account the probability that each pair would be connected if the graph's edges were placed at random
- Calculate graph modularity arising from cluster assignments of the customers. Adjust cluster assignments to maximize graph modularity.
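Before turning to the full script, the cosine similarity step can be illustrated in miniature. The sketch below uses base R only (no lsa package) on a made-up two-customer purchase matrix; the customer names and data are illustrative, not from the book's dataset:

```r
# Cosine similarity of purchase histories, computed in base R.
# Toy data (made up for illustration): 3 offers (rows) x 2 customers (columns),
# with 1 = purchased and 0 = not purchased.
purchases <- matrix(c(1, 0,
                      1, 1,
                      0, 1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(NULL, c("custA", "custB")))

# cosine(x, y) = (x . y) / (||x|| * ||y||)
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

sim <- cosine_sim(purchases[, "custA"], purchases[, "custB"])
sim  # 1 / (sqrt(2) * sqrt(2)) = 0.5: the customers share one of their two purchases
```

The lsa package's `cosine()` does the same pairwise calculation across every pair of columns at once, which is why the full script below can build the whole customer-by-customer similarity matrix in one call.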

**A note on graph modularity**

Graph modularity works by assessing a probability-of-connection score for each pair of nodes in a network, taking into account the degree (number of connections) of each node in the graph. When clusters are formed, the graph modularity score is incremented for each pair within a cluster that has a connection between them; the modularity score is penalized for each pair within a cluster that has no connection between them. The process of modularity maximization adjusts the number of clusters and the cluster assignments to find the optimum graph modularity score.
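To make the definition concrete, here is a small self-contained check (assuming the igraph package is installed) that computes modularity by hand on a toy graph of two triangles joined by one bridging edge, and compares it with igraph's built-in `modularity()`:

```r
library(igraph)

# Toy graph: two triangles joined by a single bridging edge (7 edges in all)
g <- graph_from_literal(a - b, b - c, c - a, c - d, d - e, e - f, f - d)
membership <- c(1, 1, 1, 2, 2, 2)  # one cluster per triangle

# By-hand modularity: Q = (1/2m) * sum over pairs i,j in the same cluster
# of [A_ij - k_i * k_j / (2m)], where A is the adjacency matrix,
# k the vertex degrees, and m the edge count
A <- as.matrix(as_adjacency_matrix(g))
k <- degree(g)
m <- ecount(g)
same_cluster <- outer(membership, membership, "==")
q_manual <- sum((A - outer(k, k) / (2 * m)) * same_cluster) / (2 * m)

q_igraph <- modularity(g, membership)
c(q_manual, q_igraph)  # both 5/14, approximately 0.357
```

Each connected within-cluster pair contributes positively; each unconnected within-cluster pair is penalized by its expected-connection term, exactly as described above.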

**The R Code**

```r
# R script to implement community detection on the wine dataset
# from Chapter 5 of the book Data Smart, by John Foreman

library(igraph)  # graph construction and community detection
library(lsa)     # provides the cosine() similarity function

# read in the wine data, convert the data to a matrix,
# and calculate similarities between vertices
wineData <- read.csv(".\\WineNetwork.csv", header = TRUE, check.names = FALSE)
wineNW <- wineData[, -c(1, 2, 3, 4, 5, 6, 7)]  # remove the metadata columns
wineNW[is.na(wineNW)] <- 0                     # replace NULL entries with 0
wineNW_mat <- as.matrix(wineNW)                # convert the data.frame to a matrix
cosineSim_mat <- cosine(wineNW_mat)            # cosine similarities of all the customers
diag(cosineSim_mat) <- 0                       # set self-similarities (the diagonal) to 0

# prune the graph using the r-neighborhood approach,
# create the graph object, and plot the graph
cosineSim_mat[cosineSim_mat < 0.5] <- 0    # similarities of strength < 0.5 are set to 0
cosineSim_mat[cosineSim_mat >= 0.5] <- 1   # remaining connections are given weight 1
cosineSim_graph <- graph.adjacency(cosineSim_mat, mode = "undirected", weighted = TRUE)
plot(cosineSim_graph)
```

**Fig. 1 A view of the pruned graph before modularity maximization is applied**
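If you want the cutoff to track the 80th percentile from the process outline rather than a hard-coded 0.5, it can be derived with `quantile()`. The matrix below is a random stand-in for `cosineSim_mat`, used only to keep the sketch self-contained:

```r
set.seed(42)

# Stand-in for a symmetric similarity matrix (values in [0, 1], zero diagonal)
sim <- matrix(runif(25), nrow = 5)
sim <- (sim + t(sim)) / 2
diag(sim) <- 0

# r-neighborhood pruning at the 80th percentile of the off-diagonal similarities
r <- quantile(sim[upper.tri(sim)], probs = 0.8)
pruned <- ifelse(sim >= r, 1, 0)
sum(pruned[upper.tri(pruned)])  # number of edges kept: the top 20% of pairs
```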

```r
# create a matrix of inter-vertex scores
membershipVector <- numeric(100)  # vector of length 100, each element set to zero
modMatrix <- mod.matrix(cosineSim_graph, membership = membershipVector)  # inter-vertex scores

# use the multilevel.community function in igraph to optimize the graph modularity
wineMod_opt <- multilevel.community(cosineSim_graph)  # selects optimum cluster assignments
wineMod_opt
modularity(cosineSim_graph, membership = wineMod_opt$membership)  # final graph modularity

# print out members by cluster assignment
for (i in 1:max(wineMod_opt$membership)) {
  print(wineMod_opt$names[wineMod_opt$membership == i])  # members of cluster i
}
```

```
Graph community structure calculated with the multi level algorithm
Number of communities (best split): 6
Modularity (best split): 0.5489076

Community #1
 [1] "Anderson"  "Bell"      "Campbell"  "Cook"      "Cox"       "Flores"    "Gray"
 [8] "Jenkins"   "Johnson"   "Moore"     "Morris"    "Phillips"  "Rodriguez" "Russell"
[15] "Smith"

Community #2
 [1] "Adams"     "Bailey"    "Brown"     "Carter"    "Cruz"      "Diaz"      "Green"
 [8] "Hernandez" "Hill"      "Hughes"    "James"     "King"      "Lewis"     "Murphy"
[15] "Myers"     "Perez"     "Perry"     "Rivera"    "Robinson"  "Ross"      "Stewart"
[22] "Taylor"    "Walker"    "Watson"

Community #3
 [1] "Allen"      "Baker"      "Barnes"     "Clark"      "Cooper"     "Evans"      "Foster"
 [8] "Garcia"     "Gomez"      "Gonzalez"   "Harris"     "Kelly"      "Lee"        "Long"
[15] "Lopez"      "Miller"     "Morales"    "Nguyen"     "Powell"     "Price"      "Ramirez"
[22] "Reed"       "Reyes"      "Richardson" "Roberts"    "Rogers"     "Sanchez"    "Sanders"
[29] "Scott"      "Thomas"     "Thompson"   "Turner"     "Ward"       "White"      "Williams"
[36] "Wood"       "Wright"     "Young"

Community #4
[1] "Parker"

Community #5
 [1] "Bennett"   "Brooks"    "Edwards"   "Gutierrez" "Jones"     "Morgan"    "Nelson"
 [8] "Ortiz"     "Sullivan"  "Torres"    "Wilson"

Community #6
[1] "Butler"   "Collins"  "Davis"    "Fisher"   "Hall"     "Howard"   "Jackson"  "Martin"
[9] "Martinez" "Mitchell" "Peterson"
```
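As an aside, `multilevel.community()` is igraph's implementation of the Louvain method; in recent igraph releases the same routine is exposed as `cluster_louvain()`, and which name works may depend on your installed version. A minimal stand-alone run on igraph's bundled Zachary karate-club graph:

```r
library(igraph)

# Louvain community detection on a small built-in example graph
g <- make_graph("Zachary")         # Zachary's karate club network (34 members)
communities <- cluster_louvain(g)  # same algorithm as multilevel.community()

sizes(communities)                      # members per detected community
modularity(g, membership(communities))  # modularity of the chosen split
```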

```r
# evaluate the number of orders by cluster, for each offer
sumOrdersByCluster <- aggregate(t(wineNW), by = list(wineMod_opt$membership), sum)
wineConclusion <- cbind(t(sumOrdersByCluster[, -1]), wineData)

# print out the top features for each cluster
for (i in 1:max(wineMod_opt$membership)) {
  print(i)
  print(head(wineConclusion[order(wineConclusion[, i], decreasing = TRUE), 1:12]))
}
```

**Cluster #1 likes Pinot Noir**

```
     1  2  3  4  5  6  Offer #  Campaign   Varietal    Minimum Qty (kg)  Discount (%)  Origin
V24 12  0  0  0  0  0       24  September  Pinot Noir                 6            34  Italy
V26 11  0  3  0  0  1       26  October    Pinot Noir               144            83  Australia
V17  7  0  0  0  0  0       17  July       Pinot Noir                12            47  Germany
V2   5  0  0  0  0  5        2  January    Pinot Noir                72            17  France
V12  1  1  0  0  0  3       12  May        Prosecco                  72            83  Australia
V16  1  0  3  1  0  0       16  June       Merlot                    72            88  California
```

**Cluster #2 prefers low volume purchases**

```
     1   2  3  4   5  6  Offer #  Campaign  Varietal      Minimum Qty (kg)  Discount (%)  Origin
V30  0  15  3  0   1  3       30  December  Malbec                       6            54  France
V7   0  14  5  0   0  0        7  March     Prosecco                     6            40  Australia
V29  0  14  0  1   2  0       29  November  Pinot Grigio                 6            87  France
V18  0  11  1  0   2  0       18  July      Espumante                    6            50  Oregon
V8   0   7  2  0  11  0        8  March     Espumante                    6            45  South Africa
V13  0   5  0  0   1  0       13  May       Merlot                       6            43  Chile
```

**Cluster #3 appears to like sparkling wines**

```
     1  2   3  4  5  6  Offer #  Campaign  Varietal    Minimum Qty (kg)  Discount (%)  Origin
V22  0  0  14  0  1  6       22  August    Champagne                 72            63  France
V31  0  0  14  1  1  1       31  December  Champagne                 72            89  France
V6   0  0  11  0  1  0        6  March     Prosecco                 144            86  Chile
V4   0  0  10  0  1  1        4  February  Champagne                 72            48  France
V9   0  0  10  0  0  0        9  April     Chardonnay               144            57  Chile
V14  0  0   9  0  0  0       14  June      Merlot                    72            64  Chile
```

**Cluster #4 consists of a single customer, Parker, who is isolated in the graph**

```
     1   2   3  4  5  6  Offer #  Campaign  Varietal            Minimum Qty (kg)  Discount (%)  Origin
V11  0   0   5  1  1  6       11  May       Champagne                         72            85  France
V16  1   0   3  1  0  0       16  June      Merlot                            72            88  California
V20  0   0   5  1  0  0       20  August    Cabernet Sauvignon                72            82  Italy
V29  0  14   0  1  2  0       29  November  Pinot Grigio                       6            87  France
V31  0   0  14  1  1  1       31  December  Champagne                         72            89  France
V1   0   0   5  0  0  5        1  January   Malbec                            72            56  France
```

**Cluster #5 is dominated by Espumante**

```
     1   2   3  4   5  6  Offer #  Campaign  Varietal      Minimum Qty (kg)  Discount (%)  Origin
V8   0   7   2  0  11  0        8  March     Espumante                    6            45  South Africa
V3   0   0   4  0   2  0        3  February  Espumante                  144            32  Oregon
V18  0  11   1  0   2  0       18  July      Espumante                    6            50  Oregon
V29  0  14   0  1   2  0       29  November  Pinot Grigio                 6            87  France
V4   0   0  10  0   1  1        4  February  Champagne                   72            48  France
V6   0   0  11  0   1  0        6  March     Prosecco                   144            86  Chile
```

**Cluster #6 likes French wines**

```
     1  2   3  4  5  6  Offer #  Campaign  Varietal            Minimum Qty (kg)  Discount (%)  Origin
V11  0  0   5  1  1  6       11  May       Champagne                         72            85  France
V22  0  0  14  0  1  6       22  August    Champagne                         72            63  France
V1   0  0   5  0  0  5        1  January   Malbec                            72            56  France
V2   5  0   0  0  0  5        2  January   Pinot Noir                        72            17  France
V28  0  1   1  0  0  4       28  November  Cabernet Sauvignon                12            56  France
V12  1  1   0  0  0  3       12  May       Prosecco                          72            83  Australia
```

**Conclusion**

The graph analysis package igraph is a powerful resource, well capable of implementing, and scaling up, the modularity maximization process that the author develops in Excel in *Data Smart*.

Gerard

@PugData