R and Hierarchical Agglomerative Clustering
January 10, 2012 at 9:53 pm
I struggled for the past few hours to draw a dendrogram from a given similarity matrix. After searching through various cluster analysis software, I finally decided to go ahead with R. Most of the cluster analysis software available on the web starts from the raw data and generates either a similarity matrix or a distance matrix, which is stored as an object in memory, and then uses that matrix to generate clusters.
Since I am using a different method for calculating similarity, the similarity measures built into that software were of no use to me. FYI, I am trying to discover latent semantic structure between terms, and I am using SVD to generate the similarity between terms.
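For readers who want to see how an SVD can yield a term-term similarity matrix, here is a minimal sketch. It is not the exact procedure used for the matrix in this post: the term-document counts in X, the number of latent dimensions k, and the choice of cosine similarity are all assumptions made for illustration.

```r
# Hypothetical term-document counts: rows are terms, columns are documents.
X <- matrix(c(2, 0, 1, 0,
              0, 3, 0, 1,
              1, 1, 2, 0,
              0, 0, 0, 4),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("java", "game", "application", "travel"),
                            paste0("doc", 1:4)))

k <- 2                                 # latent dimensions kept (assumed, must be >= 2 here)
s <- svd(X)                            # X = U diag(d) V'
Tk <- s$u[, 1:k] %*% diag(s$d[1:k])    # term coordinates in the latent space
rownames(Tk) <- rownames(X)

# Cosine similarity between the term vectors
unit <- Tk / sqrt(rowSums(Tk^2))       # scale every row to unit length
S <- unit %*% t(unit)
print(round(S, 2))
```

The result S has ones on the diagonal and values between -1 and 1 elsewhere, which is the same shape of input the rest of this post works with.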
I have a similarity matrix with terms as both rows and columns (J = java, G = game, A = application, T = travel, I = iphone), which looks as follows:
,J,G,A,T,I
J,1,-0.12,0.84,0,0.19
G,-0.12,1,0.43,0,0.95
A,0.84,0.43,1,0,0.69
T,0,0,0,1,0
I,0.19,0.95,0.69,0,1
R is an excellent tool for this: it lets me input a similarity matrix and convert it into a distance matrix. The distance matrix is then used to generate clusters, which can be plotted as a dendrogram, quick and easy. The following R script achieves the objective.
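# Similarity matrix between the five terms
A <- matrix(c(1,-0.12,0.84,0,0.19,-0.12,1,0.43,0,0.95,0.84,0.43,1,0,0.69,0,0,0,1,0,0.19,
0.95,0.69,0,1),nrow=5,ncol=5,byrow=TRUE)
dimnames(A) <- list(c("java","game","application","travel","iphone"),c("java","game",
"application","travel","iphone"))
# Convert similarities to distances: d(i,j) = sqrt(s(i,i) + s(j,j) - 2*s(i,j))
sim2dist <- function(mx) as.dist(sqrt(outer(diag(mx), diag(mx), "+") - 2*mx))
D <- sim2dist(A)
# Hierarchical agglomerative clustering (hclust uses complete linkage by default)
hc <- hclust(D)
plot(hc)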
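One way to read sim2dist, under the assumption (not stated above) that the similarities behave like inner products of unit-length term vectors: the squared distance between two vectors is the sum of their self-similarities minus twice their mutual similarity. Worked out for two pairs from the matrix above:

```latex
\[
d_{ij} = \sqrt{s_{ii} + s_{jj} - 2\,s_{ij}}
\]
% Every diagonal entry of the matrix is 1, so d_{ij} = \sqrt{2 - 2 s_{ij}}.
\[
d_{\text{game},\text{iphone}} = \sqrt{2 - 2(0.95)} = \sqrt{0.10} \approx 0.316,
\qquad
d_{\text{java},\text{game}} = \sqrt{2 - 2(-0.12)} = \sqrt{2.24} \approx 1.497
\]
```

Highly similar terms end up close together, and a similarity of 0 (as for travel against every other term) maps to a distance of about 1.414.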
The output looks as follows:
[Figure: dendrogram of the five terms]
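The same tree can also be inspected as text, which helps when the plot is not at hand. The sketch below rebuilds the objects from the script above and uses cutree from base R; cutting into k = 3 groups is an arbitrary choice made for illustration.

```r
# Rebuild the similarity matrix and the tree from the script above
terms <- c("java", "game", "application", "travel", "iphone")
A <- matrix(c(1, -0.12, 0.84, 0, 0.19,
              -0.12, 1, 0.43, 0, 0.95,
              0.84, 0.43, 1, 0, 0.69,
              0, 0, 0, 1, 0,
              0.19, 0.95, 0.69, 0, 1),
            nrow = 5, ncol = 5, byrow = TRUE,
            dimnames = list(terms, terms))
sim2dist <- function(mx) as.dist(sqrt(outer(diag(mx), diag(mx), "+") - 2 * mx))
hc <- hclust(sim2dist(A))

print(round(hc$height, 3))      # distance at which each merge happens
groups <- cutree(hc, k = 3)     # cut the tree into 3 clusters (k is an assumption)
print(groups)
```

With k = 3, java is grouped with application, game with iphone, and travel is left on its own.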
We can also use agnes (from the cluster package) for HAC, or kmeans for partitional clustering:
A <- matrix(c(1,-0.12,0.84,0,0.19,
-0.12,1,0.43,0,0.95,0.84,0.43,1,0,0.69,0,0,0,1,0,0.19,
0.95,0.69,0,1),nrow=5,ncol=5,byrow=TRUE)
dimnames(A) <- list(c("java","game","application","travel","iphone"),
c("java","game","application","travel","iphone"))
#sim2dist <- function(mx) as.dist(sqrt(outer(diag(mx), diag(mx), "+") - 2*mx))
#D = sim2dist(A)
#Using hclust for HAC
#hc = hclust(D)
#plot(hc)
#Using agnes for HAC; diss=FALSE treats the rows of A as data points
#library(cluster)
#hc <- agnes(A,diss=FALSE,metric="euclidean",stand=FALSE,method="single")
#print(hc)
#plot(hc,ask=FALSE,which.plots=NULL)
#Using kmeans for clustering: 3 centers, at most 15 iterations
km <- kmeans(A,3,15)
print(km)
#plot() on a matrix uses its first two columns as x and y
plot(A, col=km$cluster)
points(km$centers,col=1:3,pch=8)
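The commented-out agnes call above passes the similarity matrix itself with diss=FALSE, so agnes computes Euclidean distances between the rows of A. A sketch of the alternative, handing agnes the converted distance matrix with diss=TRUE, is shown below. It assumes the cluster package is installed; average linkage and k = 3 are arbitrary choices for illustration.

```r
library(cluster)   # provides agnes

terms <- c("java", "game", "application", "travel", "iphone")
A <- matrix(c(1, -0.12, 0.84, 0, 0.19,
              -0.12, 1, 0.43, 0, 0.95,
              0.84, 0.43, 1, 0, 0.69,
              0, 0, 0, 1, 0,
              0.19, 0.95, 0.69, 0, 1),
            nrow = 5, ncol = 5, byrow = TRUE,
            dimnames = list(terms, terms))
sim2dist <- function(mx) as.dist(sqrt(outer(diag(mx), diag(mx), "+") - 2 * mx))
D <- sim2dist(A)

# diss = TRUE tells agnes that D already holds dissimilarities
ag <- agnes(D, diss = TRUE, method = "average")
print(ag$ac)                               # agglomerative coefficient

# Convert to an hclust object so that cutree() can be reused
ag_groups <- cutree(as.hclust(ag), k = 3)
print(ag_groups)
```

Which input to use depends on what the rows of A are taken to mean: as a table of pairwise similarities, the diss=TRUE route is the more direct reading.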
Entry filed under: Clustering. Tags: clustering, HAC, R, similarity matrix.
