R and Hierarchical Agglomerative Clustering

January 10, 2012 at 9:53 pm

I struggled for the past few hours to draw a dendrogram for a given similarity matrix. After searching through various cluster analysis software, I finally decided to go ahead with R. Most of the cluster analysis software available on the web starts with the initial data and generates either a similarity matrix or a distance matrix, which is stored as an object in memory. It then uses that matrix to generate clusters.

Since I am using a different method for calculating similarity, the built-in similarity measures in cluster analysis software were of no use to me. FYI, I am trying to discover latent semantic structure between terms. For that, I am using SVD to generate the similarity between terms.
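For readers wondering how SVD can yield a term-by-term similarity at all, here is a minimal sketch of one common approach (cosine similarity between term vectors in a truncated SVD space). It is not necessarily the exact procedure used for the matrix in this post: the term-by-document counts in tdm are made-up toy data, and keeping k = 2 latent dimensions is an assumption.

```r
# Hypothetical term-by-document count matrix (toy data for illustration only).
tdm <- matrix(c(2, 0, 1, 0,
                0, 3, 0, 1,
                1, 0, 2, 0,
                0, 0, 0, 4,
                0, 2, 1, 0),
              nrow = 5, byrow = TRUE,
              dimnames = list(c("java", "game", "application", "travel", "iphone"),
                              paste0("doc", 1:4)))

k <- 2                      # number of latent dimensions kept (an assumption)
s <- svd(tdm)

# Term coordinates in the k-dimensional latent space: U_k scaled by the singular values.
term_vecs <- s$u[, 1:k] %*% diag(s$d[1:k])
rownames(term_vecs) <- rownames(tdm)

# Cosine similarity between every pair of term vectors.
norms <- sqrt(rowSums(term_vecs^2))
S <- tcrossprod(term_vecs) / outer(norms, norms)
print(round(S, 2))
```

The result S has ones on the diagonal and values between -1 and 1 elsewhere, which is the same shape of input as the similarity matrix used in the rest of this post.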

I have a similarity matrix that consists of terms as rows and terms as columns, which looks as follows:

        J      G      A      T      I
J       1  -0.12   0.84      0   0.19
G   -0.12      1   0.43      0   0.95
A    0.84   0.43      1      0   0.69
T       0      0      0      1      0
I    0.19   0.95   0.69      0      1

R is an excellent tool that lets me input a similarity matrix and convert it into a distance matrix. The distance matrix is then used to generate clusters, which can be plotted as a dendrogram, quick and easy. The following R script achieves the objective.

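# Similarity matrix between the five terms, entered row by row
A<-matrix(c(1,-0.12,0.84,0,0.19,
-0.12,1,0.43,0,0.95,
0.84,0.43,1,0,0.69,
0,0,0,1,0,
0.19,0.95,0.69,0,1),nrow=5,ncol=5,byrow=TRUE)
dimnames(A)<-list(c("java","game","application","travel","iphone"),
c("java","game","application","travel","iphone"))
# Convert similarity to distance: d(i,j) = sqrt(s(i,i) + s(j,j) - 2*s(i,j))
sim2dist <- function(mx) as.dist(sqrt(outer(diag(mx), diag(mx), "+") - 2*mx))
D = sim2dist(A)
# hclust uses complete linkage by default
hc = hclust(D)
plot(hc)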

The output looks as follows:

[Figure: HAC Clustering]
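To see what sim2dist does with actual numbers, and to turn the dendrogram into explicit groups, a sketch like the following may help. Because the diagonal of the similarity matrix is all ones, the distance reduces to sqrt(2 - 2*s), so java and application (similarity 0.84) end up sqrt(0.32) apart. cutree is the standard stats function for cutting a tree; asking for k = 3 groups is my own choice for illustration.

```r
# Same similarity matrix as above, entered row by row.
A <- matrix(c(1, -0.12, 0.84, 0, 0.19,
              -0.12, 1, 0.43, 0, 0.95,
              0.84, 0.43, 1, 0, 0.69,
              0, 0, 0, 1, 0,
              0.19, 0.95, 0.69, 0, 1), nrow = 5, ncol = 5, byrow = TRUE)
term_names <- c("java", "game", "application", "travel", "iphone")
dimnames(A) <- list(term_names, term_names)

# d(i,j) = sqrt(s(i,i) + s(j,j) - 2*s(i,j))
sim2dist <- function(mx) as.dist(sqrt(outer(diag(mx), diag(mx), "+") - 2 * mx))
D <- sim2dist(A)

# Inspect the pairwise distances as a full matrix.
print(round(as.matrix(D), 3))

# Build the tree (complete linkage is the hclust default) and cut it into 3 groups.
hc <- hclust(D)
groups <- cutree(hc, k = 3)
print(groups)
```

With three groups, java pairs with application, game pairs with iphone, and travel stays on its own, which matches what the dendrogram shows.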

We can also use the agnes function (from the cluster package) for HAC, and kmeans for k-means clustering:

A<-matrix(c(1,-0.12,0.84,0,0.19,
-0.12,1,0.43,0,0.95,
0.84,0.43,1,0,0.69,
0,0,0,1,0,
0.19,0.95,0.69,0,1),nrow=5,ncol=5,byrow=TRUE)
dimnames(A)<-list(c("java","game","application","travel","iphone"),
c("java","game","application","travel","iphone"))
#sim2dist <- function(mx) as.dist(sqrt(outer(diag(mx), diag(mx), "+") - 2*mx))
#D = sim2dist(A)
#Using hclust for HAC
#hc = hclust(D)
#plot(hc)

#using agnes for HAC (needs the cluster package)
#library(cluster)
#hc <- agnes(A,diss=FALSE,metric="euclidean",stand=FALSE,method="single")
#print(hc)
#plot(hc,ask=FALSE,which.plots=NULL)

#using kmeans for clustering (3 centers, at most 15 iterations)
km<-kmeans(A,3,15)
print(km)
#plot the first two columns of A, coloured by cluster
plot(A, col=km$cluster)
#mark the 3 cluster centers
points(km$centers,col=1:3,pch=8)
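One thing to keep in mind: calling agnes with diss=FALSE treats each row of A as a feature vector and computes Euclidean distances between rows. If the intent is to cluster on the converted distances instead, a sketch along these lines may be closer to it. It assumes the cluster package is installed, and the k = 2 cut is my own choice for illustration.

```r
library(cluster)  # provides agnes()

A <- matrix(c(1, -0.12, 0.84, 0, 0.19,
              -0.12, 1, 0.43, 0, 0.95,
              0.84, 0.43, 1, 0, 0.69,
              0, 0, 0, 1, 0,
              0.19, 0.95, 0.69, 0, 1), nrow = 5, ncol = 5, byrow = TRUE)
term_names <- c("java", "game", "application", "travel", "iphone")
dimnames(A) <- list(term_names, term_names)

sim2dist <- function(mx) as.dist(sqrt(outer(diag(mx), diag(mx), "+") - 2 * mx))
D <- sim2dist(A)

# diss = TRUE tells agnes that D already holds dissimilarities.
ag <- agnes(D, diss = TRUE, method = "single")
print(ag)

# Convert to an hclust object so cutree can extract flat groups.
groups <- cutree(as.hclust(ag), k = 2)
print(groups)
```

With single linkage and two groups, travel is separated from the other four terms, since its similarity to everything else is 0.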

Entry filed under: Clustering.
