
Document clustering in quanteda?

Open jrosen48 opened this issue 9 years ago • 1 comments

I'm interested in very basic document clustering. Is there a plan to include document clustering in quanteda - or is this considered outside the scope of the (so excellent) package?

I ask in part because there does not appear to be a package for document clustering, and it is somewhat more complicated than just passing a document-term matrix to a clustering function. I am working on a [package for document clustering, cluster-compare-text](https://github.com/jrosen48/cluster-compare-text) that uses quanteda.

jrosen48 avatar Mar 13 '16 14:03 jrosen48

Hi - This would be a great addition. I’ve been encouraging some potential contributors to consider designing companion packages that require quanteda, and would be very happy to assist with this project.

Most clustering can be done directly from a dfm. For instance, see the code below, which I used last week when teaching clustering in my quantitative text analysis class. If I were adding clustering methods, I would define them for dfm objects and create a new output class for clustering results, with additional methods such as plot, summary, and extractor functions defined for it. This could also include prediction or fit methods, if you want to supply the validation you describe on your project web page, assuming tags could be supplied externally for each item.
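A minimal sketch of that design, using the current quanteda API — note that `textmodel_cluster` and the `"textmodel_cluster"` class are hypothetical names for illustration, not part of quanteda:

```r
library(quanteda)

# hypothetical constructor: cluster a dfm and return a dedicated class
textmodel_cluster <- function(x, k) {
  stopifnot(is.dfm(x))
  # k-means on relative term frequencies
  fit <- kmeans(as.matrix(dfm_weight(x, scheme = "prop")), centers = k)
  structure(list(k = k,
                 clusters = split(docnames(x), fit$cluster),
                 kmeans = fit),
            class = "textmodel_cluster")
}

# summary method defined for the new output class
summary.textmodel_cluster <- function(object, ...) {
  cat("k-means clustering of", length(unlist(object$clusters)),
      "documents into", object$k, "clusters\n")
  print(object$clusters)
  invisible(object)
}
```

The point is the pattern: the constructor accepts a dfm, the returned object carries everything needed for downstream methods, and plot/predict methods could be added to the same class.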

What clustering methods (in addition to, for instance, the k-means and hierarchical clustering shown below) did you have in mind?

Ken

## Code for examples from Week 8, clustering
## Ken Benoit <[email protected]>

require(quanteda)

## k-means clustering
data(SOTUCorpus, package = "quantedaData")
presDfm <- dfm(subset(SOTUCorpus, Date > as.Date("1960-01-01")), 
               ignoredFeatures = stopwords("english"), stem = TRUE)

presDfm <- trim(presDfm, minCount = 5, minDoc = 3)
# rule-of-thumb starting point: k = sqrt(n/2)
k <- round(sqrt(ndoc(presDfm)/2))
clusterk <- kmeans(tf(presDfm, "prop"), k)
split(docnames(presDfm), clusterk$cluster)

clusterk3 <- kmeans(tf(presDfm, "prop"), 3)
split(docnames(presDfm), clusterk3$cluster)

clusterk2 <- kmeans(tf(presDfm, "prop"), 2)
split(docnames(presDfm), clusterk2$cluster)


## hierarchical clustering
# get distances on normalised dfm
presDistMat <- dist(as.matrix(weight(presDfm, "relFreq")))
# hierarchical clustering on the distance object
presCluster <- hclust(presDistMat)
# label with document names
presCluster$labels <- docnames(presDfm)
# plot as a dendrogram
plot(presCluster)

## hierarchical clustering on words
# weight by relative term frequency
wordDfm <- sort(tf(presDfm, "prop"))  # sort features in decreasing order of total frequency
wordDfm <- t(wordDfm)[1:100, ]        # transpose and keep the 100 most frequent words
wordDistMat <- dist(wordDfm)
wordCluster <- hclust(wordDistMat)
plot(wordCluster, labels = docnames(wordDfm),
     xlab="", main="Relative Term Frequency weighting")

# repeat without word "will"
wordDfm <- removeFeatures(wordDfm, "will")
wordDistMat <- dist(wordDfm)
wordCluster <- hclust(wordDistMat)
plot(wordCluster, labels = docnames(wordDfm), 
     xlab="", main="Relative Term Frequency without \"will\"")

# with tf-idf weighting
wordDfm <- sort(weight(presDfm, "tfidf"))  # sort in decreasing order of total word freq
wordDfm <- removeFeatures(wordDfm, c("will", "year", "s"))
wordDfm <- t(wordDfm)[1:100,]  # because transposed
wordDistMat <- dist(wordDfm)
wordCluster <- hclust(wordDistMat)
plot(wordCluster, labels = docnames(wordDfm),
     xlab="", main="tf-idf Frequency weighting")

# a different representation
wordCluster2 <- as.dendrogram(wordCluster)
plot(wordCluster2)
# Color the branches using color_branches() from 'dendextend' package
require(dendextend)
myColorBranch <- color_branches(wordCluster2, k=5)
plot(myColorBranch)
# to colour the labels as well, pass the branch-coloured dendrogram to color_labels()
myAllColor <- color_labels(myColorBranch, k=2)
plot(myAllColor)
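For readers on current quanteda releases: the functions used above (`dfm()` with `ignoredFeatures`, `tf()`, `trim()`, `weight()`, `removeFeatures()`) have since been renamed. A roughly equivalent sketch with the modern API — substituting the built-in `data_corpus_inaugural` for the `SOTUCorpus` data used above, so the exact output will differ:

```r
library(quanteda)

# tokens are now processed before constructing the dfm
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)
toks <- tokens_wordstem(tokens_remove(toks, stopwords("english")))
presDfm <- dfm_trim(dfm(toks), min_termfreq = 5, min_docfreq = 3)

# k-means on relative term frequencies (dfm_weight replaces tf()/weight())
k <- round(sqrt(ndoc(presDfm)/2))
cl <- kmeans(as.matrix(dfm_weight(presDfm, scheme = "prop")), centers = k)
split(docnames(presDfm), cl$cluster)

# hierarchical clustering of the 100 most frequent words
top <- names(topfeatures(presDfm, 100))
wordMat <- t(as.matrix(dfm_weight(dfm_select(presDfm, pattern = top,
                                             valuetype = "fixed"),
                                  scheme = "prop")))
plot(hclust(dist(wordMat)), xlab = "", main = "Relative term frequency weighting")
```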

kbenoit avatar Mar 15 '16 13:03 kbenoit