Different number of clusters of two converged dp model objects trained on same dataset?
Hi @dm13450, I ran the dirichletprocess package twice on the same data set and called the two objects dp and dp2 (see below). I checked convergence following the approach on your blog; all parameters seemed to converge, with Gelman-Rubin diagnostics well below 1.1. The summaries of the two objects show a mean number of clusters of 5.26 and 5.22, respectively. However, when I print the number of clusters per object, dp has 5 clusters (which I would expect given the average of 5.26), but dp2 has only 2. Can you explain why that is?
print(dp)
Dirichlet process object run for 5000 iterations.
Mixing distribution mvnormal
Base measure parameters c(0, 0), c(1, 0, 0, 1), 2, 2
Alpha Prior parameters 2, 4
Conjugacy conjugate
Sample size 28
Mean number of clusters 5.26
Median alpha 0.84
print(dp2)
Dirichlet process object run for 5000 iterations.
Mixing distribution mvnormal
Base measure parameters c(0, 0), c(1, 0, 0, 1), 2, 2
Alpha Prior parameters 2, 4
Conjugacy conjugate
Sample size 28
Mean number of clusters 5.22
Median alpha 0.83
dp$numberClusters
[1] 5
dp2$numberClusters
[1] 2
Hey, thanks for using the package.
dp$numberClusters
is the number of clusters at the last iteration of the fitting process, so it will vary; it's just one sample from the DP. As it's a Bayesian model, each data point has a distribution over likely cluster assignments, rather than just one label.
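For example, you can see how much the cluster count moves around by looking at dp$labelsChain, which holds the label vector saved at each iteration. A quick sketch:
# Number of distinct clusters at each saved iteration
clustersPerIteration <- sapply(dp$labelsChain, function(labels) length(unique(labels)))
table(clustersPerIteration)  # how often each cluster count appears in the chain
mean(clustersPerIteration)   # comparable to the "Mean number of clusters" in the summary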
Does that help?
Hi, thanks for your answer. So if I understood correctly, dp$numberClusters refers only to the last iteration of the MCMC fit. Given that, how can I determine the most likely cluster label for a given data point? In your blog post "Dirichlet Process Cluster Probabilities," you use
numClusters <- dp$numberClusters
clusterLabelProbs <- matrix(nrow=nrow(faithfulTrans), ncol=numClusters + 1)
to determine the required number of columns. As you can see, in my case this won't work, since the number of clusters varies strongly between iterations.
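One way around this might be to size the matrix by the largest label that appears anywhere in the chain (just my own guess, not from your post):
# Use the largest cluster label seen across all iterations
numClusters <- max(unlist(dp$labelsChain))
clusterLabelProbs <- matrix(nrow = nrow(faithfulTrans), ncol = numClusters + 1)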
Given the n iterations of cluster labels contained in dp$labelsChain, does it make sense to always extract the i-th element (for the i-th data point) from each iteration and use the most frequently occurring cluster assignment?
Something like:
library(purrr) # provides map() to extract the i-th element from each vector in a list
labels1 <- list()
for (i in seq_along(dp$clusterLabels)) {
  # Tally the labels assigned to data point i across all iterations
  labelCounts <- sort(table(unlist(map(dp$labelsChain, i))), decreasing = TRUE)
  # Keep the most frequent label
  labels1[[as.character(i)]] <- as.numeric(names(labelCounts)[1])
}
unlist(labels1)
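For comparison, the same majority vote as a one-liner per point, checked against the final-iteration labels (again just a sketch):
# Compact version of the majority vote above
mostLikelyLabels <- sapply(seq_along(dp$clusterLabels), function(i) {
  as.numeric(names(which.max(table(unlist(map(dp$labelsChain, i))))))
})
# Proportion of points where the final iteration agrees with the majority vote
mean(mostLikelyLabels == dp$clusterLabels)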
Moreover, I have a similar problem with the ClusterLabelPredict function. When I apply it twice to the same newData and the same fitted dp object, I get varying results. I assume this is also because each call is a random draw from the posterior. However, when I repeat the prediction several times and take the most frequent cluster label, the prediction stabilizes. Something like:
## Generating prediction for first two rows of the training set
prediction1 <- replicate(1000, list(ClusterLabelPredict(dp, newData = dp$data[1:2,])))
prediction2 <- replicate(1000, list(ClusterLabelPredict(dp, newData = dp$data[1:2,])))
## Extracting frequencies for newData item 2
t1 <- sort(table(unlist(map(map(prediction1, 1),2))), decreasing = T)[1]
t2 <- sort(table(unlist(map(map(prediction2, 1),2))), decreasing = T)[1]
Or is there any more convenient way to extract the most likely cluster label?
Yeah, everything above is correct. ClusterLabelPredict is just one sample from the posterior, so you need to do multiple draws, like you have.
Unfortunately there's nothing more convenient, but your code looks fine.
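If you want more than just the single most frequent label, the same draws can also give you label probabilities per point. A quick sketch reusing your indexing (where the first element of each draw holds the predicted label vector):
# Empirical posterior probability of each cluster label for newData item 2
labelDraws2 <- unlist(map(map(prediction1, 1), 2)) # label of item 2 in each of the 1000 draws
prop.table(table(labelDraws2))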