
Consider implementing multiple starts for Mclust

Open cjvanlissa opened this issue 5 years ago • 10 comments

cjvanlissa avatar Dec 06 '19 08:12 cjvanlissa

I concur this is important to consider. One distinguishing feature of mclust is that the starts are determined through hierarchical clustering; does that approach (rather than random starts) have merit, too?

jrosen48 avatar Dec 11 '19 01:12 jrosen48

You're right; I don't know whether this avoids local maxima entirely, or whether you just end up consistently at the same (local) maximum. I'll read up on that.

cjvanlissa avatar Dec 11 '19 07:12 cjvanlissa

This paper is relevant: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4776768/

It shows that the SVD-based MBHAC, which mclust uses by default to obtain starting values, leads to better partitioning than the multiple short runs implemented by Mplus. Thus, I think all we need to do is 1) ensure that SVD is always used, 2) communicate to users which method is used to ensure convergence to a global maximum, and 3) update the checklist criterion to focus on addressing convergence, rather than requiring the multiple-short-runs approach.
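Not tidyLPA code, just a minimal sketch of what point 1 could look like, assuming the current mclust API (the `hcUse` option and the `use` argument of `hc()`; names and defaults may differ in older releases):

```r
library(mclust)

# Check which data transformation MBHAC uses for EM starting values;
# recent mclust versions default to "SVD", older ones to "VARS".
mclust.options("hcUse")

# Force SVD-transformed MBHAC starts globally before fitting
mclust.options(hcUse = "SVD")
fit <- Mclust(iris[, 1:4], G = 1:6)

# Equivalently, compute the hierarchical starts explicitly and pass them
# to Mclust via the initialization argument
hcTree <- hc(iris[, 1:4], modelName = "VVV", use = "SVD")
fit2 <- Mclust(iris[, 1:4], G = 1:6, initialization = list(hcPairs = hcTree))
```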

cjvanlissa avatar Dec 11 '19 07:12 cjvanlissa

Just catching this, and I agree both that it is absolutely worthwhile to explore and that #2 and #3 are the crucial aspects to convey to end users.

jhaltiga avatar Jan 06 '20 08:01 jhaltiga

some relevant discussion here; h/t @jhaltiga for engaging with the Muthens about this

jrosen48 avatar Jan 19 '20 17:01 jrosen48

This is related to the email I just sent you, @cjvanlissa

jrosen48 avatar Oct 14 '20 16:10 jrosen48

Is it possible that, even with hierarchical clustering, the EM algorithm could still reach a different local maximum? My intuition is that it cannot, that it is deterministic. But I may be wrong.

jrosen48 avatar Oct 14 '20 16:10 jrosen48

From that paper you referenced, @cjvanlissa:

The mixture model-based approach to clustering provides a firm statistical framework for unsupervised learning. The EM algorithm is typically used for parameter estimation. However, because of the many local maxima of the likelihood function, an important problem for getting sensible maximum likelihood estimates is to obtain reasonable starting points. A computationally efficient approach to EM initialisation is based on the partitions obtained from agglomerative hierarchical clustering.

jrosen48 avatar Oct 14 '20 16:10 jrosen48

Sorry for all the messages - it seems that in certain cases - especially those involving coarse data with little variability - different local maxima can be reached, and it has to do with how the hierarchical clustering resolves ties.

In mclust the EM algorithm is initialised using the partitions obtained from model-based hierarchical agglomerative clustering (MBHAC). In this approach, hierarchical clusters are obtained by recursively merging the two clusters that provide the smallest decrease in the classification likelihood for Gaussian mixture model (Banfield and Raftery, 1993). Efficient numerical algorithms have been discussed by Fraley (1998). Using MBHAC is particularly convenient because the underlying probabilistic model is shared by both the initialisation step and the model fitting step. Furthermore, MBHAC is also computationally advantageous because a single run provides the basis for initialising the EM algorithm for any number of mixture components and component-covariances parameterisations. Although there is no guarantee that the EM initialized by MBHAC will converge to the global optimum, it often provides reasonable starting points.

A problem with the MBHAC approach may arise in the presence of coarse data, resulting from the discrete nature of the data or from continuous data that are rounded when measured. In this case, ties must be broken by choosing the pair of entities that will be merged. This is often done at random, but regardless of which method is adopted for breaking ties, this choice can have important consequences because it changes the clustering of the remaining observations. Moreover, the final EM solution may depend on the ordering of the variables.

Consider the Flea beetles data available in package tour. As can be seen from Figure 12, the observed values are rounded (to the nearest integer presumably) and there is a large overplotting of points.

By reversing the order of the variables in the fit of mod2, the initial partitions differ due to ties in the data, so the EM algorithm converges to different solutions of the same EEE model with 5 components. The second solution has a higher BIC and better accuracy.

In situations like this we may want to assess the stability of results by randomly starting the EM algorithm. The function randomPairs() may be called to obtain a random hierarchical structure suitable to be used as initial clustering partition.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5096736/
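A hedged sketch of the kind of stability check described above (the data here are just a placeholder, not the flea beetles data; the random-start helper is `randomPairs()` in the quoted paper and older mclust releases, `hcRandomPairs()` in current ones):

```r
library(mclust)

X <- scale(iris[, 1:4])  # placeholder data standing in for coarse/rounded data

# Sensitivity to variable ordering: MBHAC ties can be broken differently
fit_fwd <- Mclust(X)
fit_rev <- Mclust(X[, rev(seq_len(ncol(X)))])
c(BIC_forward = fit_fwd$bic, BIC_reversed = fit_rev$bic)

# Stability check: restart EM from random hierarchical partitions
fits <- lapply(1:10, function(s) {
  Mclust(X, initialization = list(hcPairs = hcRandomPairs(X, seed = s)))
})
sapply(fits, function(f) f$bic)  # similar BICs suggest a stable solution
```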

jrosen48 avatar Oct 14 '20 16:10 jrosen48

@joshuarosenberg I think you cracked it. This is likely the main contributor to the cases of non-replication reported by users, as they all conform to the description of "coarse data with little variability". Moreover, I cannot exclude the possibility that randomness is introduced at different stages by the estimator. This could also lead to different solutions, but again, predominantly in the case where the data are ill-suited to clustering in the first place.
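To help rule out estimator-level randomness, a small hypothetical check (the helper `check_replication()` is ours, not part of tidyLPA or mclust) could refit the same model repeatedly and compare the resulting partitions:

```r
library(mclust)

# Refit the same model several times and compare classifications; an
# adjusted Rand index of 1 for every refit indicates the pipeline is
# deterministic for these data.
check_replication <- function(data, n_reps = 5, ...) {
  fits <- lapply(seq_len(n_reps), function(i) Mclust(data, ...))
  ref <- fits[[1]]$classification
  sapply(fits[-1], function(f) adjustedRandIndex(ref, f$classification))
}

check_replication(iris[, 1:4], G = 3)
```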

cjvanlissa avatar Oct 15 '20 08:10 cjvanlissa