scikit-learn
[MRG] GaussianMixture with BIC/AIC
Reference Issues/PRs
Fixes #19338. Automates the selection performed in the Gaussian Mixture Model Selection example. Adds a basic `GaussianMixtureIC` estimator, without the agglomerative-clustering initialization discussed in #19562.
What does this implement/fix? Explain your changes.
It automatically selects the best Gaussian mixture model based on BIC or AIC among a set of models parameterized by (a usage sketch follows the list):

- Covariance constraints
- Number of components
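For illustration, intended usage could look like the following. This is a minimal sketch: the constructor arguments (`max_n_components`, `criterion`) and the fitted attributes shown are assumptions made for the example, not necessarily the final API of this PR.

```python
from sklearn.datasets import make_blobs

# Hypothetical import path; GaussianMixtureIC is the estimator proposed here.
from sklearn.mixture import GaussianMixtureIC

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Searches over numbers of components and covariance types, keeping the
# GaussianMixture with the lowest information criterion (BIC here).
gm_ic = GaussianMixtureIC(max_n_components=6, criterion="bic")  # args assumed
labels = gm_ic.fit_predict(X)
print(gm_ic.n_components_, gm_ic.covariance_type_)  # attribute names assumed
```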
Any other comments?
@amueller @NicolasHug What do you think?
@jjerphan @ogrisel This PR is designed to greatly simplify users' lives when doing something like this tutorial: https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html, and it is much simpler than other recently proposed PRs, such as https://github.com/scikit-learn/scikit-learn/pull/19562. We would be grateful for your feedback and look forward to finalizing this.
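For context, this is roughly the loop the linked tutorial asks users to write today, using only the existing `GaussianMixture` API (toy data stands in for the tutorial's dataset); `GaussianMixtureIC` would replace all of it with a single `fit` call:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data standing in for the tutorial's generated dataset.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

lowest_bic, best_gmm = np.inf, None
for covariance_type in ["spherical", "tied", "diag", "full"]:
    for n_components in range(1, 7):
        gmm = GaussianMixture(
            n_components=n_components,
            covariance_type=covariance_type,
            random_state=0,
        ).fit(X)
        bic = gmm.bic(X)  # lower is better
        if bic < lowest_bic:
            lowest_bic, best_gmm = bic, gmm
```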
@adam2392 Hey Adam - Curious to hear what you think about this?
Hello Adam @adam2392,
We have reviewed and addressed all feedback provided here. Below is a summary of the primary comments and our responses:
- "significantly richer init capabilities than the underlying GaussianMixture class in scikit-learn": currently,
GaussianMixtureICselects models only based on covariance constraints and the number of components, the same as the GM model selection example in the library. - "orders of magnitude slower than GaussianMixture": using current
GaussianMixtureICto run the same task as in the example does not take more time. On my computer, both fitting procedures took around 1.3s. - "to configure continuous integration to run the tests, including a test to run the check_estimator function to make sure that the code stays compatible with future scikit-learn versions": all checks, including the tests related to
check_estimator, have passed. - "joblib-based multi-threading": it is no longer in the code.
- "regularization": also not included in the code anymore.
With these revisions, we believe the code is ready for the next review. Please let us know if there is anything further we should adjust. Thank you very much for your time and insights!
Hey @tingshanL okay thanks!
I think this will require a discussion among some of the more senior maintainers on the team. Personally, I do see the use of an mclust-like algorithm within the Python/sklearn ecosystem, since I have used it in the past in R. However, we'll have to see what others say… I know they're busy, and including new models is not super easy, so thank you for your patience.
Out of curiosity (perhaps you can just summarize in the PR description if we want to reserve the space for other discussion), how does this differ from mclust? If there are significant differences, what would need to be done to make this a 1-to-1 match? This is just some high-level info I'm curious about to guide the discussion, so feel free not to spend too much time on this question.
Hello @adam2392! Thanks again for your earlier feedback on this PR.
I’ve updated it so that GaussianMixtureIC remains a small, focused estimator for automatic model selection, but now also incorporates a more robust initialization strategy:
- `GaussianMixtureIC` still automates the existing example: it runs a grid search over `n_components` and `covariance_type` and selects the best `GaussianMixture` by AIC/BIC.
- It now uses an internal Mahalanobis–Ward initialization for the underlying `GaussianMixture`, while keeping the public `GaussianMixture` API unchanged. This initialization has been effective on anisotropic mixtures in our AutoGMM work (Liu et al., arXiv:1909.02688), and here we include a minimal version tailored to `GaussianMixtureIC`. A small sketch of the idea follows this list.
- `examples/mixture/plot_gmm_selection.py` has an extra "crossing double-cigar" section that visually and quantitatively (ARI) contrasts `GaussianMixture` vs `GaussianMixtureIC` to illustrate the effect of the initialization.
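To make the initialization idea concrete, here is a minimal sketch of a Mahalanobis–Ward-style initialization built only from the public scikit-learn API; it illustrates the general approach rather than the PR's actual implementation, and the helper name is hypothetical:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture


def mahalanobis_ward_gmm(X, n_components, covariance_type="full", seed=0):
    """Illustrative helper: seed a GaussianMixture with Ward clusters
    computed in a Mahalanobis-whitened space (not the code in this PR)."""
    # Whiten with the pooled covariance so that Euclidean distances in the
    # transformed space correspond to Mahalanobis distances in the original.
    eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
    whitener = eigvec / np.sqrt(np.clip(eigval, 1e-12, None))
    X_white = (X - X.mean(axis=0)) @ whitener

    # Ward linkage on the whitened data yields model-aware initial clusters.
    labels = AgglomerativeClustering(
        n_clusters=n_components, linkage="ward"
    ).fit_predict(X_white)

    # Use the cluster means (in the original space) to initialize EM.
    means = np.stack(
        [X[labels == k].mean(axis=0) for k in range(n_components)]
    )
    gmm = GaussianMixture(
        n_components=n_components,
        covariance_type=covariance_type,
        means_init=means,
        random_state=seed,
    )
    return gmm.fit(X)
```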
Regarding your question about mclust: conceptually, GaussianMixtureIC is mclust-style in that it fits Gaussian mixtures over a grid of numbers of components and covariance types and selects by BIC/AIC, but it is intentionally narrower. To move closer to a 1-to-1 mclust match, a larger follow-up effort would likely involve
- extending the core `GaussianMixture` implementation beyond the four covariance types currently supported (`"spherical"`, `"diag"`, `"tied"`, `"full"`);
- adding more model-based hierarchical clustering schemes (the Mahalanobis–Ward step in this PR is a small, model-aware initialization in that spirit);
- incorporating eigenvalue-based covariance regularization similar to what we experimented with in AutoGMM (a short sketch of the general idea follows this list).
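To illustrate that last point, one common form of eigenvalue-based regularization floors the eigenvalues of an estimated covariance matrix so it stays well-conditioned when a component collapses onto a few points; this is a hypothetical sketch of the general idea, not the AutoGMM scheme:

```python
import numpy as np


def floor_covariance_eigenvalues(cov, floor=1e-3):
    """Illustrative sketch: clip small eigenvalues of a covariance matrix."""
    eigval, eigvec = np.linalg.eigh(cov)   # symmetric eigendecomposition
    eigval = np.clip(eigval, floor, None)  # enforce a minimum eigenvalue
    return (eigvec * eigval) @ eigvec.T    # rebuild V diag(w) V^T


# A nearly singular covariance becomes safely invertible.
cov = np.array([[1.0, 0.999], [0.999, 1.0]])
print(np.linalg.cond(floor_covariance_eigenvalues(cov)))
```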
For this PR I’ve focused on a smaller, additive step that stays within the existing GaussianMixture API, but I’m very open to suggestions if you think a closer mclust-style feature set would be more appropriate.