
[MRG] GaussianMixture with BIC/AIC


Reference Issues/PRs

Fixes #19338. Automates the model selection performed manually in the Gaussian Mixture Model Selection example. Adds a basic GaussianMixtureIC estimator that does not initialize with agglomerative clustering, as discussed in #19562.

What does this implement/fix? Explain your changes.

It automatically selects the best Gaussian mixture model based on BIC or AIC among a set of candidate models parameterized by the following (a usage sketch is shown after the list):

  • Covariance constraints

  • Number of components
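
For concreteness, here is a minimal usage sketch. Since GaussianMixtureIC is the estimator proposed in this PR, the import path, the parameter names (min_components, max_components, criterion), and the fitted behavior shown are assumptions about the proposed API, not released scikit-learn:

```python
# Hypothetical usage of the proposed estimator; the import path and the
# parameter names below are assumptions about this PR's API, not part of
# any released scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixtureIC  # available on this branch only

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

gmic = GaussianMixtureIC(
    min_components=1,   # assumed name: lower end of the component grid
    max_components=6,   # assumed name: upper end of the component grid
    criterion="bic",    # assumed name: information criterion, "bic" or "aic"
)
labels = gmic.fit(X).predict(X)  # predictions from the best model on the grid
```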

Any other comments?

tingshanL, Jun 30 '23

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit 3ea60f2.

github-actions[bot], Jun 30 '23

@amueller @NicolasHug What do you think?

jovo, Jul 23 '24

@jjerphan @ogrisel This PR is designed to greatly simplify users' lives when doing something like this tutorial: https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html, and it is highly simplified relative to other recently proposed PRs, such as https://github.com/scikit-learn/scikit-learn/pull/19562. We are grateful for your feedback and look forward to finalizing this.
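
For reference, the selection in that tutorial boils down to roughly the following boilerplate (only released scikit-learn APIs are used here), which the PR collapses into a single estimator:

```python
# Condensed version of the manual model selection in the linked example,
# using only released scikit-learn APIs.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

def gmm_bic_score(estimator, X):
    # GridSearchCV maximizes scores, so return the negative BIC.
    return -estimator.bic(X)

param_grid = {
    "n_components": range(1, 7),
    "covariance_type": ["spherical", "tied", "diag", "full"],
}
grid = GridSearchCV(GaussianMixture(), param_grid=param_grid, scoring=gmm_bic_score)
grid.fit(X)
print(grid.best_params_)  # the winning n_components and covariance_type
```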

jovo, Jul 30 '24

@adam2392 Hey Adam - Curious to hear what you think about this?

jovo, Aug 13 '24


Hello Adam @adam2392,

We have reviewed and addressed all feedback provided here. Below is a summary of the primary comments and our responses:

  • "significantly richer init capabilities than the underlying GaussianMixture class in scikit-learn": currently, GaussianMixtureIC selects models only based on covariance constraints and the number of components, the same as the GM model selection example in the library.
  • "orders of magnitude slower than GaussianMixture": using current GaussianMixtureIC to run the same task as in the example does not take more time. On my computer, both fitting procedures took around 1.3s.
  • "to configure continuous integration to run the tests, including a test to run the check_estimator function to make sure that the code stays compatible with future scikit-learn versions": all checks, including the tests related to check_estimator, have passed.
  • "joblib-based multi-threading": it is no longer in the code.
  • "regularization": also not included in the code anymore.

With these revisions, we believe the code is ready for the next review. Please let us know if there is anything further we should adjust. Thank you very much for your time and insights!

tingshanL, Oct 30 '24

Hey @tingshanL okay thanks!

I think this will require a discussion among some of the more senior maintainers on the team. Personally, I do see the use of an mclust-like algorithm within the Python/sklearn ecosystem, since I have used it in the past in R. However, we'll have to see what others say… I know they're busy, and including new models is not super easy, so thank you for the patience.

Out of curiosity (perhaps you can just summarize in the PR description if we want to reserve this space for other discussion), how does this differ from mclust? If there are significant differences, what would need to be done to make this a 1-to-1 match? This is just some high-level info I'm curious about to guide the discussion, so feel free not to spend too much time on this question.

adam2392, Oct 31 '24

Hello @adam2392! Thanks again for your earlier feedback on this PR.

I’ve updated it so that GaussianMixtureIC remains a small, focused estimator for automatic model selection, but now also incorporates a more robust initialization strategy:

  • GaussianMixtureIC still automates the existing example: it runs a grid search over n_components and covariance_type and selects the best GaussianMixture by AIC/BIC.
  • It now uses an internal Mahalanobis–Ward initialization for the underlying GaussianMixture, while keeping the public GaussianMixture API unchanged (a sketch of the idea appears after this list). This initialization has been effective on anisotropic mixtures in our AutoGMM work (Liu et al., arXiv:1909.02688), and here we include a minimal version tailored to GaussianMixtureIC.
  • examples/mixture/plot_gmm_selection.py has an extra “crossing double-cigar” section that visually and quantitatively (ARI) contrasts GaussianMixture vs GaussianMixtureIC to illustrate the effect of the initialization.
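
To make the initialization idea concrete, here is a minimal sketch of a Mahalanobis–Ward step: whiten the data so that Euclidean distance equals Mahalanobis distance, run Ward linkage in the whitened space, and seed GaussianMixture from the resulting cluster means via the public means_init parameter. This is illustrative only; the internal implementation in the PR may differ in detail (e.g., which covariance is used for whitening):

```python
# Illustrative Mahalanobis-Ward initialization; the PR's internal
# implementation may differ in detail.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

def mahalanobis_ward_gmm(X, n_components):
    # Whiten with the Cholesky factor of the overall covariance, so that
    # Euclidean distance in the whitened space equals Mahalanobis distance in X.
    L = np.linalg.cholesky(np.cov(X, rowvar=False))
    X_white = np.linalg.solve(L, X.T).T

    # Ward linkage on the whitened data gives model-aware hard clusters.
    labels = AgglomerativeClustering(
        n_clusters=n_components, linkage="ward"
    ).fit_predict(X_white)

    # Seed the mixture with the Ward cluster means (public API only).
    means_init = np.vstack(
        [X[labels == k].mean(axis=0) for k in range(n_components)]
    )
    return GaussianMixture(n_components=n_components, means_init=means_init).fit(X)
```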

Regarding your question about mclust: conceptually, GaussianMixtureIC is mclust-style in that it fits Gaussian mixtures over a grid of numbers of components and covariance types and selects by BIC/AIC, but it is intentionally narrower. To move closer to a 1-to-1 mclust match, a larger follow-up effort would likely involve:

  • extending the core GaussianMixture implementation beyond the four covariance types currently supported ("spherical", "diag", "tied", "full");
  • adding more model-based hierarchical clustering schemes (the Mahalanobis–Ward step in this PR is a small, model-aware initialization in that spirit);
  • incorporating eigenvalue-based covariance regularization similar to what we experimented with in AutoGMM (a hedged sketch follows this list).
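
On the last point, a hedged sketch of what eigenvalue-based regularization can look like: floor the eigenvalues of an estimated covariance so that no component collapses onto a degenerate subspace. The exact rule we used in AutoGMM may differ; this is only illustrative:

```python
# Illustrative eigenvalue floor for a covariance matrix; the exact
# regularization rule in AutoGMM may differ.
import numpy as np

def floor_eigenvalues(cov, floor=1e-6):
    # Eigendecompose, clip the eigenvalues from below, and reconstruct.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, floor, None)
    return (eigvecs * eigvals) @ eigvecs.T
```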

For this PR I’ve focused on a smaller, additive step that stays within the existing GaussianMixture API, but I’m very open to suggestions if you think a closer mclust-style feature set would be more appropriate.

tingshanL, Nov 17 '25