[WIP] Coreset for Kmeans and GaussianMixture clustering
Introduce a Coreset meta-estimator that samples a subset of the original data (preserving its geometric shape) and passes this sample to a scikit-learn estimator.
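For context, here is a minimal usage sketch. The import path and constructor signature are my reading of this PR's draft API and may differ from the final version:

```python
import dask.array as da
from sklearn.cluster import KMeans
from dask_ml.cluster import Coreset  # import path assumed from this PR

# An out-of-memory-friendly input: a chunked dask array
X = da.random.random((150_000, 2), chunks=(10_000, 2))

# Sample a small representative subset of X, then fit a plain
# scikit-learn KMeans on that subset
est = Coreset(KMeans(n_clusters=3))
est.fit(X)
```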
TODO
- [x] unit tests
- [x] documentation
- [x] `.fit()` returns the scikit-learn estimator; make it return `self` (the dask meta-estimator)
- [ ] validate with plots and comparisons with other clustering algorithms, as in this page from scikit-learn
- [ ] compare Coreset + `sklearn.KMeans` with `dask_ml.KMeans`
- [ ] check behaviour with the `refit` param in `GridSearchCV`
- [ ] try different coreset sizes for hard vs. soft clustering, as mentioned in Section 6
For now, I have made this comparison, taking this sklearn doc as a baseline:

The original data has 150k rows and 2 features; 200 points are sampled with the Coreset class.
Notes: I experienced very poor running times with small chunks (`chunks=(100, 2)`) when building the `dask.array`. The current graph shows running times with a chunk size of `(10_000, 2)`.
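For reference, a sketch of how this setup could be reproduced. The `Coreset` import path and constructor are assumptions based on this PR's draft API, and the 200-point coreset size would be configured through the estimator's parameters:

```python
import dask.array as da
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from dask_ml.cluster import Coreset  # import path assumed from this PR

# 150k rows, 2 features, as in the comparison above
X_np, _ = make_blobs(n_samples=150_000, n_features=2, centers=3, random_state=0)

# chunk size matters: (10_000, 2) performed much better than (100, 2)
X = da.from_array(X_np, chunks=(10_000, 2))

est = Coreset(KMeans(n_clusters=3))
est.fit(X)
```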
Here is a global view of the learning process

Thanks for continuing to work on this.
> I experienced very poor running times with small chunks (`chunks=(100, 2)`) when building the `dask.array`.
Depending on the scheduler, there's roughly 10-200 microseconds of overhead per task (https://docs.dask.org/en/latest/institutional-faq.html?highlight=overhead#how-well-does-dask-scale-what-are-dask-s-limitations).
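To illustrate why small chunks hurt here: the smaller the chunks, the more tasks in the graph, and each task pays that scheduler overhead. A quick way to see the difference:

```python
import dask.array as da

# Same 150k x 2 array, two different chunk sizes
small = da.random.random((150_000, 2), chunks=(100, 2))
large = da.random.random((150_000, 2), chunks=(10_000, 2))

print(small.numblocks)  # (1500, 1) -> thousands of tasks per operation
print(large.numblocks)  # (15, 1)   -> a handful of tasks per operation
```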
Can you think of anyone from the scikit-learn side who would be able to review this? I'm not too familiar with the coreset idea / Gaussian mixing. When you're ready, I can take a look at things from the Dask side.
@TomAugspurger my pleasure. Concerning the review from the sklearn side, I think that Jérémie du Boisberranger worked on a set of runtime/memory benchmarks for KMeans, whose results were presented at the 2019 scikit-learn consortium.
Concerning interactions with `sklearn.GaussianMixture`, I guess we can ask @gmaze, who initially raised the issue on adding Gaussian Mixture Models. He is certainly better informed than I am on how we should validate that our Coreset class can be used (with care?) with GaussianMixture. My only intuition is that `sklearn.GaussianMixture` uses KMeans internally to initialize the cluster centers (see the `init_params` parameter). The plot at the top of this page visually confirms that the `Coreset(GaussianMixture)` instance succeeds in clustering the flattened forms in the middle, where KMeans fails.
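A sketch of the GaussianMixture case discussed above (the `Coreset` import path is again my assumption from this PR):

```python
import dask.array as da
from sklearn.mixture import GaussianMixture
from dask_ml.cluster import Coreset  # import path assumed from this PR

X = da.random.random((150_000, 2), chunks=(10_000, 2))

# scikit-learn's GaussianMixture initializes its centers with k-means
# by default (init_params="kmeans"); this internal use of KMeans on the
# coreset sample is the interaction that needs validating.
gmm = GaussianMixture(n_components=2, init_params="kmeans")
Coreset(gmm).fit(X)
```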
Also, I would need help on how to properly compare runtimes with sklearn. My current benchmark takes the `sklearn.KMeans` class as an opponent, but sklearn has `MiniBatchKMeans`, which is meant to be faster. The goal is to be sure that:

- `Coreset(KMeans()).fit()` runs as fast as `MiniBatchKMeans` for datasets that fit in memory
- `Coreset(KMeans()).fit()` can scale to out-of-memory datasets and still deliver good-quality clustering
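A minimal timing sketch for the in-memory comparison, to make the benchmark concrete (the exact design is open for discussion; the `Coreset` import path is assumed from this PR):

```python
import time

import dask.array as da
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from dask_ml.cluster import Coreset  # import path assumed from this PR

X_np, _ = make_blobs(n_samples=150_000, n_features=2, centers=3, random_state=0)
X = da.from_array(X_np, chunks=(10_000, 2))

for name, est, data in [
    ("MiniBatchKMeans", MiniBatchKMeans(n_clusters=3), X_np),
    ("Coreset(KMeans)", Coreset(KMeans(n_clusters=3)), X),
]:
    start = time.perf_counter()
    est.fit(data)
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```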