[WIP] Coreset for Kmeans and GaussianMixture clustering
Introduce a Coreset meta-estimator that samples a subset of the original data (preserving its geometric shape) and passes this sample to a scikit-learn estimator.
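For context, here is a minimal usage sketch. The import path and constructor signature are my reading of this PR's draft API and may differ from the final version:

```python
import dask.array as da
from sklearn.cluster import KMeans
from dask_ml.cluster import Coreset  # import path assumed from this PR

# An out-of-memory-friendly input: a chunked dask array
X = da.random.random((150_000, 2), chunks=(10_000, 2))

# Sample a small representative subset of X, then fit a plain
# scikit-learn KMeans on that subset
est = Coreset(KMeans(n_clusters=3))
est.fit(X)
```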
TODO
- [x] unit tests
- [x] documentation
- [x] `.fit()` returns the scikit-learn estimator; make it return `self` (the dask meta-estimator)
- [ ] validate with plots and comparisons with other clustering algorithms, as in this page from scikit-learn
- [ ] compare Coreset + `sklearn.KMeans` with `dask_ml.KMeans`
- [ ] check behaviour with the `refit` param in `GridSearchCV`
- [ ] try different coreset sizes for hard vs. soft clustering, as mentioned in Section 6
For now, I have made this comparison, taking this sklearn doc as a baseline:

The original data has 150k rows and 2 features; 200 points are sampled with the Coreset class.
Notes: I experienced very poor running times with small chunks (`chunks=(100, 2)`) when building the `dask.array`. The current graph shows running times with a chunk size of `(10_000, 2)`.
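For reference, a sketch of how this setup could be reproduced. The `Coreset` import path and constructor are assumptions based on this PR's draft API, and the 200-point coreset size would be configured through the estimator's parameters:

```python
import dask.array as da
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from dask_ml.cluster import Coreset  # import path assumed from this PR

# 150k rows, 2 features, as in the comparison above
X_np, _ = make_blobs(n_samples=150_000, n_features=2, centers=3, random_state=0)

# chunk size matters: (10_000, 2) performed much better than (100, 2)
X = da.from_array(X_np, chunks=(10_000, 2))

est = Coreset(KMeans(n_clusters=3))
est.fit(X)
```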
Here is a global view of the learning process

Thanks for continuing to work on this.
> I experienced very poor running times with small chunks (`chunks=(100, 2)`) when building the `dask.array`.
Depending on the scheduler, there's roughly 10-200 microseconds of overhead per task (https://docs.dask.org/en/latest/institutional-faq.html?highlight=overhead#how-well-does-dask-scale-what-are-dask-s-limitations).
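To illustrate why small chunks hurt here: the smaller the chunks, the more tasks in the graph, and each task pays that scheduler overhead. A quick way to see the difference:

```python
import dask.array as da

# Same 150k x 2 array, two different chunk sizes
small = da.random.random((150_000, 2), chunks=(100, 2))
large = da.random.random((150_000, 2), chunks=(10_000, 2))

print(small.numblocks)  # (1500, 1) -> thousands of tasks per operation
print(large.numblocks)  # (15, 1)   -> a handful of tasks per operation
```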
Can you think of anyone from the scikit-learn side who would be able to review this? I'm not too familiar with the coreset idea / Gaussian mixing. When you're ready, I can take a look at things from the Dask side.
@TomAugspurger my pleasure. Concerning the review from the sklearn side, I think that Jérémie du Boisberranger worked on a set of runtime/memory benchmarks for KMeans, whose results were presented at the 2019 scikit-learn consortium.
Concerning interactions with `sklearn.GaussianMixture`, I guess we can ask @gmaze, who initially raised the issue on adding Gaussian Mixture Models. He is certainly better informed than I am on how we should validate that our Coreset class can be used (with care?) with GaussianMixture. My only intuition is that `sklearn.GaussianMixture` uses KMeans internally to initialize the cluster centers (see the `init_params` parameter). The plot at the top of this page visually confirms that the `Coreset(GaussianMixture)` instance succeeds in clustering the flattened forms in the middle, where KMeans fails.
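A sketch of the GaussianMixture case discussed above (the `Coreset` import path is again my assumption from this PR):

```python
import dask.array as da
from sklearn.mixture import GaussianMixture
from dask_ml.cluster import Coreset  # import path assumed from this PR

X = da.random.random((150_000, 2), chunks=(10_000, 2))

# scikit-learn's GaussianMixture initializes its centers with k-means
# by default (init_params="kmeans"); this internal use of KMeans on the
# coreset sample is the interaction that needs validating.
gmm = GaussianMixture(n_components=2, init_params="kmeans")
Coreset(gmm).fit(X)
```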
Also, I would need help on how to properly compare runtimes with sklearn. My current benchmark takes the `sklearn.KMeans` class as an opponent, but sklearn has `MiniBatchKMeans`, which is meant to be faster. The goal is to be sure that:

- `Coreset(KMeans()).fit()` runs as fast as `MiniBatchKMeans` for datasets that fit in memory
- `Coreset(KMeans()).fit()` can scale to out-of-memory datasets and still deliver good-quality clustering
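A minimal timing sketch for the in-memory comparison, to make the benchmark concrete (the exact design is open for discussion; the `Coreset` import path is assumed from this PR):

```python
import time

import dask.array as da
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from dask_ml.cluster import Coreset  # import path assumed from this PR

X_np, _ = make_blobs(n_samples=150_000, n_features=2, centers=3, random_state=0)
X = da.from_array(X_np, chunks=(10_000, 2))

for name, est, data in [
    ("MiniBatchKMeans", MiniBatchKMeans(n_clusters=3), X_np),
    ("Coreset(KMeans)", Coreset(KMeans(n_clusters=3)), X),
]:
    start = time.perf_counter()
    est.fit(data)
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```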