php-kmeans

Performance improvement

Open DonaldTrump88 opened this issue 3 years ago • 4 comments

I am clustering about 50K locations. Each cluster should contain about 20 locations or fewer. Unfortunately, the algorithm takes about 1 hour to finish. My initial guess is that the repeated distance calculations make it slow, and if I add a correct distance formula based on latitude/longitude it will be even slower. If you agree, adding a distance matrix would help to optimize it. Here is a similar example in DBSCAN: https://github.com/bhavikm/DBSCAN-clustering/blob/master/index.php The matrix could be computed when the user calls solve.
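For illustration, one common choice for a "correct" latitude/longitude distance is the haversine (great-circle) formula. The sketch below is in Python rather than PHP and is not part of php-kmeans; the function name `haversine_km` and the mean Earth radius constant are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Each call costs several trigonometric evaluations, which is why it is slower than a plain Euclidean distance. Whether caching helps depends on what is cached: a full pairwise matrix for 50,000 points would hold roughly 1.25 billion distinct pairs, and k-means mainly needs point-to-centroid distances, which change whenever the centroids move.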

DonaldTrump88 • May 19 '21 12:05

I have read a few positive reviews of mini-batch clustering: https://papers.nips.cc/paper/2016/file/8d317bdcf4aafcfc22149d77babee96d-Paper.pdf https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
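As a rough illustration of the mini-batch idea, here is a minimal pure-Python sketch. It is neither the scikit-learn implementation nor anything in php-kmeans; the names `mini_batch_kmeans`, `nearest`, and `squared_distance`, and all default parameters, are hypothetical. It uses the per-centroid learning rate (1 / number of points seen by that centroid) commonly described for mini-batch k-means, and plain Euclidean distance.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length coordinate sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mini_batch_kmeans(points, k, batch_size=100, iterations=50, seed=0):
    """Return k centroids estimated from small random batches of `points`."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # how many points each centroid has absorbed so far
    for _ in range(iterations):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch against the centroids as they stand now.
        assignments = [nearest(p, centroids) for p in batch]
        for p, c in zip(batch, assignments):
            counts[c] += 1
            eta = 1.0 / counts[c]  # per-centroid learning rate
            # Move the centroid a fraction eta of the way toward the point.
            centroids[c] = [(1 - eta) * m + eta * x
                            for m, x in zip(centroids[c], p)]
    return centroids
```

Each iteration touches only `batch_size` points instead of the full data set, so the cost per iteration is about `batch_size * k` distance evaluations. The trade-off is that the result approximates, rather than matches, what full k-means would produce.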

DonaldTrump88 • May 19 '21 14:05

Thanks @Ninja-007, I'll give it a look. If you have suggestions for the implementation, feel free to start a PR.

bdelespierre • May 20 '21 10:05