
k-means out of memory error on large data sets

Open jjlynch2 opened this issue 5 years ago • 4 comments

I'm looking to switch to Julia for my k-means clustering needs. However, I regularly run k-means on three-dimensional data sets with about 500,000 data points, and I typically ask for a number of clusters equal to roughly 10% of the points, i.e. around 50,000 clusters. I am unable to run this because it hits an out-of-memory error on a machine with 64 GB of RAM. Is there a way around this, or should I just develop my own k-means implementation in Julia for high performance?

jjlynch2 avatar Oct 30 '19 03:10 jjlynch2

You can try converting your data to Float32 to reduce the memory footprint.
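
For example, something along these lines (just a minimal sketch; here `X` stands in for your 3×500,000 data matrix with one point per column):

```julia
using Clustering

X = rand(3, 500_000)            # stand-in for your 3-D data, one point per column
X32 = Float32.(X)               # halves the memory used by the data itself
result = kmeans(X32, 50_000)    # same call as before; distances are computed in Float32
```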

wildart avatar Nov 12 '19 17:11 wildart

Are there any plans to provide a minibatch version, such as https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html ?

davidbp avatar Oct 28 '22 15:10 davidbp

@jjlynch2 the memory problem you mention happens because the implementation stores a 500,000×50,000 distance matrix when using pairwise. I am interested in making a PR to avoid this. For each of the 500,000 data points I think we only need its closest centroid at every iteration; there is no need to keep the distances from every data point to all centroids. Doing this would reduce the storage during learning from 500,000 × 50,000 distances to just 500,000 assignments.
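
A rough sketch of that idea (the function name `assign_nearest` and the matrices `X`/`C` are placeholders here, not the package's internals):

```julia
# Assign each point to its nearest centroid without materializing
# the full n_points × n_clusters distance matrix.
# X is a d×n data matrix, C is a d×k centroid matrix.
function assign_nearest(X::AbstractMatrix{T}, C::AbstractMatrix{T}) where {T<:Real}
    d, n = size(X)
    k = size(C, 2)
    assignments = Vector{Int}(undef, n)   # only n integers are kept, not n×k distances
    @inbounds for j in 1:n
        best_dist = typemax(T)
        best_idx = 0
        for c in 1:k
            dist = zero(T)
            for i in 1:d
                δ = X[i, j] - C[i, c]
                dist += δ * δ
            end
            if dist < best_dist
                best_dist = dist
                best_idx = c
            end
        end
        assignments[j] = best_idx
    end
    return assignments
end

# e.g. assign_nearest(rand(Float32, 3, 500_000), rand(Float32, 3, 50_000))
```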

Ideally it would be very useful to have the option to choose a backend implementation when fitting the k-means, so that users could opt into different trade-offs (maybe you care a lot about memory but not so much about speed, or maybe you want to maximize speed even at a higher memory cost, etc.).

davidbp avatar Apr 12 '23 09:04 davidbp