Clustering.jl
k-means out of memory error on large data sets
I'm looking to switch to Julia for my k-means clustering needs. However, I regularly run k-means on three-dimensional data sets with around 500,000 data points, typically asking for a number of clusters equal to about 10% of the points, i.e. roughly 50,000 clusters. I am unable to run this because it hits an out-of-memory error on a machine with 64 GB of RAM. Is there a way around this, or should I develop my own high-performance k-means implementation in Julia?
You can try converting your data to Float32 to reduce the memory footprint.
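For instance, a minimal sketch (the data here is just random points for illustration; in practice you would load your own 3 × n matrix):

```julia
using Clustering

# Illustrative data: 3 × 500_000 Float64 matrix, points stored as columns.
X = rand(3, 500_000)

# Converting to Float32 halves the memory used by the data and by any
# distance computations carried out in that element type.
X32 = Float32.(X)

result = kmeans(X32, 50_000; maxiter=100)
```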
Are there any plans to provide a mini-batch version, such as https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html ?
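For reference, something along these lines (a rough sketch of the standard mini-batch k-means update from Sculley's 2010 paper, not an existing Clustering.jl API; the function name and keyword arguments are made up for illustration):

```julia
using Random, Distances

# Rough sketch of mini-batch k-means: each iteration samples a small batch,
# assigns its points to the nearest centroid, and moves that centroid towards
# the point with a per-centroid learning rate. X is d × n (points as columns).
function minibatch_kmeans(X, k; batchsize=1_000, iters=100)
    d, n = size(X)
    C = X[:, randperm(n)[1:k]]            # initialize centroids from random points
    counts = zeros(Int, k)                # per-centroid update counts
    for _ in 1:iters
        batch = rand(1:n, batchsize)      # sample a mini-batch of point indices
        for i in batch
            # nearest centroid for this point
            j = argmin([sqeuclidean(view(X, :, i), view(C, :, c)) for c in 1:k])
            counts[j] += 1
            η = 1 / counts[j]             # per-centroid learning rate
            C[:, j] .= (1 - η) .* C[:, j] .+ η .* view(X, :, i)
        end
    end
    return C
end
```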
@jjlynch2 the memory problem you mention happens because the implementation stores a 500,000 × 50,000 distance matrix when using pairwise. I am interested in making a PR to avoid this. For each of the 500,000 data points I think we only need its closest centroid at every iteration; there is no need to keep the distance from every data point to all centroids. Doing this would reduce the storage from 500,000 × 50,000 distances to 500,000 × 1 (one value per data point) during learning.
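(For scale: a 500,000 × 50,000 Float64 matrix is about 200 GB on its own, which matches the out-of-memory error above.) A minimal sketch of the idea, not the package's actual internals (the function name and layout assumptions are mine):

```julia
using Distances

# Assign each point to its nearest centroid without materializing the full
# n_points × n_centroids distance matrix. X is d × n (points as columns),
# C is d × k (centroids as columns). Memory is O(n) instead of O(n * k).
function assign_nearest(X::AbstractMatrix, C::AbstractMatrix)
    n, k = size(X, 2), size(C, 2)
    assignments = Vector{Int}(undef, n)
    mindists = Vector{eltype(X)}(undef, n)
    for i in 1:n
        best_j, best_d = 0, typemax(eltype(X))
        for j in 1:k
            dij = sqeuclidean(view(X, :, i), view(C, :, j))   # squared Euclidean
            if dij < best_d
                best_d, best_j = dij, j
            end
        end
        assignments[i] = best_j
        mindists[i] = best_d
    end
    return assignments, mindists
end
```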
Ideally it would be useful to have the option to choose a backend implementation when fitting the k-means, so that users could opt for different trade-offs (perhaps you care a lot about memory but not so much about speed, or you want to maximize speed even at a higher memory cost, etc.).