
Reduce memory requirements in KMedoids

Open · rth opened this issue · 6 comments

KMedoids currently pre-computes a full distance matrix with pairwise_distances, resulting in large memory usage that makes it unsuitable for datasets with more than 20-50k samples.
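
To make the scale concrete, here is a quick back-of-the-envelope check (not part of the original report) of how large the precomputed float64 matrix gets:

import numpy as np

n = 50_000
# full n x n distance matrix stored as float64
print(f"{n * n * np.dtype(np.float64).itemsize / 2**30:.1f} GiB")  # ~18.6 GiB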

To improve the situation somewhat, the following approaches could be considered:

  • [ ] use pairwise_distances_chunked
  • [ ] make sure that for float32 input the distance matrix is also 32-bit (see the sketch after this list)
  • [ ] investigate re-computing distances in each iteration (https://github.com/scikit-learn-contrib/scikit-learn-extra/pull/12#discussion_r307731034). This would reduce the memory requirements at the cost of additional compute time; I'm not sure it would be worth it.
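
For the float32 item above, a minimal sketch (illustrative, not from the issue), assuming the chosen metric preserves float32 in scikit-learn, as euclidean does:

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.rand(10_000, 50).astype(np.float32)
D = pairwise_distances(X, metric="euclidean")
# a float32 matrix halves the footprint: ~0.37 GiB here vs ~0.75 GiB in float64
print(D.dtype, D.nbytes / 2**30)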

rth avatar Jul 29 '19 12:07 rth

Too good an idea not to be implemented. It would be especially worthwhile for binary data clustering with Manhattan distance, for example.

tooHotSpot avatar May 10 '20 13:05 tooHotSpot

+1, I've hit this snag and I'm having trouble thinking of a workaround (sampling from the data is a bit tough in my situation). I was hoping there was a way to continue training via fit or something similar.

investigate re-computing distances in each iteration (#12 (comment)). This would reduce the memory requirements at the cost of additional compute time; I'm not sure it would be worth it.

I vote worth it. Many clustering algorithms are super heavy on training anyway for many rows; it's inference we usually care about. At the least, if it works for now as a holdover, more people could use the package.

lefnire avatar Aug 04 '20 04:08 lefnire

I think pairwise_distances_chunked might still hit the memory wall. You'd need to collect the intermediate chunks over time, or get fancy with np.empty and setting slices. I tried both methods on my 75k rows and still get an error.

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

pdc = pairwise_distances_chunked(x, metric='cosine', working_memory=64)

# attempt 1: stack all chunks at once
# (wrap the generator in a list; newer NumPy rejects generators in vstack)
dists = np.vstack(list(pdc))

# attempt 2: pre-allocate and fill slice by slice
# (re-create pdc first if attempt 1 already consumed it; track a running
# offset because the last chunk may be smaller than the others)
dists = np.empty((x.shape[0], x.shape[0]))
start = 0
for chunk in pdc:
    stop = start + chunk.shape[0]
    dists[start:stop] = chunk
    start = stop

Both attempts still need the full n x n array, so the allocation itself fails:

dists = np.empty((x.shape[0], x.shape[0]))
MemoryError: Unable to allocate 43.0 GiB for an array with shape (75970, 75970) and data type float64

I haven't sat with the KMedoids code yet (hopefully I'll find time); maybe it doesn't need the full n x n matrix after the pairwise distances, and the chunks could be reduced in each iteration.

lefnire avatar Aug 04 '20 18:08 lefnire

pairwise_distances_chunked is useful where there is a "reduce operation" following the distance computation, such that the final memory required is smaller than n^2.
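
For illustration (not from the thread), a minimal sketch of such a reduce step, assuming all we need is each sample's nearest medoid; medoid_idx here is a hypothetical array of medoid row indices:

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

def assign_to_medoids(X, medoid_idx, metric="cosine"):
    # reduce each (chunk_size, n_samples) distance block to a (chunk_size,)
    # label vector, so the full n x n matrix is never materialized
    def reduce_func(D_chunk, start):
        return D_chunk[:, medoid_idx].argmin(axis=1)
    chunks = pairwise_distances_chunked(X, metric=metric, reduce_func=reduce_func)
    return np.concatenate(list(chunks))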

jnothman avatar Aug 04 '20 20:08 jnothman

We could also implement CLARA, which aggregates PAM runs on sub-samples (see Wikipedia). This is an approximate algorithm and there is a risk that the result would not be as good, but at least it will be tractable; the algorithm trades result quality against computational cost, and there is a parameter to tune this trade-off. This could easily be implemented as yet another method for KMedoids. Ref: Finding Groups in Data (eds L. Kaufman and P.J. Rousseeuw). doi:10.1002/9780470316801.ch2. This algorithm is already implemented in R, Matlab, ELKI...
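
To make that concrete, here is a rough sketch of the CLARA outer loop (illustrative only; the parameter names sample_size and n_draws are made up), using KMedoids as the PAM step on each sub-sample:

import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn_extra.cluster import KMedoids

def clara(X, n_clusters, sample_size=1000, n_draws=5, random_state=0):
    # run KMedoids on small random sub-samples and keep the medoids that
    # give the lowest total distance over the *full* dataset
    rng = np.random.default_rng(random_state)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_draws):
        idx = rng.choice(X.shape[0], size=min(sample_size, X.shape[0]), replace=False)
        km = KMedoids(n_clusters=n_clusters, random_state=0).fit(X[idx])
        _, dists = pairwise_distances_argmin_min(X, km.cluster_centers_)
        if dists.sum() < best_cost:
            best_cost, best_medoids = dists.sum(), km.cluster_centers_
    return best_medoids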

Another possibility is this article, which from what I understood should not decrease the quality of the KMedoids results but may save time/space. This would be more experimental, as the method is much more recent, so perhaps for later.

TimotheeMathieu avatar Nov 23 '20 18:11 TimotheeMathieu

K-Medoids will need all the distances several times. So unless you have a very cheap distance that you can afford to compute several times, you will likely be trading memory for even heavier CPU usage. There are some variants (e.g., CLARA, BanditPAM) that target this use case, where you do not want to compute all pairwise distances.
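
For a sense of that trade-off, here is a rough sketch (illustrative only, not this package's implementation) of an "alternate"-style iteration that recomputes only the n x k distances to the current medoids on every pass, plus one per-cluster block at a time for the update step:

import numpy as np
from sklearn.metrics import pairwise_distances, pairwise_distances_argmin_min

def alternate_kmedoids(X, n_clusters, n_iter=20, metric="euclidean", random_state=0):
    rng = np.random.default_rng(random_state)
    medoid_idx = rng.choice(X.shape[0], size=n_clusters, replace=False)
    for _ in range(n_iter):
        # assignment: only n x k distances, recomputed every iteration
        labels, _ = pairwise_distances_argmin_min(X, X[medoid_idx], metric=metric)
        new_idx = medoid_idx.copy()
        for k in range(n_clusters):
            members = np.flatnonzero(labels == k)
            if members.size:
                # update: within-cluster distances only
                D = pairwise_distances(X[members], metric=metric)
                new_idx[k] = members[D.sum(axis=0).argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break
        medoid_idx = new_idx
    return medoid_idx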

kno10 avatar Feb 02 '21 12:02 kno10