fast_pytorch_kmeans Encounter Kmeans empty cluster

Open NatureGeorge opened this issue 1 year ago • 4 comments

Reproduce

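import torch
from collections import Counter
from fast_pytorch_kmeans import MultiKMeans

# 1000 parallel k-means problems, each with 200 points in 3 dimensions
kmeans = MultiKMeans(n_clusters=50, mode='euclidean', verbose=1)
x = torch.randn(1000, 200, 3, device='cuda')
labels = kmeans.fit_predict(x)

# Histogram of the number of distinct (non-empty) clusters per problem
print(Counter([len(set(labels[i].tolist())) for i in range(labels.shape[0])]))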

Counter({47: 176,
         48: 170,
         45: 121,
         50: 130,
         46: 153,
         44: 71,
         49: 119,
         42: 16,
         43: 34,
         41: 7,
         39: 1,
         40: 2})
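
To make the histogram above easier to read, here is a minimal sketch that reports empty clusters directly instead of counting distinct labels. The helper names `empty_clusters` and `summarize` are hypothetical and not part of fast_pytorch_kmeans; the sketch assumes the labels have been converted to nested Python lists, for example with `labels.tolist()`.

```python
from collections import Counter

def empty_clusters(labels_row, n_clusters):
    """Return the cluster ids in range(n_clusters) that received no points."""
    used = set(labels_row)
    return [c for c in range(n_clusters) if c not in used]

def summarize(labels, n_clusters):
    """Histogram: number of empty clusters -> number of k-means problems."""
    return Counter(len(empty_clusters(row, n_clusters)) for row in labels)

# Usage with the reproduction above (assumes `labels` is the tensor it returns):
# print(summarize(labels.tolist(), 50))
```

An entry such as `47: 176` in the output above corresponds to 176 problems with 3 empty clusters each.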

It appears that when the number of clusters approaches a certain fraction of the sample size (here, 50 clusters for 200 points), empty clusters become very likely. The current implementation does not handle these empty clusters well.
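
One common remedy, sketched here in plain Python so it runs without a GPU, is to re-seed each empty cluster at the point that lies farthest from its currently assigned centroid. This is an illustration of the general technique under that assumption, not what fast_pytorch_kmeans does or should do; the function names `sq_dist`, `assign`, and `update_with_reseed` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]

def update_with_reseed(points, labels, centroids):
    """One centroid update that re-seeds empty clusters instead of leaving them."""
    k, dim = len(centroids), len(points[0])
    members = [[] for _ in range(k)]
    for i, lab in enumerate(labels):
        members[lab].append(i)

    # Donor candidates: points from clusters with at least two members,
    # ordered by distance to their assigned centroid, farthest first.
    donors = [i for i in range(len(points)) if len(members[labels[i]]) > 1]
    donors.sort(key=lambda i: sq_dist(points[i], centroids[labels[i]]), reverse=True)
    donors = iter(donors)

    new_centroids = []
    for c in range(k):
        if members[c]:
            # Ordinary update: mean of the assigned points.
            new_centroids.append([sum(points[i][d] for i in members[c]) / len(members[c])
                                  for d in range(dim)])
        else:
            # Empty cluster: move the centroid onto a far-away point if one is left,
            # otherwise keep the old centroid.
            i = next(donors, None)
            new_centroids.append(list(points[i]) if i is not None else list(centroids[c]))
    return new_centroids

# Tiny example: the third centroid starts far from all points and gets no members.
points = [[0.0], [1.0], [10.0], [11.0]]
centroids = [[0.5], [10.5], [100.0]]
labels = assign(points, centroids)
centroids = update_with_reseed(points, labels, centroids)
labels = assign(points, centroids)
```

Re-seeding does not guarantee that every cluster is populated after the next assignment step, so in practice the update would be repeated inside the usual iteration loop.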

Nov 26 '23 03:11 NatureGeorge

Hi, I could not reproduce your results. Even with 10000 parallel k-means runs, I'm getting:

Counter({50: 9995, 49: 5})

Dec 06 '23 23:12 DeMoriarty

Appreciate your reply.

That is weird. I am using version 0.2.0.1, and both CPU and CUDA give results similar to what I posted.

Some additional environment info:

x86_64 GNU/Linux
Ubuntu 22.04.2 LTS
Python 3.9.16
torch 1.13.1+cu117
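
For comparing environments, a small script like the following can collect the same details in one place. It is a sketch using only the standard library; the distribution names `torch` and `fast-pytorch-kmeans` are assumptions about how the packages were installed, and `collect_versions` is a hypothetical helper.

```python
import platform
from importlib import metadata

def collect_versions(packages=("torch", "fast-pytorch-kmeans")):
    """Gather platform, Python, and installed package versions into a dict."""
    info = {
        "platform": platform.platform(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution name may differ or the package may be missing.
            info[name] = "not installed"
    return info

if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```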

Dec 07 '23 06:12 NatureGeorge

Can you confirm you are using the exact same code you provided in your first comment? If that's the case, I'm not sure what could cause this.

I'm using Python 3.11 and torch 2.1.1+cu121.

Dec 12 '23 23:12 DeMoriarty

Sorry for the late reply. I am using exactly the demo code I provided. I also tried torch 2.1.1, but the behavior is still the same.

Jan 15 '24 12:01 NatureGeorge
