fast_pytorch_kmeans

Encounter Kmeans empty cluster

Open NatureGeorge opened this issue 1 year ago • 4 comments

Reproduce

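import torch
from fast_pytorch_kmeans import MultiKMeans
from collections import Counter

# 1000 independent k-means problems, each with 200 points in 3 dimensions
kmeans = MultiKMeans(n_clusters=50, mode='euclidean', verbose=1)
x = torch.randn(1000, 200, 3, device='cuda')
labels = kmeans.fit_predict(x)

# For each problem, count how many distinct cluster labels were actually assigned
print(Counter([len(set(labels[i].tolist())) for i in range(labels.shape[0])]))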
Counter({47: 176,
         48: 170,
         45: 121,
         50: 130,
         46: 153,
         44: 71,
         49: 119,
         42: 16,
         43: 34,
         41: 7,
         39: 1,
         40: 2})

It appears that when the number of clusters approaches a certain fraction of the sample size (here, 50 clusters for 200 points), empty clusters become very likely. The current implementation does not handle these empty clusters well.
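One common way to deal with empty clusters is to move each empty centroid onto a point that lies far from its currently assigned centroid. The sketch below illustrates that idea for batched inputs. It is not what fast_pytorch_kmeans does internally; the function name `reseed_empty_clusters` and the tensor layout (`x` of shape `(B, N, D)`, `centroids` of shape `(B, K, D)`, `labels` of shape `(B, N)` with an integer `int64` dtype) are assumptions made for illustration.

```python
import torch

def reseed_empty_clusters(x, centroids, labels):
    """Hypothetical helper: move every empty centroid onto a poorly fitted point.

    x:         (B, N, D) float tensor of points
    centroids: (B, K, D) float tensor of current centroids
    labels:    (B, N) int64 tensor of cluster assignments
    Returns a new (B, K, D) tensor; non-empty centroids are left unchanged.
    """
    B, K, D = centroids.shape

    # Number of points assigned to each cluster, shape (B, K).
    counts = torch.zeros(B, K, dtype=torch.long, device=x.device)
    counts.scatter_add_(1, labels, torch.ones_like(labels))

    # Distance from each point to its assigned centroid, shape (B, N).
    assigned = torch.gather(centroids, 1, labels.unsqueeze(-1).expand(-1, -1, D))
    dist = (x - assigned).norm(dim=-1)

    # Point indices sorted from worst fitted to best fitted.
    order = dist.argsort(dim=1, descending=True)

    new_centroids = centroids.clone()
    for b in range(B):
        empty = (counts[b] == 0).nonzero(as_tuple=True)[0]
        if empty.numel() > 0:
            # Each empty cluster takes one distinct far-away point as its centroid.
            new_centroids[b, empty] = x[b, order[b, : empty.numel()]]
    return new_centroids
```

A helper like this would be called between the assignment step and the centroid update of each iteration. It assumes the number of clusters does not exceed the number of points per batch.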

NatureGeorge avatar Nov 26 '23 03:11 NatureGeorge

Hi, I could not reproduce your results exactly. Even with 10000 parallel k-means runs, I'm getting:

Counter({50: 9995, 49: 5})

DeMoriarty avatar Dec 06 '23 23:12 DeMoriarty

Appreciate your reply.

That is weird. I am using version 0.2.0.1, and both CPU and CUDA gave me results similar to those I posted.

Some additional environment info:

x86_64 GNU/Linux
Ubuntu 22.04.2 LTS
Python 3.9.16
torch 1.13.1+cu117

NatureGeorge avatar Dec 07 '23 06:12 NatureGeorge

Can you confirm that you are using exactly the code you provided in your first comment? If so, I'm not sure what could cause this.

I'm using Python 3.11, torch 2.1.1+cu121.

DeMoriarty avatar Dec 12 '23 23:12 DeMoriarty

Sorry for the late reply. I am using exactly the demo code I provided. I also tried torch 2.1.1, but the behavior is still the same.

NatureGeorge avatar Jan 15 '24 12:01 NatureGeorge