fast_pytorch_kmeans
KMeans produces empty clusters
Reproduce
import torch
from fast_pytorch_kmeans import MultiKMeans
from collections import Counter
kmeans = MultiKMeans(n_clusters=50, mode='euclidean', verbose=1)
x = torch.randn(1000, 200, 3, device='cuda')
labels = kmeans.fit_predict(x)
print(Counter([len(set(labels[i].tolist())) for i in range(labels.shape[0])]))
Counter({47: 176,
48: 170,
45: 121,
50: 130,
46: 153,
44: 71,
49: 119,
42: 16,
43: 34,
41: 7,
39: 1,
40: 2})
It appears that when the number of clusters approaches a certain fraction of the sample size (here 50 clusters for 200 samples per problem), empty clusters are very likely to appear. The current implementation does not handle these empty clusters well.
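One common way to deal with empty clusters is to re-seed each empty centroid with a data point that is far from its currently assigned centroid. The following is only a minimal sketch of that idea for a single k-means problem. It is not part of fast_pytorch_kmeans, and the function name `reseed_empty_clusters` is hypothetical.

```python
import torch

def reseed_empty_clusters(x, centroids, labels):
    """Hypothetical helper: re-seed centroids of empty clusters.

    x:         (n_samples, dim) float tensor, data of ONE k-means problem
    centroids: (n_clusters, dim) float tensor
    labels:    (n_samples,) long tensor with the current assignments
    Returns the (possibly updated) centroids and the number of re-seeded clusters.
    """
    n_clusters = centroids.shape[0]
    # Number of points assigned to each cluster
    counts = torch.bincount(labels, minlength=n_clusters)
    empty = torch.nonzero(counts == 0).flatten()
    if empty.numel() == 0:
        return centroids, 0
    # Distance of every point to the centroid it is currently assigned to
    dist = (x - centroids[labels]).norm(dim=1)
    # Pick the farthest points, one per empty cluster
    farthest = torch.topk(dist, k=empty.numel()).indices
    centroids = centroids.clone()  # leave the caller's tensor untouched
    centroids[empty] = x[farthest]
    return centroids, empty.numel()

# Tiny example: cluster 2 has no points assigned to it
x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 3.0]])
centroids = torch.tensor([[0.0, 0.5], [10.0, 1.0], [100.0, 100.0]])
labels = torch.tensor([0, 0, 1, 1])
new_centroids, n_fixed = reseed_empty_clusters(x, centroids, labels)
# Cluster 2 is re-seeded at x[3], the point farthest from its centroid
```

Such a step would run after each assignment step. For batched input like the `(1000, 200, 3)` tensor above, it would have to be applied to each problem separately. It does not guarantee that no cluster is empty after the next assignment, so the check may need to be repeated.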
Hi, I could not reproduce your results exactly. Even with 10000 parallel kmeans, I'm getting:
Counter({50: 9995, 49: 5})
Appreciate your reply.
That is weird. I am using version 0.2.0.1, and both cpu and cuda gave me results similar to the ones I posted.
Some additional environment info:
x86_64 GNU/Linux
Ubuntu 22.04.2 LTS
Python 3.9.16
torch 1.13.1+cu117
Can you confirm you are using the exact same code you provided in your first comment? If that's the case, I'm not sure what could cause this to happen.
I'm using python 3.11, torch 2.1.1+cu121
Sorry for the late reply. I am using exactly the demo code I provided. I also tried torch 2.1.1, but the behavior is still the same.