fast-kmeans
fast-kmeans copied to clipboard
Sort-means appears to have a bug
Sort-means is giving slightly different results in some of my testing, compared to all the other algorithms. I have a test case where every other algorithm converges in 45 iterations, but it takes 42, and gives slightly different results (centers).
Steps to reproduce:
- run naive k-means and sort-means on the attached input, with k=10, using the library's k-means++ initialization, with a max iterations of 1000
- expected result: same number of iterations and centers
- actual result: different centers, different iterations
The initial centers chosen by k-means++ on this dataset are:
-0.379863 1.80735 0.658205 0.236417 1.52649
-0.456454 -0.44091 0.932621 1.4306 -0.77737
1.10611 -1.30305 -0.202388 -1.29549 0.150969
0.951702 0.885737 -0.0140388 -0.295967 -2.10619
2.52547 0.803819 1.57157 0.481766 0.168094
-0.35617 0.423922 0.72442 -0.557642 0.466705
0.619142 -1.08521 -1.85976 -1.19759 -0.763672
-0.929235 0.901898 -0.717973 -2.60481 0.0617718
-0.391341 0.258408 -0.368092 0.53643 -1.35026
0.242555 -1.5091 -0.00501431 0.30317 1.244
I've started looking at this; so far, I've replicated these results, but nothing strikes me as obviously wrong about the implementation. I'll run some more detailed tests to see if I can find anything.
I would probably tackle this by printing out detailed information about each iteration (both for sort-means and for the naive algorithm), identifying the earliest iteration where something goes wrong (some assignment is incorrect), and drilling into why that happened.
Will do!