Clustering.jl icon indicating copy to clipboard operation
Clustering.jl copied to clipboard

Fix broadcasting bug in `repick_unused_centers()` for K-means

Open AzamatB opened this issue 1 year ago • 2 comments

AzamatB avatar Sep 07 '24 11:09 AzamatB

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 95.40%. Comparing base (b4df21a) to head (5f280eb). Report is 16 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #283      +/-   ##
==========================================
- Coverage   95.40%   95.40%   -0.01%     
==========================================
  Files          20       19       -1     
  Lines        1503     1500       -3     
==========================================
- Hits         1434     1431       -3     
  Misses         69       69              

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Sep 07 '24 11:09 codecov-commenter

This is a fix to a pretty significant correctness bug. Can we get this merged ASAP?

AzamatB avatar Sep 21 '24 12:09 AzamatB

@AzamatB Thank you for the fix and sorry for very long delay in reviewing it! Do you have a test case for the fix?

alyst avatar Dec 01 '24 20:12 alyst

It is clearly a bug, it affects the weights of points during empty clusters reinitialization. But from what I can understand, it should not significantly affect the correctness of the results. In some rare cases it can lead to the same point becoming the center of two unused clusters -- but it is hard to construct this case. We should definitely fix it, but it would be nice to have a test case that reveals the problematic behaviour. So if you discovered this bug because kmeans() generated wrong results for your data, it would be nice to have that example.

alyst avatar Jan 06 '25 04:01 alyst

It's actually not easy at all to find the case when this bug would result in a different clustering (and that does not mean at all that the total cost would be worse) -- just because the situation where you would have to re-pick an unused cluster is rare, and re-picking 2 unused clusters (that's where the bug can start manifesting) is very rare. FWIW, here's the one that I have found:

Z = Float32[
    0.7469281 8.85936 26.104477 17.041048 -2.9290304 3.6850576 9.111796 -3.7082646 28.021303 -1.7768137 15.129808 10.813229 -0.37364542 9.3696785 0.058578808 16.494686 26.311668 26.63891 8.56788 24.445705 26.41015 9.561551 16.240145 15.186173 26.556469 26.59538 2.0695605 -0.21763714 -0.76779276 11.00544 -2.1536446 25.703747 24.663855 8.074589 26.758509 25.25317 1.5967612 0.31046262 0.55812395 8.941144 9.466517 25.507006 26.757868 2.425164 25.087585 26.443806 26.326286 -2.9491603 28.033964 9.664289 26.4217 23.865585 13.599952 12.534004 26.082083 9.602425 13.989668 -3.7962832 3.0748796 0.32880965 2.7096572 17.04881 9.982845 1.8529501 27.728786 -1.7613807 -4.2353554 0.8132015 24.837885 26.31062 1.0547543 1.3947017 3.0518942 -2.6900654 24.62906 -1.4229952 25.314922 10.536686 1.6732222 12.771361 12.719584 12.273293 15.357906 25.722084 24.90772 26.05082 1.7522457 2.5748842 -1.2829331 -2.350809 9.122712 15.469745 11.674798 2.4657478 -0.29506856 15.895254 3.0172286 10.194327 -1.0089047 16.014812
    10.079329 18.360752 9.28761 -10.738691 -11.377904 14.796508 14.586522 -9.868497 9.238739 -9.8855295 -9.807078 17.22134 9.412983 -1.2908658 6.4643354 -10.106296 10.545128 11.217348 16.879663 -3.241572 9.065372 -3.4642792 -10.768875 -11.025509 9.280065 6.852204 10.405429 10.598914 9.294289 15.889821 9.373185 -3.4529555 -4.348801 15.373303 -4.503598 7.714737 12.660385 10.243679 10.298691 14.378907 -3.7512515 -2.8529565 -2.5031896 13.335922 -2.0082817 8.725614 9.936363 -10.26723 8.548306 14.545052 -2.9661767 -3.8050623 -2.4869242 -1.7874911 8.777304 -2.453476 -1.549685 -10.284147 13.414661 10.497435 14.179338 -10.703788 15.013265 10.063806 -5.3631563 7.3521304 -9.017388 10.804814 -2.910087 -3.1865885 9.722448 14.545546 13.292266 -8.730829 -3.5044644 10.917173 8.672507 15.791334 12.100385 -2.5765386 -4.6828575 -2.410526 -11.11536 -4.988103 -4.856656 -3.0215774 11.570467 13.921742 6.850209 9.268251 -1.5591162 -3.2451906 15.143096 9.793272 11.128278 -10.880159 13.058364 -2.2190018 8.365673 -11.913434
]

rng = StableRNG(232424)
res = kmeans(Z, 10; rng=rng)

alyst avatar Jan 06 '25 05:01 alyst