different runs of k-means clustering result in different outputs

Open ghost opened this issue 10 years ago • 6 comments

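// Node.js; skip this line if clusterfck is already loaded as a global
var clusterfck = require("clusterfck");

// One-dimensional points, each wrapped in its own array
var colors = [
   [97],
   [1],
   [53],
   [79],
   [3],
   [351],
   [16]
];

// Ask for 3 clusters
var clusters = clusterfck.kmeans(colors, 3);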

Result A: [1, 3, 16], [53, 79, 97], [351]
Result B: [1, 3, 16, 53], [79, 97], [351]

ghost • Feb 25 '15 14:02

That's normal: k-means places the initial seeds (cluster centers) randomly, so each run starts from a different set of seed locations and can therefore end in (slightly) different outcomes. For a nice introduction to k-means and clustering, see: http://web.cs.sunyit.edu/~mike/cs542/Jain50YearsBeyondKmeans.pdf
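
To make the random seeding concrete, here is a minimal sketch of drawing k distinct points as initial centers. The names mulberry32 and randomCentroids below are illustrative helpers written for this example, not clusterfck's code, and the library's own initialization may work differently. Math.random() cannot be seeded, so the sketch takes the random source as a parameter; passing a seeded generator makes the draw repeatable.

```javascript
// Small seedable PRNG (mulberry32). Returns a function producing numbers in [0, 1).
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: pick k distinct points as initial centers
// using a partial Fisher-Yates shuffle driven by `random`.
function randomCentroids(points, k, random) {
  const pool = points.slice();
  const count = Math.min(k, pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}

const colors = [[97], [1], [53], [79], [3], [351], [16]];

// Same seed, same initial centers on every run.
const seedsA = randomCentroids(colors, 3, mulberry32(42));
const seedsB = randomCentroids(colors, 3, mulberry32(42));

// Unseeded: a fresh draw each time the script runs.
const seedsC = randomCentroids(colors, 3, Math.random);
```

With an unseeded source, two runs can start from different centers, which is where the differing outputs come from.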

bbroeksema • Feb 27 '15 19:02

Thanks for the literature. However, this behaviour should be explicitly mentioned somewhere, because in other tools (e.g., R, Weka) the default k-means implementation can handle such cases.

ghost • Mar 02 '15 14:03

How do R and Weka handle it? Do they use the same random seed for each run?

Ouwen • Mar 02 '15 19:03

In R you can pass "centers", which is either the number of clusters (which results in similarly nondeterministic behavior) or the actual initial, distinct cluster centers (in which case, I believe but have not actually checked, it behaves deterministically). I don't know about Weka.

bbroeksema • Mar 03 '15 08:03

You could modify the kmeans function so that, instead of setting this.centroids = this.randomCentroids(...), it takes the centroids as an argument. That should allow different runs to produce the same results.
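
As a rough illustration of that idea, here is a standalone sketch of k-means (Lloyd's algorithm) that starts from caller-supplied centroids. It is not a patch to clusterfck: kmeansFrom, nearest and squaredDistance are names made up for this example. Run on the data from the issue, two different starting sets reproduce the two reported results, and the same starting set always gives the same answer.

```javascript
// Squared Euclidean distance between two points of equal dimension.
function squaredDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return sum;
}

// Index of the centroid closest to the point (the first one wins ties).
function nearest(point, centroids) {
  let best = 0;
  for (let c = 1; c < centroids.length; c++) {
    if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
      best = c;
    }
  }
  return best;
}

// Hypothetical helper: k-means starting from the given centroids instead of random ones.
// Returns an array of clusters, each an array of points.
function kmeansFrom(points, initialCentroids, maxIterations = 100) {
  let centroids = initialCentroids.map((c) => c.slice());
  let assignment = [];
  for (let iter = 0; iter < maxIterations; iter++) {
    const next = points.map((p) => nearest(p, centroids));
    const unchanged = next.every((c, i) => c === assignment[i]);
    assignment = next;
    if (unchanged) break; // converged: no point changed cluster
    centroids = centroids.map((old, c) => {
      const members = points.filter((_, i) => assignment[i] === c);
      if (members.length === 0) return old; // keep the center of an empty cluster
      return old.map((_, d) => members.reduce((s, p) => s + p[d], 0) / members.length);
    });
  }
  return centroids.map((_, c) => points.filter((_, i) => assignment[i] === c));
}

const colors = [[97], [1], [53], [79], [3], [351], [16]];

// Members match Result A from the issue: 1, 3, 16 | 53, 79, 97 | 351
const runA = kmeansFrom(colors, [[1], [53], [351]]);

// Members match Result B from the issue: 1, 3, 16, 53 | 79, 97 | 351
const runB = kmeansFrom(colors, [[16], [97], [351]]);
```

Points within each cluster come back in input order, so sort them if you need the exact layout shown in the issue.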

user24 • Jun 10 '15 06:06

Often k-means is run multiple times, and for each run an error is calculated as the mean squared distance of each point to the centroid of the cluster to which it belongs. You can then keep the run that minimizes this error and use its centroids.
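
A minimal sketch of that selection step, under the assumption that each run returns an array of clusters, each an array of points. The helpers centroidOf, meanSquaredError and bestOfRuns are made up for this example. To keep it self-contained, the "runs" here simply replay Result A and Result B from the issue instead of calling a clustering library.

```javascript
// Component-wise mean of a non-empty cluster of points.
function centroidOf(cluster) {
  return cluster[0].map((_, d) => cluster.reduce((s, p) => s + p[d], 0) / cluster.length);
}

// Mean squared distance from every point to the centroid of its own cluster.
function meanSquaredError(clusters) {
  let total = 0;
  let count = 0;
  for (const cluster of clusters) {
    if (cluster.length === 0) continue; // empty clusters contribute nothing
    const center = centroidOf(cluster);
    for (const p of cluster) {
      total += p.reduce((s, v, d) => s + (v - center[d]) ** 2, 0);
      count++;
    }
  }
  return total / count;
}

// Call runOnce() `runs` times and keep the clustering with the lowest error.
function bestOfRuns(runs, runOnce) {
  let best = null;
  for (let r = 0; r < runs; r++) {
    const clusters = runOnce();
    const error = meanSquaredError(clusters);
    if (best === null || error < best.error) best = { clusters, error };
  }
  return best;
}

// Stand-in for repeated k-means runs: replay the two results reported in the issue.
const resultA = [[[1], [3], [16]], [[53], [79], [97]], [[351]]];
const resultB = [[[1], [3], [16], [53]], [[79], [97]], [[351]]];
const replay = [resultB, resultA];
let call = 0;
const best = bestOfRuns(replay.length, () => replay[call++]);
// best.clusters is resultA: its error (about 158.8) is lower than resultB's (about 272.1)
```

With a real library the callback would run the clustering itself, for example () => clusterfck.kmeans(colors, 3), assuming the return value has the cluster-of-points shape described above.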

tayden • Feb 11 '16 00:02