
KMeans cluster analysis is non-deterministic when using KMeansYinyang initialization, even with fixed MLContext seed

Open • mikegoatly opened this issue 3 years ago • 4 comments

System Information (please complete the following information):

  • OS & Version: Windows 10
  • ML.NET Version: ML.NET v1.7.1 (also tested with 2.0.0-preview.22313.1)
  • .NET Version: .NET 6.0

Describe the bug

When creating a KMeans cluster prediction engine for a training data set that does not change, the predicted cluster IDs are not consistent between runs, even when the seed is specified for the MLContext.

To Reproduce

For this fixed data set:

using Microsoft.ML;
using Microsoft.ML.Data;

public class ModelData
{
    public float Value1 { get; set; }
    public float Value2 { get; set; }
}

public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedClusterId;

    [ColumnName("Score")]
    public float[] Distances = null!;

    [ColumnName("Features")]
    public float[] Features = null!;
}

var data = Enumerable.Range(0, 60).Select(x => new ModelData { Value1 = Random.Shared.Next(0, 2000), Value2 = Random.Shared.Next(0, 7) }).ToList();

And this function to create a new instance of the prediction engine:

const string FeaturesColumnName = "Features";
const int ClusterCount = 4;

public PredictionEngine<ModelData, ClusterPrediction> CreateModel(IEnumerable<ModelData> data)
{
    var mlContext = new MLContext(seed: 0);

    var dataView = mlContext.Data.LoadFromEnumerable(data);

    IEstimator<ITransformer> pipeline = mlContext.Transforms
        .Concatenate(FeaturesColumnName, new[] { nameof(ModelData.Value1), nameof(ModelData.Value2) })
        .Append(mlContext.Clustering.Trainers.KMeans(FeaturesColumnName, numberOfClusters: ClusterCount));

    var model = pipeline.Fit(dataView);

    return mlContext.Model.CreatePredictionEngine<ModelData, ClusterPrediction>(model);
}

We should be able to create the same prediction engine producing the same results many times. The following creates the engine in a loop and calculates the cluster ids for each of the data set's data points, displaying the number of items that end up in each of the clusters:

using System.Linq;

for (var i = 0; i < 10; i++)
{
    var engine = CreateModel(data);

    var clusterCounts = data.Select(d => engine.Predict(d).PredictedClusterId).ToLookup(x => (int)x);

    Console.WriteLine(string.Join(" ", Enumerable.Range(1, ClusterCount).Select(x => $"Cluster {x}: {clusterCounts[x].Count()} items")));
}

This outputs:

Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 13 items Cluster 2: 20 items Cluster 3: 12 items Cluster 4: 15 items
Cluster 1: 15 items Cluster 2: 15 items Cluster 3: 17 items Cluster 4: 13 items
Cluster 1: 23 items Cluster 2: 22 items Cluster 3: 8 items Cluster 4: 7 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 22 items Cluster 2: 23 items Cluster 3: 8 items Cluster 4: 7 items
Cluster 1: 20 items Cluster 2: 13 items Cluster 3: 15 items Cluster 4: 12 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items

Expected behavior

I would expect that each time the model is constructed from an MLContext with a fixed seed, the predicted cluster counts would be identical, with the same data points assigned to each cluster.
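The loop above compares only cluster sizes, which could match even if individual points moved between clusters. A minimal sketch of a stricter check follows, assuming the `data` list and `CreateModel` function from the repro; the `Assign` helper is a hypothetical name introduced here for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper: trains a fresh engine and returns the cluster ID
// predicted for each data point, in the original order of `data`.
uint[] Assign()
{
    var engine = CreateModel(data);
    return data.Select(d => engine.Predict(d).PredictedClusterId).ToArray();
}

// Use the first run as the baseline and compare every later run to it,
// point by point rather than by cluster size.
var baseline = Assign();

for (var i = 1; i < 10; i++)
{
    var same = baseline.SequenceEqual(Assign());
    Console.WriteLine($"Run {i}: {(same ? "identical" : "different")} assignments");
}
```

If training were deterministic, every run would report identical assignments.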

Screenshots, Code, Sample Projects

I've attached a .NET Interactive notebook (zipped) for ease of reproduction.

mikegoatly • Oct 13 '22 09:10

Further investigation has shown that if I use the KMeansPlusPlus initialization algorithm then the clustering becomes deterministic, so this looks like it's a bug in the KMeansYinyang initialization algorithm.
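For readers who want to try the same workaround, here is a minimal sketch of how the initialization algorithm can be selected through `KMeansTrainer.Options`. It assumes the `ModelData` and `ClusterPrediction` classes from the original repro; the method name `CreateModelWithKMeansPlusPlus` is hypothetical, and this is not necessarily the exact code used in the investigation.

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Trainers;

// Hypothetical variant of CreateModel from the repro. The only intended
// difference is that the initialization algorithm is set explicitly
// instead of relying on the trainer's default.
PredictionEngine<ModelData, ClusterPrediction> CreateModelWithKMeansPlusPlus(IEnumerable<ModelData> data)
{
    var mlContext = new MLContext(seed: 0);
    var dataView = mlContext.Data.LoadFromEnumerable(data);

    var options = new KMeansTrainer.Options
    {
        FeatureColumnName = "Features",
        NumberOfClusters = 4,
        // Select KMeans++ initialization rather than KMeansYinyang.
        InitializationAlgorithm = KMeansTrainer.InitializationAlgorithm.KMeansPlusPlus
    };

    IEstimator<ITransformer> pipeline = mlContext.Transforms
        .Concatenate("Features", nameof(ModelData.Value1), nameof(ModelData.Value2))
        .Append(mlContext.Clustering.Trainers.KMeans(options));

    var model = pipeline.Fit(dataView);

    return mlContext.Model.CreatePredictionEngine<ModelData, ClusterPrediction>(model);
}
```

Swapping this in for `CreateModel` in the reproduction loop should show whether the cluster counts stabilise on a given data set.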

mikegoatly • Nov 17 '22 21:11

Thanks for finding this for us! We will take a look and see what we can figure out.

michaelgsharp • Nov 28 '22 19:11

#7429 STILL HAPPENS.

superichmann • Mar 27 '25 10:03

I'm doing it

samuelyao107 • Sep 01 '25 14:09