AutoML Binary Classification Experiment runs for a few seconds and finishes without models

Open 80LevelElf opened this issue 2 years ago • 5 comments

System Information (please complete the following information):

  • OS & Version: mcr.microsoft.com/dotnet/sdk:5.0 docker image (Linux Alpine)
  • ML.NET Version: 3.0.0-preview.23511.1
  • .NET Version: .NET 5.0

Describe the bug

We currently use ML.NET 2, but because of the bug fix in https://github.com/dotnet/machinelearning/pull/6571 we have to switch to ML.NET 3 to train our binary classification models (we need the PositiveRecall optimizing metric).

But it looks like the Binary Classification Experiment is somehow broken in ML.NET 3:

        var settings = new BinaryExperimentSettings
        {
            MaxExperimentTimeInSeconds = 30 * 60,
            //MaxModels = 10,
            OptimizingMetric = BinaryClassificationMetric.PositiveRecall,
            MaximumMemoryUsageInMegaByte = 7500,
            UseAutoZeroTuner = false
        };

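        // Assumption: the report does not show how `experiment` is created; this is the usual AutoML call.
        // `mlContext` is assumed to be an existing MLContext instance.
        var experiment = mlContext.Auto().CreateBinaryClassificationExperiment(settings);

        ExperimentResult<BinaryClassificationMetrics> experimentResult = experiment
            .Execute(trainDataView, nameof(MlModelRow.Label), nameof(MlModelRow.LearningGroup));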

We use only the FastForest and LightGBM trainers. On my local PC (Windows 10) it works great, but in the production Docker image (Alpine Linux) training finishes after 10-30 seconds with:

Training time finished without completing a successful trial. Either no trial completed or the metric for all completed trials are NaN or Infinity

I have tried to:

  1. Use MaxModels = 10 together with MaxExperimentTimeInSeconds
  2. Use MaxModels = 10 instead of MaxExperimentTimeInSeconds
  3. Set UseAutoZeroTuner to true

But nothing works for me. An important point: MLNET_BACKEND is not set, so we are not using OneDAL in the production or test environment.
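One way to confirm inside the container that the variable is really unset is to log it at startup. This is a minimal sketch; `BackendCheck` and its methods are hypothetical helper names, not part of ML.NET, and it only reports the value without interpreting it.

```csharp
using System;

public static class BackendCheck
{
    // Pure helper so the message can be checked without touching the real environment.
    public static string Describe(string value) =>
        string.IsNullOrWhiteSpace(value)
            ? "MLNET_BACKEND is not set"
            : $"MLNET_BACKEND={value}";

    // Reads the variable from the current process environment.
    public static string DescribeCurrent() =>
        Describe(Environment.GetEnvironmentVariable("MLNET_BACKEND"));
}
```

Calling `Console.WriteLine(BackendCheck.DescribeCurrent());` before the experiment starts puts the effective setting into the pod logs.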

80LevelElf • Nov 11 '23 12:11

I have just tried switching to OneDAL mode in production. It doesn't help :(

80LevelElf • Nov 11 '23 14:11

Are there maybe some temporary workarounds?

I think it's a really big problem, given that ML.NET 3 should be released this month.

80LevelElf • Nov 15 '23 09:11

I have tried it with the new ML.NET 3.

The behavior is the same.

80LevelElf • Nov 29 '23 12:11

@LittleLittleCloud @luisquintanilla

Hi friends! Is there maybe any workaround, or anything we can check on our side?

80LevelElf • Dec 09 '23 10:12

So I have found the problem: it is caused by MaximumMemoryUsageInMegaByte = 7500.

Right after starting, the used memory exceeds 7500 MB and training is canceled.
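To check whether the process is already close to the limit before any trial starts, the baseline memory can be logged first. The sketch below is only a diagnostic aid: `MemoryBaseline` is a hypothetical helper, and comparing the working set against the limit is an assumption, since this thread does not say which measurement ML.NET uses.

```csharp
using System;
using System.Diagnostics;

public static class MemoryBaseline
{
    public static double ToMegabytes(long bytes) => bytes / (1024.0 * 1024.0);

    // True when the given usage is already at or above the configured limit.
    public static bool ExceedsLimit(long usedBytes, double limitMb) =>
        ToMegabytes(usedBytes) >= limitMb;

    // Summarizes the current process memory next to the limit.
    // Assumption: the working set is used here as the figure to compare.
    public static string Describe(double limitMb)
    {
        using var process = Process.GetCurrentProcess();
        long workingSet = process.WorkingSet64;
        long managed = GC.GetTotalMemory(forceFullCollection: false);
        return $"working set: {ToMegabytes(workingSet):F0} MB, " +
               $"managed heap: {ToMegabytes(managed):F0} MB, " +
               $"limit: {limitMb:F0} MB, " +
               $"over limit: {ExceedsLimit(workingSet, limitMb)}";
    }
}
```

Logging `MemoryBaseline.Describe(7500)` right after loading the training data shows how much headroom is left before the experiment even begins.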

At first glance this is understandable behavior, but it is not very useful. In fact ML.NET doesn't manage memory consumption in our case. We have to choose between:

  • ML.NET starts training a lot of models and we soon get an out-of-memory error on our Kubernetes training pod (~36 GB)
  • Or ML.NET cancels training in 95% of cases after reaching the limit

But can't ML.NET control the number of models trained at the same time based on the memory limit? For example, with a limit of 7500 MB and one model needing 2500 MB to train, it could start 3 models.
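The idea can be sketched as follows. This is not how ML.NET schedules trials; it only illustrates the proposal, assuming the per-model memory need is known up front (in practice it usually has to be estimated). `MemoryBudgetScheduler` and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class MemoryBudgetScheduler
{
    // Number of jobs that fit into the budget at once.
    // Always at least 1, so a single oversized job still runs (a design choice, not a hard limit).
    public static int MaxConcurrentJobs(long budgetMb, long perJobMb)
    {
        if (budgetMb <= 0) throw new ArgumentOutOfRangeException(nameof(budgetMb));
        if (perJobMb <= 0) throw new ArgumentOutOfRangeException(nameof(perJobMb));
        return (int)Math.Max(1, budgetMb / perJobMb);
    }

    // Runs every job, with no more than MaxConcurrentJobs active at the same time.
    public static async Task RunAllAsync(IEnumerable<Func<Task>> jobs, long budgetMb, long perJobMb)
    {
        using var gate = new SemaphoreSlim(MaxConcurrentJobs(budgetMb, perJobMb));
        var running = jobs.Select(async job =>
        {
            await gate.WaitAsync();          // wait for a free slot in the budget
            try { await job(); }
            finally { gate.Release(); }      // free the slot for the next job
        }).ToList();
        await Task.WhenAll(running);
    }
}
```

With the numbers above, `MemoryBudgetScheduler.MaxConcurrentJobs(7500, 2500)` returns 3, so three training jobs would run at a time and the rest would wait for a free slot.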

80LevelElf • Dec 22 '23 06:12