
Memory leak by Auto ML

Open 80LevelElf opened this issue 3 years ago • 7 comments

System Information (please complete the following information):

  • OS & Version: Windows 10
  • ML.NET Version: ML.NET 2.0 preview (but have the same results when using 1.7.1 versions)
  • .NET Version: NET 5.0

Describe the bug I have noticed that from time to time my Kubernetes pod hits an out-of-memory exception. According to Grafana, it starts at 500 MB and reaches 6 GB after some time spent processing the learning queue.

Then I tried to reproduce the problem on my local machine, and it looks like I found the cause with memory profiling:

// Runs AutoML learning internally
await container.Resolve<IMlModelLearningPartTrainer>().Learn(Guid.Parse("43f60690-a594-4e8a-93bd-a91a2d836139"));

// After learning has finished, force a garbage collection
GC.Collect();

// Here I take a memory snapshot

And this is what I found:

[Image: memory profiler snapshot, captioned "Leak"]

It looks like AutoML uses a timer callback internally to stop the learning. But for some reason the timer is not disposed even after learning has finished. The timer lambda captures the outer learning context, which is why the GC can't collect the learning data.

One hour after the learning has ended, the learning data is still in memory. It's a memory leak.

I hope these snippets of our code help. This is how we start the learning:

        var settings = new BinaryExperimentSettings
        {
            MaxExperimentTimeInSeconds = (uint) maxExperimentTime,
            OptimizingMetric = trainerAndMetric.Metric
        };
        
        settings.Trainers.Clear();
        settings.Trainers.Add(trainerAndMetric.Trainer);

        var experiment = mlContext.Auto().CreateBinaryClassificationExperiment(settings);

To Reproduce Steps to reproduce the behavior:

  1. Start an AutoML learning for 5-10 minutes
  2. Wait for the learning to finish
  3. Call GC.Collect
  4. Take a memory snapshot
  5. Check the data. The learning data is still in memory
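
As a complement to the profiler snapshot, steps 3 to 5 can be roughly approximated in code with a weak reference. This is only a sketch: `LeakCheck`, `Track`, and `IsStillReachable` are hypothetical helper names, not part of ML.NET, and the object you track (for example the loaded training data) is your own choice.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical helper, not part of ML.NET.
public static class LeakCheck
{
    // Forces a full blocking collection, lets finalizers run, then collects again.
    public static void FullCollect()
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }

    // Creates the object in a separate, non-inlined frame so that the caller
    // holds only a weak reference to it and no strong local.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static WeakReference Track(Func<object> create)
    {
        return new WeakReference(create());
    }

    // True if something still holds a strong reference after a full collection.
    public static bool IsStillReachable(WeakReference tracked)
    {
        FullCollect();
        return tracked.IsAlive;
    }
}
```

If `IsStillReachable` keeps returning true long after the learning has ended, a memory profiler is still needed to see which root holds the object; the weak reference only tells you that something does.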

Expected behavior No data leak :)

Additional context I reproduced the problem in the 2.0 preview version of ML.NET and AutoML. But we only upgraded to that version while trying to solve the memory problem. We have the same out-of-memory problem with both the 2.0 preview and 1.7.1 versions, so I assume that the 1.7.1 version has the same memory leak.

80LevelElf avatar Jul 13 '22 20:07 80LevelElf

Is there any temporary workaround?

80LevelElf avatar Jul 14 '22 13:07 80LevelElf

@LittleLittleCloud any thoughts on this? Or how we could fix the timer so it won't hold the data?

michaelgsharp avatar Jul 14 '22 20:07 michaelgsharp

@michaelgsharp @LittleLittleCloud

I am not 100% sure, but it looks like you have to dispose _maxExperimentTimeTimer in https://github.com/dotnet/machinelearning/blob/main/src/Microsoft.ML.AutoML/Experiment/Experiment.cs

Maybe the MaxExperimentTimeExpiredEvent method is a good place to call _maxExperimentTimeTimer.Dispose()
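
To illustrate the general pattern I mean, here is a minimal standalone sketch. It is not the code from Experiment.cs: `TimedRun`, `TimedRunDemo`, and their members are hypothetical names, and the only real API used is `System.Threading.Timer`. The timer's delegate targets the object that owns the data, so the owner disposes the timer as soon as the work is done.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a training run; not ML.NET code.
public sealed class TimedRun : IDisposable
{
    private readonly byte[] _trainingData;  // reachable from the timer callback via `this`
    private readonly Timer _deadlineTimer;
    private int _expired;                   // 0 = deadline not fired, 1 = fired

    public TimedRun(byte[] trainingData, TimeSpan maxTime)
    {
        _trainingData = trainingData;
        // One-shot timer. The lambda captures `this`, so the timer's delegate
        // references this instance and everything it holds.
        _deadlineTimer = new Timer(_ => OnDeadline(), null, maxTime, Timeout.InfiniteTimeSpan);
    }

    public bool Expired => Volatile.Read(ref _expired) == 1;

    public int DataLength => _trainingData.Length;

    private void OnDeadline() => Interlocked.Exchange(ref _expired, 1);

    // Stops the timer so its callback is no longer scheduled.
    public void Dispose() => _deadlineTimer.Dispose();
}

public static class TimedRunDemo
{
    // Disposes the run before its deadline and reports whether the callback still fired.
    public static bool RunAndDispose()
    {
        var run = new TimedRun(new byte[1024], TimeSpan.FromSeconds(2));
        run.Dispose();      // work finished early: release the timer right away
        Thread.Sleep(100);  // short wait; the 2 s deadline was cancelled by Dispose
        return run.Expired;
    }
}
```

Whether disposing the timer alone would fix the leak in AutoML is my guess, not something I have verified.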

80LevelElf avatar Jul 15 '22 12:07 80LevelElf

The last trial in AutoML sometimes doesn't get cancelled even after the time is up. You can force-cancel it by calling

context.CancelExecution()

You can also check https://github.com/dotnet/machinelearning/issues/6188 for further explanation.
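
A minimal sketch of how that call could be wired in, assuming one MLContext per experiment run (whether a context can be reused after cancelling is not covered here). `AutoMlWorkaround` and `RunThenCancel` are hypothetical names; the only ML.NET call is `MLContext.CancelExecution()`.

```csharp
using System;
using Microsoft.ML;

// Hypothetical helper, not part of ML.NET.
public static class AutoMlWorkaround
{
    // Runs the experiment delegate, then always calls CancelExecution()
    // on the context, even if the experiment throws.
    public static T RunThenCancel<T>(MLContext mlContext, Func<T> runExperiment)
    {
        try
        {
            return runExperiment();
        }
        finally
        {
            mlContext.CancelExecution();
        }
    }
}

// Possible usage, assuming `experiment` was created from `mlContext` as in the
// original post and `trainData` is your own IDataView:
//
// var result = AutoMlWorkaround.RunThenCancel(
//     mlContext,
//     () => experiment.Execute(trainData, labelColumnName: "Label"));
```

The try/finally is there so the cancel call also runs when the experiment fails partway through.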

LittleLittleCloud avatar Jul 15 '22 18:07 LittleLittleCloud

Thank you!

80LevelElf avatar Jul 15 '22 18:07 80LevelElf

@80LevelElf Does CancelExecution() solve your issue?

LittleLittleCloud avatar Jul 26 '22 19:07 LittleLittleCloud

@LittleLittleCloud hi!

Yes, looks like it does

80LevelElf avatar Jul 26 '22 19:07 80LevelElf

Closing as it looks like this was resolved. Please feel free to re-open if that's not the case.

tannergooding avatar Aug 25 '22 21:08 tannergooding