machinelearning
Memory leak by Auto ML
System Information (please complete the following information):
- OS & Version: Windows 10
- ML.NET Version: ML.NET 2.0 preview (but have the same results when using 1.7.1 versions)
- .NET Version: .NET 5.0
Describe the bug
From time to time my Kubernetes pod hits an out-of-memory exception. According to Grafana, memory usage starts at 500 MB and reaches 6 GB after the pod has been processing the learning queue for some time.
I then tried to reproduce the problem on my local machine, and memory profiling appears to have found the cause:
// AutoML learning runs inside this call
await container.Resolve<IMlModelLearningPartTrainer>().Learn(Guid.Parse("43f60690-a594-4e8a-93bd-a91a2d836139"));
// After learning finishes, force a garbage collection
GC.Collect();
// The memory snapshot is taken here
And this is what I found:

It looks like AutoML uses a timer callback internally to stop the learning, but for some reason the timer is not disposed once learning has finished. The timer's lambda captures the outer learning context, which is why the GC can't collect the learning data.
Even 1 hour after the learning has ended, the learning data is still in memory. It's a memory leak.
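To illustrate the mechanism described above, here is a minimal, self-contained sketch. It is not the AutoML code: `TimerCaptureDemo`, `_deadlineTimer`, and `learningData` are hypothetical names, and the use of `System.Threading.Timer` is an assumption made for the illustration. It only shows that a timer callback closing over a large object keeps that object reachable for as long as the timer itself is held.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

static class TimerCaptureDemo
{
    // Hypothetical stand-in for a "max experiment time" timer held in a field.
    static Timer _deadlineTimer;

    // Kept in a separate, non-inlined method so that no local variable in Main
    // references the buffer directly.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static WeakReference StartTimerCapturingData()
    {
        var learningData = new byte[50 * 1024 * 1024]; // stand-in for learning data
        var tracker = new WeakReference(learningData);

        // The lambda closes over learningData, so the timer's delegate
        // holds a reference to it through the compiler-generated closure.
        _deadlineTimer = new Timer(
            _ => Console.WriteLine($"Deadline reached, {learningData.Length} bytes in use"),
            null,
            TimeSpan.FromHours(1),      // fires once, far in the future
            Timeout.InfiniteTimeSpan);  // no periodic signaling

        return tracker;
    }

    static void ForceFullCollection()
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }

    static void Main()
    {
        WeakReference tracker = StartTimerCapturingData();

        ForceFullCollection();
        Console.WriteLine($"Data alive while the timer is held: {tracker.IsAlive}");

        // Dispose the timer and drop the last reference to it.
        _deadlineTimer.Dispose();
        _deadlineTimer = null;

        ForceFullCollection();
        Console.WriteLine($"Data alive after Dispose and release: {tracker.IsAlive}");
    }
}
```

In a run of this sketch I would expect the buffer to stay alive while the timer is held and to become collectable once the timer is disposed and released, but garbage collection timing is not guaranteed, so treat the printed values as indicative rather than definitive.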
I hope these snippets of our code help. This is how we start the learning:
var settings = new BinaryExperimentSettings
{
    MaxExperimentTimeInSeconds = (uint) maxExperimentTime,
    OptimizingMetric = trainerAndMetric.Metric
};
settings.Trainers.Clear();
settings.Trainers.Add(trainerAndMetric.Trainer);
var experiment = mlContext.Auto().CreateBinaryClassificationExperiment(settings);
To Reproduce
Steps to reproduce the behavior:
- Start an AutoML learning for 5-10 minutes
- Wait for the learning to end
- Call GC.Collect
- Take a memory snapshot
- Check the snapshot: the learning data is still in memory
Expected behavior
No data leak :)
Additional context
I reproduced the problem with the 2.0 preview version of ML.NET and AutoML, which we upgraded to while trying to solve the memory problem. We get the same out-of-memory problem with both 2.0 preview and 1.7.1, so I assume 1.7.1 has the same memory leak.
Is there any temporary workaround?
@LittleLittleCloud any thoughts on this? Or how we could fix the timer so it won't hold the data?
@michaelgsharp @LittleLittleCloud
I am not 100% sure, but it looks like _maxExperimentTimeTimer has to be disposed in https://github.com/dotnet/machinelearning/blob/main/src/Microsoft.ML.AutoML/Experiment/Experiment.cs
Maybe the MaxExperimentTimeExpiredEvent method is a good place to call _maxExperimentTimeTimer.Dispose().
The last trial in AutoML sometimes doesn't get cancelled even after the time is up. You can force-cancel it by calling
context.CancelExecution()
See also https://github.com/dotnet/machinelearning/issues/6188 for further explanation.
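As a rough sketch of how this workaround might be wired in, assuming `context` above refers to the `MLContext` instance that created the experiment: the helper below runs a binary classification experiment and always calls `CancelExecution()` afterwards. `AutoMlRunner` and `RunAndCancel` are hypothetical names, not part of ML.NET, and using a fresh `MLContext` per run is a design choice of this sketch, since the thread does not say whether a context can be reused after cancellation.

```csharp
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

static class AutoMlRunner
{
    // Hypothetical helper: runs one AutoML experiment, then force-cancels
    // anything still executing on the context, as suggested in this thread.
    public static ExperimentResult<BinaryClassificationMetrics> RunAndCancel(
        IDataView trainData,
        uint maxExperimentTimeInSeconds,
        string labelColumnName = "Label")
    {
        // Assumption: one MLContext per experiment run.
        var mlContext = new MLContext();

        var settings = new BinaryExperimentSettings
        {
            MaxExperimentTimeInSeconds = maxExperimentTimeInSeconds
        };

        var experiment = mlContext.Auto().CreateBinaryClassificationExperiment(settings);

        try
        {
            return experiment.Execute(trainData, labelColumnName);
        }
        finally
        {
            // Cancel a last trial that may not have stopped when the time ran out.
            mlContext.CancelExecution();
        }
    }
}
```

Placing the call in `finally` means it also runs when `Execute` throws, so a failed experiment does not skip the cancellation.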
Thank you!
@80LevelElf
Does CancelExecution() solve your issue?
@LittleLittleCloud hi!
Yes, looks like it does
Closing as it looks like this was resolved. Please feel free to re-open if that's not the case.