
Memory leak when using ML.NET and TensorFlow.NET.

Open ddobric opened this issue 2 years ago • 6 comments

System Information

  • Linux Docker container running on Windows 10 / 11 or in ACI
  • ML.NET v1.4.0
  • .NET Version: .NET 6.0

Problem description

We have an application that performs image classification with TensorFlow, based on the sample provided with ML.NET. The application executes the following code, which starts and runs correctly but never completes when running in a Linux Docker container on a Windows 10 host.

When the container is started from the command line, it simply exits without showing any error. However, when we run the container in Visual Studio (F5), the following error appears in the output window:

TensorFlow .. meta_optimizer.cc 499 ..model_runner exceeded deadline

Here is the simplified code that fails (crashes) at line (N+0):

try
{
    /* (N+0) */ var cvResults = mlContext.MulticlassClassification.CrossValidate(
        cvDataView, pipeline, numberOfFolds: numOfFolds,
        labelColumnName: "LabelAsKey", seed: 8881);

    /* (N+1) */ Console.Write("Never executed");
}
catch
{
    /* (N+2) */ // never reached either
}

The container simply exits without throwing any exception; lines (N+1) and (N+2) are never reached.

[image: screenshot attached to the original issue showing execution times]

The same code works well when executed directly on the Windows host. We have also been able to execute it in a Linux container on Windows 11. We tested on multiple machines with identical environment setups. On some hosts the same code sometimes runs correctly, although execution takes longer (see the image above).

Recap

Question: Is there a way to set the timeout for TensorFlow via ML.NET?

However, this behaviour raises a further question: is the .NET + Docker + TensorFlow stack reliable at all if it stops execution without showing any error?

ddobric avatar Nov 17 '21 13:11 ddobric

In the meantime, we have investigated this issue and found that ML.NET in combination with TensorFlow has a serious problem when executing in a Docker container. The application we implemented works perfectly when running directly on Windows or Linux. We have tested it on various configurations (2-8 cores, 4-32 GB RAM).

Unfortunately, when running the same application in a Docker container, both in ACI and locally (Docker Desktop on Windows), the host very often terminates the application without any error.

Before being terminated, the application traces the following:

Saver not created because there are no variables in the graph to restore
Restoring parameters from /tmp/sqkk5fg1.fif/custom_retrained_model_based_on_inception_v3.meta
Froze 2 variables.
Converted 2 variables to const ops.

I guess this is coming from TensorFlow.

When the application is terminated the ACI host exits with the following status:

"exitCode": 137,
"detailStatus": "OOMKilled"
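Exit code 137 here is not specific to ML.NET or ACI: by shell convention, a process killed by a signal exits with 128 plus the signal number, and SIGKILL (signal 9) is what the Linux kernel's OOM killer sends when a container exceeds its memory limit. A quick sanity check of that arithmetic:

```shell
# Exit code convention: 128 + signal number.
# SIGKILL is signal 9, so an OOM-killed container reports 128 + 9 = 137.
OOM_EXIT=$((128 + 9))
echo "$OOM_EXIT"   # prints 137
```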

The application is terminated in 30-50% of cases; that is, it sometimes completes successfully. We have tested this with every possible configuration (--cpu, --memory) in ACI. None of them works, because ML.NET + TensorFlow tends to consume all available memory. Once consumption reaches a limit the host does not tolerate, the container is terminated. When running the application directly on the OS (no container virtualization), the OS obviously does a better job than the container orchestrator. The solution seems to be to take control of the RAM consumption of ML.NET + TensorFlow and cap it at some value below what the container has available, for example:

container available memory - 10%

Right now, it looks like ML.NET in the described TensorFlow scenario is not usable in containers.
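One way to sketch the "container available memory - 10%" idea is with the .NET runtime's GC hard-limit setting. This is a hedged suggestion, not a verified fix for this issue: `DOTNET_GCHeapHardLimit` caps only the managed GC heap (as a hex byte count), while TensorFlow's native allocations live outside the GC, so this alone may not prevent the OOM kill. The container size (`4g`) below is purely illustrative:

```shell
# Sketch: cap the .NET GC heap at ~90% of the container's memory limit.
# DOTNET_GCHeapHardLimit expects a hexadecimal byte count. Note it bounds
# only the managed heap, not TensorFlow's native allocations.
CONTAINER_MEM_BYTES=$((4 * 1024 * 1024 * 1024))   # matches --memory=4g below
GC_LIMIT=$((CONTAINER_MEM_BYTES * 90 / 100))      # container memory - 10%
GC_LIMIT_HEX=$(printf '0x%X' "$GC_LIMIT")
echo "$GC_LIMIT_HEX"

# Hypothetical invocation (image name is a placeholder):
# docker run --memory=4g -e DOTNET_GCHeapHardLimit="$GC_LIMIT_HEX" my-mlnet-app
```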

ddobric avatar Nov 24 '21 07:11 ddobric

@ddobric I'm trying to look into this. Would you be able to share a simple repro? How much memory are you giving your containers? From the last paragraph it looks like, as long as you limit the RAM consumption, it will run to completion; is that correct?

michaelgsharp avatar Jan 07 '22 19:01 michaelgsharp

@michaelgsharp, once I have a simplified application that shows this issue, I will come back to this. The current application is just too complex, and it would take me a long time to produce a simplified repro. You are right about limiting memory consumption. However, the problem is that TensorFlow + ML.NET will eventually reach any possible limit in the universe.

For example, if you set --memory to 10g, it will take some time and then exceed it, and the host will terminate the container with exit code 137. If you then set --memory to 100g, the same thing happens, just roughly ten times later. That is the issue: there is no value you can set that makes it work.

The interplay between the host and TensorFlow + ML.NET behaves like a memory leak.

ddobric avatar May 02 '22 07:05 ddobric

I see. So basically we have a memory leak somewhere between our ML.NET and TensorFlow.NET interactions. Let me update the title of this issue and mark this for further investigation.

michaelgsharp avatar May 02 '22 23:05 michaelgsharp

Do we have an update on that issue?

julianogimenez avatar Jan 11 '24 11:01 julianogimenez