
Memory leak on Linux

Open deadman2000 opened this issue 4 years ago • 18 comments

Calling Session.close() does not release resources, and the TF threads are not stopped.

Example project demonstrating the thread count increase: https://github.com/deadman2000/TFNetMemoryLeak

It's not a TF bug; I tested a similar project in C: https://github.com/deadman2000/TFCThreadTest

It's a Linux-only problem. On Windows the resources are released successfully.
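A minimal sketch of the repro loop (an assumption of what the linked repository does: `"model/"` stands in for any SavedModel directory, and the full project is in the repository above):

```csharp
using System;
using System.Diagnostics;
using Tensorflow;

class ThreadLeakRepro
{
    static void Main()
    {
        while (true)
        {
            // Load and immediately tear down a SavedModel session.
            var session = Session.LoadFromSavedModel("model/"); // placeholder path
            session.close();

            // On Linux the native TF threads survive the close,
            // so this count grows on every iteration.
            Console.WriteLine($"Threads: {Process.GetCurrentProcess().Threads.Count}");

            Console.WriteLine("Press Q to break or any another to repeat");
            if (Console.ReadKey(true).Key == ConsoleKey.Q) break;
        }
    }
}
```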

deadman2000 avatar Oct 04 '19 16:10 deadman2000

I suspect the root of the problem is the use of ThreadLocal. The Graph is never disposed.

deadman2000 avatar Oct 10 '19 09:10 deadman2000

@deadman2000 You have to dispose the Graph explicitly.

Oceania2018 avatar Oct 10 '19 11:10 Oceania2018

Manually disposing the Graph does not help:

    session.graph.Dispose();
    session.close();

Test log:

2019-10-10 12:42:43.235019: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: model/
2019-10-10 12:42:43.238466: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2019-10-10 12:42:43.249303: I tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
2019-10-10 12:42:43.276624: I tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: model/
2019-10-10 12:42:43.284454: I tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 49441 microseconds.
Dispose tf.Tensor '<unnamed Operation>' shape=(1,2) dtype=TF_FLOAT
Dispose Tensor disposing:True _disposed:False
  TF_DeleteTensor
Dispose grap-key-45/, (39260016)
Dispose Graph disposing:True _disposed:False
  TF_DeleteGraph
Dispose Tensorflow.Session
Dispose Session disposing:True _disposed:False
DisposeUnmanagedResources
  TF_DeleteSession
Dispose Tensorflow.Status
Dispose Status disposing:True _disposed:False
  TF_DeleteStatus
End
Threads: 71
Press Q to break or any another to repeat
2019-10-10 12:42:43.928763: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: model/
2019-10-10 12:42:43.932001: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2019-10-10 12:42:43.941843: I tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
2019-10-10 12:42:43.969191: I tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: model/
2019-10-10 12:42:43.973776: I tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 45020 microseconds.
Dispose tf.Tensor '<unnamed Operation>' shape=(1,2) dtype=TF_FLOAT
Dispose Tensor disposing:True _disposed:False
  TF_DeleteTensor
Dispose grap-key-49/, (38424432)
Dispose Graph disposing:True _disposed:False
  TF_DeleteGraph
Dispose Tensorflow.Session
Dispose Session disposing:True _disposed:False
DisposeUnmanagedResources
  TF_DeleteSession
Dispose Tensorflow.Status
Dispose Status disposing:True _disposed:False
  TF_DeleteStatus
End
Threads: 74
Press Q to break or any another to repeat

deadman2000 avatar Oct 10 '19 12:10 deadman2000

This might also leak on Linux for an entirely different reason, but for me it happens on Windows: every time you run the program (press any key in the console program from the repository) memory builds up.

[screenshot: memory usage increasing after each run]

Only the following line causes the leak (with or without .as_default()):

var session = Session.LoadFromSavedModel(modelLocation);

Therefore the leak happens while loading a saved model and disposing it later on. @Oceania2018 To your attention.
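A sketch of how the build-up could be observed directly (assumptions: the model path is passed as an argument, and growth is measured via the process working set after a full GC so any remaining growth is native memory):

```csharp
using System;
using System.Diagnostics;
using Tensorflow;

class MemoryProbe
{
    static void Main(string[] args)
    {
        var modelLocation = args[0]; // path to a SavedModel directory
        var proc = Process.GetCurrentProcess();

        for (var i = 0; i < 100; i++)
        {
            var session = Session.LoadFromSavedModel(modelLocation);
            session.graph.Dispose();
            session.Dispose();

            // Force managed cleanup so any remaining growth is unmanaged.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            proc.Refresh();
            Console.WriteLine($"Iteration {i}: working set {proc.WorkingSet64 / (1024 * 1024)} MB");
        }
    }
}
```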

Nucs avatar Oct 16 '19 15:10 Nucs

I have the same problem on Linux.

gosha20777 avatar Jun 05 '21 20:06 gosha20777

I'm having a similar problem: I've got about 20 unit tests and memory isn't being completely returned after each one. I'm calling `_session.graph.Dispose(); _session.Dispose();`

I'm seeing slight growth after each test that loads a saved model. [screenshot: memory growth chart]

I am also using Session.LoadFromSavedModel(modelLocation);

ADH-LukeBollam avatar Jun 14 '21 09:06 ADH-LukeBollam

@gosha20777 @LukeBolly Could you PR a minimal runnable repro into https://github.com/SciSharp/TensorFlow.NET/tree/master/src/TensorFlowNet.Benchmarks/Leak?

Oceania2018 avatar Jun 14 '21 12:06 Oceania2018

@Oceania2018 I haven't had time to create a repro for you, but while debugging another issue I came across some leftovers from TensorFlow.NET in a memory dump taken after my TF process had finished and Dispose had been called on the graph and session.

[screenshot: memory dump showing leftover objects]

Confirming the objects are still in memory:

NumSharp: [screenshot: retained NumSharp objects]

TensorFlow.NET: [screenshot: retained TensorFlow.NET objects]

Could this be the UnmanagedMemoryBlocks not being released?
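For comparison, the standard .NET pattern for guaranteeing an unmanaged block is freed is a `SafeHandle` with a critical finalizer. This is a generic illustration of that pattern, not TensorFlow.NET's or NumSharp's actual implementation:

```csharp
using System;
using System.Runtime.InteropServices;

// Generic pattern: wrap an unmanaged allocation in a SafeHandle so the
// CLR's critical finalizer frees it even if Dispose() is never called.
sealed class NativeBufferHandle : SafeHandle
{
    public NativeBufferHandle(int bytes) : base(IntPtr.Zero, ownsHandle: true)
    {
        SetHandle(Marshal.AllocHGlobal(bytes));
    }

    public override bool IsInvalid => handle == IntPtr.Zero;

    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        return true;
    }
}

class Demo
{
    static void Main()
    {
        using (var buffer = new NativeBufferHandle(1024))
        {
            Console.WriteLine($"Allocated; invalid = {buffer.IsInvalid}");
        } // ReleaseHandle runs here, freeing the native memory deterministically.
    }
}
```

If an unmanaged block is held by a plain field rather than a handle like this, it stays alive as long as anything in the object graph references it, which matches the symptom in the dump above.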

ADH-LukeBollam avatar Aug 25 '21 07:08 ADH-LukeBollam

@LukeBolly Could you run it on the latest release? We've removed the NumSharp dependency.

Oceania2018 avatar Aug 25 '21 14:08 Oceania2018

Hi @Oceania2018, I've updated to the latest version. After running all of my unit tests and checking the managed memory at the end, there is still a large number of Tensorflow objects left in memory.

I've gone through my code and added Dispose() calls for all graphs, sessions, and NDArrays, but I still end up with this locked in memory:

[screenshot: retained objects after disposing everything]

ADH-LukeBollam avatar Aug 31 '21 04:08 ADH-LukeBollam

While I can't share the model I'm using, loading and disposing it in a loop confirms there is an issue cleaning up resources when working with a SavedModel.

public class TestModel
{
    public TestModel(string classifierModelPath)
    {
        for (var i = 0; i < 1000; i++)
        {
            var _classifierSession = Session.LoadFromSavedModel(classifierModelPath);

            _classifierSession.graph.Exit();
            _classifierSession.graph.Dispose();
            _classifierSession.Dispose();
        }
    }
}

[screenshot: memory growth over the loop]

ADH-LukeBollam avatar Aug 31 '21 08:08 ADH-LukeBollam

@LukeBolly Does this happen with any model? Do other models cause the same issue? Another way to release the resources might be tf.Context.reset_context().
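The suggested cleanup sequence would look roughly like this (a sketch; whether `reset_context` actually frees the SavedModel's native state is exactly what is in question):

```csharp
using Tensorflow;
using static Tensorflow.Binding;

static class Cleanup
{
    public static void LoadRunAndRelease(string modelPath)
    {
        var session = Session.LoadFromSavedModel(modelPath);
        // ... run inference here ...
        session.graph.Dispose();
        session.Dispose();

        // Proposed extra step: reset the global context so cached
        // native state tied to it can be reclaimed.
        tf.Context.reset_context();
    }
}
```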

Oceania2018 avatar Aug 31 '21 14:08 Oceania2018

Yep, I've run it again with a very simple model that is just a bunch of Conv layers and a call signature, and I'm seeing the same behavior. Adding tf.Context.reset_context() did not resolve the issue.

Here are all the operations in the model: [screenshot: operation list]

This model is small and initialises much faster, so the chart is smoother, but it's the same behavior. It's looped about 500 times here: [screenshot: memory chart]

I can probably get a repro up for you tomorrow if you need it.

ADH-LukeBollam avatar Aug 31 '21 14:08 ADH-LukeBollam

@LukeBolly It would be very helpful if you could create a runnable project to reproduce this issue.

Oceania2018 avatar Aug 31 '21 15:08 Oceania2018

I'll try to get a repro up for you tomorrow if I get time.

ADH-LukeBollam avatar Aug 31 '21 15:08 ADH-LukeBollam

@Oceania2018 Unfortunately the fix has broken LoadFromSavedModel entirely, see here:

[screenshot: LoadFromSavedModel failing after the fix]

The graph is disposed as soon as it loads, so it isn't usable in any way.


ADH-LukeBollam avatar Sep 08 '21 04:09 ADH-LukeBollam

@LukeBolly Sorry about that, I disposed the graph accidentally; I will fix it in a future release.


Oceania2018 avatar Sep 08 '21 11:09 Oceania2018

@Oceania2018 I've put up a PR that fixes the graph issue and extends the test to ensure it runs; it seems like there is still a small leak somewhere, though: #858

ADH-LukeBollam avatar Sep 14 '21 06:09 ADH-LukeBollam