
Memory leak on Linux

Open deadman2000 opened this issue 4 years ago • 18 comments

Calling Session.close() does not release resources, and the TF threads are not stopped.

Example project demonstrating the thread count increase: https://github.com/deadman2000/TFNetMemoryLeak

It's not a TF bug; I tested a similar project in C: https://github.com/deadman2000/TFCThreadTest

It's a Linux-only problem. On Windows the resources are released successfully.
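A minimal sketch of the repro loop (an assumption of what the linked repository does: `"model/"` stands in for any SavedModel directory, and the full project is in the repository above):

```csharp
using System;
using System.Diagnostics;
using Tensorflow;

class ThreadLeakRepro
{
    static void Main()
    {
        while (true)
        {
            // Load and immediately tear down a SavedModel session.
            var session = Session.LoadFromSavedModel("model/"); // placeholder path
            session.close();

            // On Linux the native TF threads survive the close,
            // so this count grows on every iteration.
            Console.WriteLine($"Threads: {Process.GetCurrentProcess().Threads.Count}");

            Console.WriteLine("Press Q to break or any another to repeat");
            if (Console.ReadKey(true).Key == ConsoleKey.Q) break;
        }
    }
}
```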

deadman2000 avatar Oct 04 '19 16:10 deadman2000

I suspect the root of the problem is the use of ThreadLocal. The Graph is never disposed.

deadman2000 avatar Oct 10 '19 09:10 deadman2000

@deadman2000 You have to dispose the Graph explicitly.

Oceania2018 avatar Oct 10 '19 11:10 Oceania2018

Manually disposing the Graph does not help:

    session.graph.Dispose();
    session.close();

Test log:

2019-10-10 12:42:43.235019: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: model/
2019-10-10 12:42:43.238466: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2019-10-10 12:42:43.249303: I tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
2019-10-10 12:42:43.276624: I tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: model/
2019-10-10 12:42:43.284454: I tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 49441 microseconds.
Dispose tf.Tensor '<unnamed Operation>' shape=(1,2) dtype=TF_FLOAT
Dispose Tensor disposing:True _disposed:False
  TF_DeleteTensor
Dispose grap-key-45/, (39260016)
Dispose Graph disposing:True _disposed:False
  TF_DeleteGraph
Dispose Tensorflow.Session
Dispose Session disposing:True _disposed:False
DisposeUnmanagedResources
  TF_DeleteSession
Dispose Tensorflow.Status
Dispose Status disposing:True _disposed:False
  TF_DeleteStatus
End
Threads: 71
Press Q to break or any another to repeat
2019-10-10 12:42:43.928763: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: model/
2019-10-10 12:42:43.932001: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2019-10-10 12:42:43.941843: I tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
2019-10-10 12:42:43.969191: I tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: model/
2019-10-10 12:42:43.973776: I tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 45020 microseconds.
Dispose tf.Tensor '<unnamed Operation>' shape=(1,2) dtype=TF_FLOAT
Dispose Tensor disposing:True _disposed:False
  TF_DeleteTensor
Dispose grap-key-49/, (38424432)
Dispose Graph disposing:True _disposed:False
  TF_DeleteGraph
Dispose Tensorflow.Session
Dispose Session disposing:True _disposed:False
DisposeUnmanagedResources
  TF_DeleteSession
Dispose Tensorflow.Status
Dispose Status disposing:True _disposed:False
  TF_DeleteStatus
End
Threads: 74
Press Q to break or any another to repeat

deadman2000 avatar Oct 10 '19 12:10 deadman2000

This might also leak on Linux for an entirely different reason, but for me it happens on Windows: every time you run the program (press any key in the console program from the repository) memory builds up.

[screenshot: memory usage increasing after each run]

Only the following line causes the leak (with or without .as_default()):

var session = Session.LoadFromSavedModel(modelLocation);

Therefore the leak happens while loading a saved model and disposing it later on. @Oceania2018 To your attention.
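A sketch of how the build-up could be observed directly (assumptions: the model path is passed as an argument, and growth is measured via the process working set after a full GC so any remaining growth is native memory):

```csharp
using System;
using System.Diagnostics;
using Tensorflow;

class MemoryProbe
{
    static void Main(string[] args)
    {
        var modelLocation = args[0]; // path to a SavedModel directory
        var proc = Process.GetCurrentProcess();

        for (var i = 0; i < 100; i++)
        {
            var session = Session.LoadFromSavedModel(modelLocation);
            session.graph.Dispose();
            session.Dispose();

            // Force managed cleanup so any remaining growth is unmanaged.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            proc.Refresh();
            Console.WriteLine($"Iteration {i}: working set {proc.WorkingSet64 / (1024 * 1024)} MB");
        }
    }
}
```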

Nucs avatar Oct 16 '19 15:10 Nucs

I have the same problem on Linux.

gosha20777 avatar Jun 05 '21 20:06 gosha20777

I'm having a similar problem: I've got about 20 unit tests and memory isn't being completely returned after each one. I'm calling `_session.graph.Dispose(); _session.Dispose();`

I'm seeing slight growth after each test that loads a saved model. [screenshot: memory growth chart]

I am also using Session.LoadFromSavedModel(modelLocation);

ADH-LukeBollam avatar Jun 14 '21 09:06 ADH-LukeBollam

@gosha20777 @LukeBolly Could you PR a minimal runnable repro into https://github.com/SciSharp/TensorFlow.NET/tree/master/src/TensorFlowNet.Benchmarks/Leak?

Oceania2018 avatar Jun 14 '21 12:06 Oceania2018

@Oceania2018 I haven't had time to create a repro for you, but while debugging another issue I came across some leftovers from TensorFlow.NET in a memory dump taken after my TF process had finished and Dispose had been called on the graph and session.

[screenshot: memory dump showing leftover objects]

Confirming the objects are still in memory:

NumSharp: [screenshot: retained NumSharp objects]

TensorFlow.NET: [screenshot: retained TensorFlow.NET objects]

Could this be the UnmanagedMemoryBlocks not being released?
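For comparison, the standard .NET pattern for guaranteeing an unmanaged block is freed is a `SafeHandle` with a critical finalizer. This is a generic illustration of that pattern, not TensorFlow.NET's or NumSharp's actual implementation:

```csharp
using System;
using System.Runtime.InteropServices;

// Generic pattern: wrap an unmanaged allocation in a SafeHandle so the
// CLR's critical finalizer frees it even if Dispose() is never called.
sealed class NativeBufferHandle : SafeHandle
{
    public NativeBufferHandle(int bytes) : base(IntPtr.Zero, ownsHandle: true)
    {
        SetHandle(Marshal.AllocHGlobal(bytes));
    }

    public override bool IsInvalid => handle == IntPtr.Zero;

    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        return true;
    }
}

class Demo
{
    static void Main()
    {
        using (var buffer = new NativeBufferHandle(1024))
        {
            Console.WriteLine($"Allocated; invalid = {buffer.IsInvalid}");
        } // ReleaseHandle runs here, freeing the native memory deterministically.
    }
}
```

If an unmanaged block is held by a plain field rather than a handle like this, it stays alive as long as anything in the object graph references it, which matches the symptom in the dump above.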

ADH-LukeBollam avatar Aug 25 '21 07:08 ADH-LukeBollam

@LukeBolly Could you run it on the latest release? We've removed the NumSharp dependency.

Oceania2018 avatar Aug 25 '21 14:08 Oceania2018

Hi @Oceania2018, I've updated to the latest version. After running all of my unit tests and checking the managed memory at the end, there is still a large number of Tensorflow objects left in memory.

I've gone through my code and added Dispose() calls for all graphs, sessions, and NDArrays, but I still end up with this locked in memory:

[screenshot: retained objects after disposing everything]

ADH-LukeBollam avatar Aug 31 '21 04:08 ADH-LukeBollam

While I can't share the model I'm using, loading and disposing it in a loop confirms there is an issue cleaning up resources when working with a SavedModel.

public class TestModel
{
    public TestModel(string classifierModelPath)
    {
        for (var i = 0; i < 1000; i++)
        {
            var _classifierSession = Session.LoadFromSavedModel(classifierModelPath);

            _classifierSession.graph.Exit();
            _classifierSession.graph.Dispose();
            _classifierSession.Dispose();
        }
    }
}

[screenshot: memory growth over the loop]

ADH-LukeBollam avatar Aug 31 '21 08:08 ADH-LukeBollam

@LukeBolly Does this happen with any model? Do other models cause the same issue? Another way to release the resources might be tf.Context.reset_context().
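The suggested cleanup sequence would look roughly like this (a sketch; whether `reset_context` actually frees the SavedModel's native state is exactly what is in question):

```csharp
using Tensorflow;
using static Tensorflow.Binding;

static class Cleanup
{
    public static void LoadRunAndRelease(string modelPath)
    {
        var session = Session.LoadFromSavedModel(modelPath);
        // ... run inference here ...
        session.graph.Dispose();
        session.Dispose();

        // Proposed extra step: reset the global context so cached
        // native state tied to it can be reclaimed.
        tf.Context.reset_context();
    }
}
```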

Oceania2018 avatar Aug 31 '21 14:08 Oceania2018

Yep, I've run it again with a very simple model that is just a bunch of Conv layers and a call signature, and I'm seeing the same behavior. Adding tf.Context.reset_context() did not resolve the issue.

Here are all the operations in the model: [screenshot: operation list]

This model is small and initialises much faster, so the chart is smoother, but it's the same behavior. It's looped about 500 times here: [screenshot: memory chart]

I can probably get a repro up for you tomorrow if you need it.

ADH-LukeBollam avatar Aug 31 '21 14:08 ADH-LukeBollam

@LukeBolly It would be very helpful if you could create a runnable project to reproduce this issue.

Oceania2018 avatar Aug 31 '21 15:08 Oceania2018

I'll try to get a repro up for you tomorrow if I get time.

ADH-LukeBollam avatar Aug 31 '21 15:08 ADH-LukeBollam

@Oceania2018 Unfortunately the fix has broken LoadFromSavedModel entirely, see here:

[screenshot: LoadFromSavedModel failing after the fix]

The graph is disposed as soon as it loads, so it isn't usable in any way.


ADH-LukeBollam avatar Sep 08 '21 04:09 ADH-LukeBollam

@LukeBolly Sorry about that, I disposed the graph accidentally; I will fix it in a future release.


Oceania2018 avatar Sep 08 '21 11:09 Oceania2018

@Oceania2018 I've put up a PR that fixes the graph issue and extends the test to ensure it runs; it seems like there is still a small leak somewhere, though: #858

ADH-LukeBollam avatar Sep 14 '21 06:09 ADH-LukeBollam