
SavedModelBundle memory leak

Open lyubortk opened this issue 4 years ago • 36 comments

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04.5 LTS
  • TensorFlow installed from (source or binary): Included in java artifact
  • TensorFlow version: 2.4.1

Describe the current behavior

Loading and closing SavedModelBundle in an infinite loop leads to OutOfMemoryError:

Exception in thread "main" java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (7828M) > maxPhysicalBytes (7828M)
	at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:695)
	at org.tensorflow.internal.c_api.AbstractTF_Status.newStatus(AbstractTF_Status.java:70)
	at org.tensorflow.SavedModelBundle.load(SavedModelBundle.java:401)
	at org.tensorflow.SavedModelBundle.access$000(SavedModelBundle.java:59)
	at org.tensorflow.SavedModelBundle$Loader.load(SavedModelBundle.java:68)
	at org.tensorflow.SavedModelBundle.load(SavedModelBundle.java:242)
	at com.mycompany.app.App.main(App.java:13)

I use version 0.3.1 of the library. It is the only dependency:

<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>tensorflow-core-platform</artifactId>
  <version>0.3.1</version>
</dependency>

The memory leak is reproducible with a model from this repository: https://github.com/tensorflow/java/tree/master/tensorflow-core/tensorflow-core-api/src/test/resources/saved_model

Describe the expected behavior

I assume that the following code should run endlessly

Code to reproduce the issue

package com.mycompany.app;

import org.tensorflow.SavedModelBundle;
import java.net.URISyntaxException;
import java.nio.file.Paths;

public class App {
    public static void main(String[] args) throws URISyntaxException {
        var modelPath = Paths.get(App.class.getResource("/saved_model").toURI()).toString();
        while (true) {
            var model = SavedModelBundle.load(modelPath, "serve");
            model.close();
        }
    }
}

lyubortk avatar Apr 27 '21 15:04 lyubortk

I can replicate this with a suitably small -Xmx setting. Using 256m seems to make it happen within 1000 iterations. However it doesn't look like there are any Java objects sticking around, and the load and close paths look like they are all scoping the pointers appropriately, so I'm not sure where to start looking. For that error to come out of JavaCPP something must be holding onto a reference somewhere. I poked at it with JFR and couldn't see any objects that were sticking around, but I admit I'm not as familiar with that as other profiling tools.

@saudet any ideas on where we begin to run this down?
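
One way to start running it down (editor's sketch, not code from this thread): log the memory JavaCPP is tracking against the process resident set size on each iteration of the reproducer, using the static counters that org.bytedeco.javacpp.Pointer exposes. If totalBytes() stays flat while physicalBytes() keeps growing, the growth comes from native allocations JavaCPP doesn't know about. The class name LeakProbe is made up for the example.

import org.bytedeco.javacpp.Pointer;
import org.tensorflow.SavedModelBundle;

public class LeakProbe {
    public static void main(String[] args) {
        String modelPath = args[0]; // path to the saved_model directory
        for (int i = 0; ; i++) {
            SavedModelBundle model = SavedModelBundle.load(modelPath, "serve");
            model.close();
            if (i % 100 == 0) {
                // totalBytes(): native bytes currently tracked by JavaCPP deallocators
                // physicalBytes(): resident set size of the whole process
                // maxPhysicalBytes(): the limit that triggers JavaCPP's OutOfMemoryError
                System.out.printf("iter=%d tracked=%d rss=%d limit=%d%n",
                        i, Pointer.totalBytes(), Pointer.physicalBytes(), Pointer.maxPhysicalBytes());
            }
        }
    }
}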

Craigacp avatar Apr 27 '21 16:04 Craigacp

I'd really like to get access to that CppMemoryChecker...

saudet avatar Apr 27 '21 23:04 saudet

@lyubortk Are you observing this leak in real code? When running the example in VisualVM for a fixed number of iterations the heap ends up clean after those iterations have finished with all the TF referring objects collected, so it might be that in the tight loop you're getting out ahead of the tracking thread that JavaCPP uses to check how much memory has been allocated, causing it to throw a transient OOM.

I'll try and run it in valgrind to see if we're losing things on the C side somewhere.

Also, what Java version & GC algorithm are you using? I'm running tests on Java 16, which has a bunch of differences in GC behaviour and how it releases memory back to the OS vs Java 8, so I guess the resident set size check that JavaCPP is doing could be behaving differently for you.

Edit: I do observe the resident set size increasing over time, which is odd. Trying to run it down.

Craigacp avatar Apr 28 '21 13:04 Craigacp

@saudet , can you explain a bit more how the memory allocated by a BytePointer or a PointerPointer, like the ones instantiated here, is deallocated?

They seem to have no deallocator assigned to them and we can't trace where and how the buffer they allocate will be freed, any guidance would be welcomed, thanks

karllessard avatar Apr 28 '21 14:04 karllessard

@saudet , can you explain a bit more how the memory allocated by a BytePointer or a PointerPointer, like the ones instantiated here, is deallocated?

They seem to have no deallocator assigned to them and we can't trace where and how the buffer they allocate will be freed, any guidance would be welcomed, thanks

They do have deallocators, which get called by the PointerScope. You can call toString() on them to check for yourself.
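
For readers unfamiliar with the mechanism, here is a minimal sketch (editor's addition, not part of the original discussion) of how PointerScope ties pointer lifetimes to a lexical scope; the BytePointer is just an arbitrary JavaCPP pointer used for illustration.

import org.bytedeco.javacpp.BytePointer;
import org.bytedeco.javacpp.PointerScope;

public class ScopeDemo {
    public static void main(String[] args) {
        try (PointerScope scope = new PointerScope()) {
            // Pointers created while the scope is open are attached to it.
            BytePointer p = new BytePointer(1024); // allocates 1 KB of native memory
            System.out.println(p); // toString() shows the address, capacity and deallocator
        } // closing the scope calls the deallocator of every attached pointer immediately
    }
}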

saudet avatar Apr 28 '21 15:04 saudet

Looks like valgrind found some issues in the C API, and as only TF Java and TF Go use that, they might not be checking for it in regular TF builds, so I opened an issue on tensorflow/tensorflow.

Craigacp avatar Apr 28 '21 16:04 Craigacp

@lyubortk Are you observing this leak in real code?

Yes, I do experience this problem in real code. A new model is loaded approximately once an hour, so it seems the problem is not related to the JavaCPP tracking thread lagging behind.

Also, what Java version & GC algorithm are you using?

I use OpenJDK 15 and the G1 GC.

lyubortk avatar Apr 28 '21 16:04 lyubortk

Ok, we'll see what they say about the leaks valgrind found in the C API. It seems like the problem (or at least some of it) is there.

Craigacp avatar Apr 28 '21 17:04 Craigacp

Hi folks, have there been any updates on this? We seem to be experiencing a similar issue in a slightly different setup: pods running TF Java code get OOMKilled by k8s right after a new model is loaded and the old one is closed.

This was happening continuously when the heap was at 70% of the total available memory; when we reduced it to 50%, the issue started occurring less frequently but still wasn't completely gone. Please let me know if you need more information on the use case and environment.

dennisTS avatar Jul 05 '21 11:07 dennisTS

Try the latest snapshot; we fixed a leak in the saved model bundle there. Otherwise, what JDK and GC algorithm are you using? There's a known issue with JavaCPP's memory accounting and the ZGC collector which causes spurious OOMs.

Craigacp avatar Jul 05 '21 12:07 Craigacp

Thank you, @Craigacp, will try it and get back here.

For reference, we're on OpenJDK 11.0.11 and using G1

dennisTS avatar Jul 05 '21 18:07 dennisTS

A quick update from the field: our app seems to be handling model reload properly when both of the conditions below are true:

  1. TF version used is 0.4.0-SNAPSHOT
  2. Heap doesn't occupy more than 50% of the memory available for the pod

Some more details about our use case: it's a real-time application handling up to 600 RPS per pod; each request results in calls to org.tensorflow.Session.Runner#run for 2 different models (out of 4 loaded into memory in total), and each model is fed (org.tensorflow.Session.Runner#feed) up to 100 input tensors per request.

On model reload we first do SavedModelBundle.load(s3Path) for the new models, stop routing requests to the old models, and then call org.tensorflow.SavedModelBundle#close on the old models (sketched below). That seems to be exactly the time when the pods were getting OOMKilled.

Additional note: the issue was only occurring under load
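
A rough sketch of that reload pattern (editor's addition; the ModelHolder class and the AtomicReference swap are illustrative, not the poster's actual code):

import java.util.concurrent.atomic.AtomicReference;
import org.tensorflow.SavedModelBundle;

public class ModelHolder {
    private final AtomicReference<SavedModelBundle> current = new AtomicReference<>();

    // Request handlers read the current bundle; reloads swap it atomically.
    public SavedModelBundle get() {
        return current.get();
    }

    public void reload(String newModelPath) {
        SavedModelBundle fresh = SavedModelBundle.load(newModelPath, "serve"); // peak: old + new model both resident
        SavedModelBundle old = current.getAndSet(fresh); // stop routing requests to the old model
        if (old != null) {
            // In a real service, in-flight requests still using 'old' must drain before this close().
            old.close(); // frees the old model's native memory
        }
    }
}

During the swap both models are resident in native memory at once, which is the peak the pod sizing has to accommodate.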

dennisTS avatar Jul 07 '21 14:07 dennisTS

Are they dying due to the JVM throwing OutOfMemoryError (either from JavaCPP or the JVM itself), or because k8s decided that pod was over the memory limit?

Also it sounds like from your use case that there are times where you could have 8 models loaded into a single running JVM. Is that right?

Craigacp avatar Jul 07 '21 14:07 Craigacp

Are they dying due to the JVM throwing OutOfMemoryError (either from JavaCPP or the JVM itself), or because k8s decided that pod was over the memory limit?

Also it sounds like from your use case that there are times where you could have 8 models loaded into a single running JVM. Is that right?

I think @dennisTS meant "4 models loaded in memory" to cover the case where both models are reloaded simultaneously?

That can still require a lot of memory depending on the size of your model, are you sure the pod has plenty of free memory space to handle this peak?

karllessard avatar Jul 07 '21 15:07 karllessard

Are they dying due to the JVM throwing OutOfMemoryError (either from JavaCPP or the JVM itself), or because k8s decided that pod was over the memory limit?

There is no OOM from the JVM; it's the k8s OOM killer.

Also it sounds like from your use case that there are times where you could have 8 models loaded into a single running JVM. Is that right?

We reload them one by one, so in theory it should be maximum 5 - but it might depend on GC and the behaviour of org.tensorflow.SavedModelBundle#close

That can still require a lot of memory depending on the size of your model, are you sure the pod has plenty of free memory space to handle this peak?

Well, the pod has 13G available in total and 50% of it is dedicated to heap - in this setup reloads seem to be happening without issues (on SNAPSHOT). But as soon as we increase heap size to 70%, the pod starts getting OOMKilled when reloading all models

dennisTS avatar Jul 08 '21 08:07 dennisTS

Are they dying due to the JVM throwing OutOfMemoryError (either from JavaCPP or the JVM itself), or because k8s decided that pod was over the memory limit?

There is no OOM from the JVM; it's the k8s OOM killer.

Also it sounds like from your use case that there are times where you could have 8 models loaded into a single running JVM. Is that right?

We reload them one by one, so in theory it should be maximum 5 - but it might depend on GC and the behaviour of org.tensorflow.SavedModelBundle#close

That can still require a lot of memory depending on the size of your model, are you sure the pod has plenty of free memory space to handle this peak?

Well, the pod has 13G available in total and 50% of it is dedicated to heap - in this setup reloads seem to be happening without issues (on SNAPSHOT). But as soon as we increase heap size to 70%, the pod starts getting OOMKilled when reloading all models

Ok. TF-Java stores all tensors and models off-heap, since they are allocated by the TF native runtime. So this is probably a normal allocation pattern rather than a leak: you need to have sufficient memory free in the pod for the TF native allocations, and when you set the heap size too high the JVM takes more of the memory, so the k8s OOM killer triggers when there is a large native allocation. With 50% of the RAM used for the Java heap there's probably enough room for all the native allocations and you stay under the threshold.

You don't mention how big the models, input tensors or output tensors are, but you should be able to calculate how much extra RAM those will take (along with some temporary memory for the activations as things propagate) and then factor it into your pod & JVM sizing.
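
As a rough worked example with the numbers from this thread: a 13G pod with a 6.5G heap (50%) leaves about 6.5G for the TF native runtime plus metaspace, thread stacks and direct buffers, while a 9.1G heap (70%) leaves only about 3.9G, so temporarily holding both the old and new copies of several models during a reload can push the pod over its limit.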

Craigacp avatar Jul 08 '21 13:07 Craigacp

Thank you, @Craigacp, that makes sense

dennisTS avatar Jul 09 '21 10:07 dennisTS

Hi, any update on this? I'm still having this issue with version 0.4.0. In the discussion at https://github.com/tensorflow/tensorflow/issues/48802 they said the issue may have been solved in the latest release.

Regards

torito avatar Mar 02 '22 11:03 torito

Could you describe your issue in more detail (e.g. trigger for the leak, size of it, which JVM, platform, any JVM GC configuration parameters)? The first issue mentioned in this thread was fixed some time ago, and the second issue was due to an interaction between k8s and the JVM's memory management; it wasn't a TF-Java issue. We're not aware of any leaks in the current version of TF-Java.

Craigacp avatar Mar 02 '22 14:03 Craigacp

Recently I ran into some interesting behavior. We deploy a k8s app with TF Java that loads an existing TF model through SavedModelBundle.load; later we load a new TF model version with the same approach and then call close on the old SavedModelBundle. I found that memory usage increased, as if the native memory was not reclaimed. But when I trigger a full GC, a large portion of memory is reclaimed, so it seems the G1 GC we use doesn't feel any heap memory pressure and therefore doesn't reclaim the memory in time.

Originally I thought that if I call the close method on the TF SavedModelBundle object, the native memory would be explicitly freed right after the call, but that doesn't seem to be the case; it seems the memory is only freed after a full GC. So I am curious about the behavior here. Could anybody help explain a bit? Thanks in advance!

zhenglaizhang avatar Jul 29 '22 08:07 zhenglaizhang

The close() method should reclaim the native memory right after the call. If that's not the case, we need to investigate. So you create no other session than the one coming with the saved model bundle?
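
For reference, a minimal sketch of the intended lifecycle (editor's addition; SavedModelBundle implements AutoCloseable, so try-with-resources also works):

import org.tensorflow.SavedModelBundle;

public class CloseDemo {
    public static void main(String[] args) {
        // The bundle owns its graph and session; closing it releases their
        // native memory deterministically, without waiting for a GC.
        try (SavedModelBundle bundle = SavedModelBundle.load(args[0], "serve")) {
            // bundle.session() is owned by the bundle; do not close it separately.
            // Any Tensor produced by running the session must also be closed explicitly.
        }
        // close() has run here; the model's native memory should already be freed.
    }
}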

karllessard avatar Jul 29 '22 17:07 karllessard

@karllessard thanks for the quick clarification! Yes, I didn't create any other session than the one from modelBundle.session(). Btw, I don't know exactly how to inspect TF's native memory; I only concluded that the native memory is not deallocated because memory still increases when the new model is loaded and I then call oldBundle.close(). So is there some systematic way to check native memory allocation in the TF native runtime?

zhenglaizhang avatar Aug 01 '22 02:08 zhenglaizhang

If you set the "org.bytedeco.javacpp.logger.debug" system property to "true", you'll get messages on the console when deallocation happens.
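
A small sketch of enabling that (editor's addition): the property has to be in effect before any JavaCPP class is loaded, so passing it as a JVM flag is the safest option; setting it programmatically at the very start of main should also work under that assumption.

public class DebugLogging {
    public static void main(String[] args) {
        // Equivalent to passing -Dorg.bytedeco.javacpp.logger.debug=true on the command line;
        // must run before any org.bytedeco.javacpp or org.tensorflow class is loaded.
        System.setProperty("org.bytedeco.javacpp.logger.debug", "true");
        // ... load and close the SavedModelBundle here; "Registering"/"Releasing"
        // messages for each native pointer will then be printed to the console ...
    }
}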

saudet avatar Aug 01 '22 03:08 saudet

Great, thanks @saudet, I will give it a try and update the thread soon!

zhenglaizhang avatar Aug 01 '22 05:08 zhenglaizhang

I added the flag, and it seems the debug output for loading the new model and releasing the resources of the old model is interleaved. Below is the output around the call to oldBundle.close:

Debug: Closing org.bytedeco.javacpp.PointerScope@462cab3d
Debug: Releasing org.bytedeco.javacpp.Pointer$DeallocatorReference[deallocator=org.tensorflow.internal.c_api.AbstractTF_Status$DeleteDeallocator[address=0x7fd332d3f960,position=0,limit=0,capacity=0,deallocator=null],count=0]
Debug: Releasing org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x7fd33fda20a0,deallocatorAddress=0x7fd8c1469fe0]
Debug: Collecting org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x0,deallocatorAddress=0x0]
Debug: Opening org.bytedeco.javacpp.PointerScope@23400c5c
Debug: Registering org.tensorflow.internal.c_api.TF_Output[address=0x7fd33fda20a0,position=0,limit=1,capacity=1,deallocator=org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x7fd33fda20a0,deallocatorAddress=0x7fd8c1469fe0]]
Debug: Attaching org.tensorflow.internal.c_api.TF_Output[address=0x7fd33fda20a0,position=0,limit=1,capacity=1,deallocator=org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x7fd33fda20a0,deallocatorAddress=0x7fd8c1469fe0]] to org.bytedeco.javacpp.PointerScope@23400c5c
Debug: Closing org.bytedeco.javacpp.PointerScope@23400c5c
Debug: Releasing org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x7fd33fda20a0,deallocatorAddress=0x7fd8c1469fe0]
Debug: Collecting org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x0,deallocatorAddress=0x0]
Debug: Opening org.bytedeco.javacpp.PointerScope@36f8e30
Debug: Registering org.tensorflow.internal.c_api.TF_Output[address=0x7fd33fda20a0,position=0,limit=1,capacity=1,deallocator=org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x7fd33fda20a0,deallocatorAddress=0x7fd8c1469fe0]]
Debug: Attaching org.tensorflow.internal.c_api.TF_Output[address=0x7fd33fda20a0,position=0,limit=1,capacity=1,deallocator=org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x7fd33fda20a0,deallocatorAddress=0x7fd8c1469fe0]] to org.bytedeco.javacpp.PointerScope@36f8e30
Debug: Registering org.tensorflow.internal.c_api.TF_Status[address=0x7fd332d3f960,position=0,limit=0,capacity=0,deallocator=org.bytedeco.javacpp.Pointer$DeallocatorReference[deallocator=org.tensorflow.internal.c_api.AbstractTF_Status$DeleteDeallocator[address=0x7fd332d3f960,position=0,limit=0,capacity=0,deallocator=null],count=0]]
Debug: Attaching org.tensorflow.internal.c_api.TF_Status[address=0x7fd332d3f960,position=0,limit=0,capacity=0,deallocator=org.bytedeco.javacpp.Pointer$DeallocatorReference[deallocator=org.tensorflow.internal.c_api.AbstractTF_Status$DeleteDeallocator[address=0x7fd332d3f960,position=0,limit=0,capacity=0,deallocator=null],count=0]] to org.bytedeco.javacpp.PointerScope@36f8e30
Debug: Closing org.bytedeco.javacpp.PointerScope@36f8e30
Debug: Releasing org.bytedeco.javacpp.Pointer$DeallocatorReference[deallocator=org.tensorflow.internal.c_api.AbstractTF_Status$DeleteDeallocator[address=0x7fd332d3f960,position=0,limit=0,capacity=0,deallocator=null],count=0]
Debug: Releasing org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x7fd33fda20a0,deallocatorAddress=0x7fd8c1469fe0]
Debug: Collecting org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x0,deallocatorAddress=0x0]
Debug: Closing org.bytedeco.javacpp.PointerScope@1dfc3697
Debug: Releasing org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x7fd730001d60,deallocatorAddress=0x7fd6e93de420]
Debug: Collecting org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x0,deallocatorAddress=0x0]
Debug: Releasing org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x7fd730001d40,deallocatorAddress=0x7fd6e93de520]
Debug: Collecting org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x0,deallocatorAddress=0x0]
Debug: Releasing org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x7fd730001b00,deallocatorAddress=0x7fd6e93de420]
Debug: Collecting org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x0,deallocatorAddress=0x0]
Debug: Releasing org.bytedeco.javacpp.Pointer$DeallocatorReference[deallocator=org.tensorflow.internal.c_api.AbstractTF_Buffer$DeleteDeallocator[address=0x7fd730001d20,position=0,limit=0,capacity=0,deallocator=null],count=0]
Debug: Releasing org.bytedeco.javacpp.Pointer$DeallocatorReference[deallocator=org.tensorflow.internal.c_api.AbstractTF_SessionOptions$DeleteDeallocator[address=0x7fd7300011f0,position=0,limit=0,capacity=0,deallocator=null],count=0]
Debug: Releasing org.bytedeco.javacpp.Pointer$DeallocatorReference[deallocator=org.tensorflow.internal.c_api.AbstractTF_Status$DeleteDeallocator[address=0x7fd7300011d0,position=0,limit=0,capacity=0,deallocator=null],count=0]
start releasing old model
closing tensorflow model
Debug: Releasing org.bytedeco.javacpp.Pointer$DeallocatorReference[deallocator=org.tensorflow.internal.c_api.AbstractTF_Session$DeleteDeallocator[address=0x7fd3e06927f0,position=0,limit=0,capacity=0,deallocator=null],count=0]
Debug: Opening org.bytedeco.javacpp.PointerScope@7916eb60
Debug: Registering org.tensorflow.internal.c_api.TF_Status[address=0x7fd800003410,position=0,limit=0,capacity=0,deallocator=org.bytedeco.javacpp.Pointer$DeallocatorReference[deallocator=org.tensorflow.internal.c_api.AbstractTF_Status$DeleteDeallocator[address=0x7fd800003410,position=0,limit=0,capacity=0,deallocator=null],count=0]]
Debug: Attaching org.tensorflow.internal.c_api.TF_Status[address=0x7fd800003410,position=0,limit=0,capacity=0,deallocator=org.bytedeco.javacpp.Pointer$DeallocatorReference[deallocator=org.tensorflow.internal.c_api.AbstractTF_Status$DeleteDeallocator[address=0x7fd800003410,position=0,limit=0,capacity=0,deallocator=null],count=0]] to org.bytedeco.javacpp.PointerScope@7916eb60
finish releasing old model
Debug: Closing org.bytedeco.javacpp.PointerScope@7916eb60
Debug: Releasing org.bytedeco.javacpp.Pointer$DeallocatorReference[deallocator=org.tensorflow.internal.c_api.AbstractTF_Status$DeleteDeallocator[address=0x7fd800003410,position=0,limit=0,capacity=0,deallocator=null],count=0]

I can see the keyword Releasing ... AbstractTF_Session$DeleteDeallocator in the logging, but the capacity is 0 and the deallocator is null; is this expected behavior? Also, I cannot find any logging for releasing the TF graph, even though oldBundle.close should also call graph.close:

public void close() {
  session.close();
  graph.close();
}  

zhenglaizhang avatar Aug 01 '22 08:08 zhenglaizhang

We can "close" objects that don't own anything, that's fine, it just won't do anything. There are also no "Collecting" messages with an address != 0, so it looks like everything that JavaCPP manages gets closed explicitly. Consequently, I would say the issue you're experiencing is probably not related to either TF or JavaCPP.

saudet avatar Aug 01 '22 10:08 saudet

Thanks @saudet for the clarification. Yes, based on my previous experiments it should be related to G1 GC behavior, and the heap objects are the main concern.

zhenglaizhang avatar Aug 01 '22 13:08 zhenglaizhang

@saudet I'm curious whether there are metrics that expose the native memory usage tracked by JavaCPP, or some approach to roughly estimate the native memory allocated through JavaCPP. We will run a few TF models in the same JVM process in a k8s pod, so understanding the native memory usage is very important for determining the memory limit for the pod.

zhenglaizhang avatar Aug 02 '22 15:08 zhenglaizhang

JavaCPP can't know what TensorFlow does internally; we need to query TensorFlow for that. IIRC, there are a few functions like that, but nothing in the public API. Maybe this has changed recently though.

saudet avatar Aug 03 '22 07:08 saudet

@saudet thanks for the reply. Btw, I have an estimation approach: I will split the JVM memory into 3 parts (JVM heap + JVM non-heap + other memory) and count the native memory allocated through JavaCPP in the "other memory" part, so that we have a rough estimate of the native memory used by our JVM. Does that make sense for JavaCPP?
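
One way to feed that "other memory" bucket (editor's sketch): org.bytedeco.javacpp.Pointer exposes static counters that can be published as application metrics. Note they only cover memory JavaCPP itself tracks; allocations made internally by the TF runtime will only show up in the resident-set figure, as mentioned above.

import org.bytedeco.javacpp.Pointer;

public class NativeMemoryMetrics {
    // Values that could be exported as gauges to a metrics system (hypothetical class name):
    public static long javacppTrackedBytes() { return Pointer.totalBytes(); }       // native memory JavaCPP allocated and still tracks
    public static long processResidentBytes() { return Pointer.physicalBytes(); }   // RSS of the whole JVM process
    public static long javacppPhysicalLimit() { return Pointer.maxPhysicalBytes(); } // threshold for JavaCPP's OutOfMemoryError
}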

zhenglaizhang avatar Aug 03 '22 07:08 zhenglaizhang