
Memory growth test of Tritonserver

Open jackyh opened this issue 3 years ago • 109 comments

We tried to test for memory growth by gathering memory usage statistics while doing inference. Each time we run an inference, we record how much memory was allocated. We found that: ("the maximum memory allocated during a single inference" - "the average memory allocated during a single inference") / ("the maximum memory allocated during a single inference") = 0.46, which means the variation is too big. Why? It varies from about 700MB to 1500MB.

Attached are the simple.java file and the test.sh script, which should reproduce this; you will need to modify the directories in test.sh accordingly.

@saudet

L0_memory_growth.zip

jackyh avatar Jan 28 '22 04:01 jackyh

There's a couple of things that could be happening, but the first thing you should check is for dangling Pointer objects. Try to run a command like this:

mvn clean compile exec:java -Dorg.bytedeco.javacpp.logger.debug -DargLine=-Xmx1000m 2>&1 | grep Collecting | grep -v 'ownerAddress=0x0'

If you see any output from that, you should find where those Pointer objects are not getting deallocated and call close() on them, or you could use PointerScope where appropriate: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/
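For illustration, a minimal sketch of those two patterns (the buffer size and variable names here are made up, not taken from Simple.java):

```java
import org.bytedeco.javacpp.BytePointer;
import org.bytedeco.javacpp.PointerScope;

public class ScopeExample {
    public static void main(String[] args) {
        // Any Pointer allocated inside this block gets attached to the scope
        // and is deallocated when the scope closes, instead of waiting for the GC.
        try (PointerScope scope = new PointerScope()) {
            BytePointer input = new BytePointer(1024);  // placeholder buffer
            // ... pass "input" to native code here ...
        } // "input" is deallocated here

        // Alternatively, deallocate a single Pointer explicitly:
        BytePointer output = new BytePointer(1024);
        try {
            // ... use "output" ...
        } finally {
            output.close();  // same as deallocate()
        }
    }
}
```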

saudet avatar Jan 28 '22 05:01 saudet

Looks like there are quite a few there. Since there are lots of "new BytePointer" calls in Simple.java, which ones do I need to call close() on? client_10.log

jackyh avatar Jan 28 '22 14:01 jackyh

Just try to use PointerScope...

saudet avatar Jan 28 '22 14:01 saudet

Will it help to release/close them by itself?

jackyh avatar Jan 28 '22 14:01 jackyh

Kind of, it's like a scope in C++, see the example here: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/

saudet avatar Jan 28 '22 14:01 saudet

Kind of, it's like a scope in C++, see the example here: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/

I did some tests today with "PointerScope", as attached. But memory usage still fluctuates: ("the maximum memory allocated during a single inference" - "the average memory allocated during a single inference") / ("the maximum memory allocated during a single inference") = 0.52, which means the variation is still too big. It varies from about 50MB to 500MB. Do we have more ways to debug this? 20220129_PointerScope.zip

jackyh avatar Jan 29 '22 10:01 jackyh

Please check the debug log like I asked you to do above https://github.com/bytedeco/javacpp-presets/issues/1141#issuecomment-1023895781

saudet avatar Jan 29 '22 11:01 saudet

Please check the debug log like I asked you to do above #1141 (comment)

Do you mean that even with "PointerScope" added, there can still be pointer leaks if the "PointerScope" does not cover all the pointers?

jackyh avatar Jan 29 '22 11:01 jackyh

I added "try (PointerScope scope = new PointerScope()) {" just at the beginning of the Main function, why there's still lots of "Debug: Collecting org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x0,deallocatorAddress=0x0]" in the output log file?

jackyh avatar Jan 29 '22 15:01 jackyh

Those are fine, their ownerAddress is 0. If you see any that have an address other than 0, then you should find what those are. If all the ones you see have no address, then you're probably dealing with GC issues on the Java heap. Try a different garbage collector: https://developers.redhat.com/articles/2021/11/02/how-choose-best-java-garbage-collector
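For example, assuming an OpenJDK 11 build (where ZGC and Shenandoah are still experimental, and Shenandoah is not included in every build) and assuming the project passes argLine through to the JVM the way the earlier mvn command does, one of these flag sets could be appended to -DargLine alongside -Xmx1000m:

```
-XX:+UseParallelGC
-XX:+UnlockExperimentalVMOptions -XX:+UseZGC
-XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC
```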

saudet avatar Jan 30 '22 00:01 saudet

BTW, how did you make sure this is happening only with Java, and not with C++? Maybe it's a problem with Triton...

saudet avatar Jan 30 '22 04:01 saudet

BTW, how did you make sure this is happening only with Java, and not with C++? Maybe it's a problem with Triton...

good point.

jackyh avatar Jan 30 '22 13:01 jackyh

I searched the log file:
Debug: Releasing org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x7f29bf518190,deallocatorAddress=0x7f29c7ec4090]
Debug: Collecting org.bytedeco.javacpp.Pointer$NativeDeallocator[ownerAddress=0x0,deallocatorAddress=0x0]
All of the "Collecting" entries have ownerAddress=0x0.

jackyh avatar Jan 30 '22 13:01 jackyh

Samuel: We designed our test case like this:

  1. We start a thread to monitor memory usage:
     a. the thread samples memory usage every two seconds;
     b. each sample is fed into the statistics:
        DoubleSummaryStatistics stats = new DoubleSummaryStatistics();
        double memory = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
        stats.accept(memory);
     c. each time, the delta is calculated as:
        double memory_allocation_delta = stats.getMax() - stats.getAverage();
        double memory_allocation_delta_mb = memory_allocation_delta / 1E6;
        double memory_allocation_delta_percent = memory_allocation_delta / stats.getMax();
        If "memory_allocation_delta_percent" is larger than 10 percent, the test fails.

  2. In the main process, we run: for (int i = 0; i < 1000000; i++) { RunInference(server, model_name, is_int, is_torch_model); }

We assume that, every two seconds, a number of "RunInference" calls will have been processed, with some memory allocated and some memory freed during each call, so the variation of "memory_allocation_delta_percent" should not be larger than 10%. What do you think? Is this the right way to test memory growth of a Java process? (A sketch of this test loop is included below.)
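For reference, a minimal runnable sketch of the monitor and loop described above; the RunInference call and its arguments are placeholders taken from this description, not a real API, and the measurement only sees the Java heap, not native memory allocated by Triton:

```java
import java.util.DoubleSummaryStatistics;

public class MemoryGrowthSketch {
    public static void main(String[] args) {
        DoubleSummaryStatistics stats = new DoubleSummaryStatistics();

        // Take one initial sample so the summary is never empty.
        stats.accept(Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory());

        // Monitor thread: sample Java-heap usage every two seconds.
        Thread monitor = new Thread(() -> {
            while (true) {
                double memory = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
                synchronized (stats) {
                    stats.accept(memory);
                }
                try {
                    Thread.sleep(2000);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        monitor.setDaemon(true);
        monitor.start();

        for (int i = 0; i < 1000000; i++) {
            // RunInference(server, model_name, is_int, is_torch_model);  // placeholder for the real call
        }

        double delta, max;
        synchronized (stats) {
            delta = stats.getMax() - stats.getAverage();
            max = stats.getMax();
        }
        System.out.printf("delta = %.1f MB (%.0f%% of max)%n", delta / 1E6, 100.0 * delta / max);
    }
}
```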

@saudet

jackyh avatar Feb 03 '22 13:02 jackyh

Well, that's a question about Triton more than anything else, I think. All buffers should be preallocated as much as possible, so variations like that don't occur.

saudet avatar Feb 03 '22 21:02 saudet

Well, that's a question about Triton more than anything else, I think. All buffers should be preallocated as much as possible, so variations like that don't occur.

Since each time we do inference, in the "RunInference" function, we allocate lots of buffers, do you mean these need to be replaced by some static/preallocated memory? If we re-allocate these buffers each time we do inference, is this variation normal for a Java process?

jackyh avatar Feb 04 '22 13:02 jackyh

That has nothing to do with Java! You're allocating these buffers for Triton, not Java. This is something that needs to be fixed for Triton.

saudet avatar Feb 04 '22 13:02 saudet

That has nothing to do with Java! You're allocating these buffers for Triton, not Java. This is something that needs to be fixed for Triton.

Yes, we allocate these buffers for Triton to do inference or to compare some results. So, let's say I allocate these buffers as static/preallocated ones and the variation issue goes away; would that mean the GC is not working well enough?

jackyh avatar Feb 04 '22 13:02 jackyh

Preallocating and reusing objects that use memory on the Java heap helps the GC, but it's possible to tune the GC to be able to cope better with larger amounts of garbage too, yes.

saudet avatar Feb 04 '22 22:02 saudet

Preallocating and reusing objects that use memory on the Java heap helps the GC, but it's possible to tune the GC to be able to cope better with larger amounts of garbage too, yes.

So you mean the ways listed here to tune the GC for larger amounts of garbage? https://developers.redhat.com/articles/2021/11/02/how-choose-best-java-garbage-collector#parallel_collector

jackyh avatar Feb 05 '22 13:02 jackyh

That kind of thing, yes, but if the requests that you get don't require allocating different kinds of buffers all the time, it's more efficient to just reuse those buffers. That's probably what your users are asking about.
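As an illustration only (the buffer size and names are made up, and this is not the actual Simple.java code), reusing a single native buffer across inferences would look roughly like this:

```java
import org.bytedeco.javacpp.BytePointer;

public class ReuseExample {
    // Allocate once, outside the inference loop.
    static final BytePointer inputBuffer = new BytePointer(16 * 1024);  // size is a placeholder

    static void runInference(byte[] requestData) {
        // Reuse the same native buffer instead of calling "new BytePointer" per request.
        inputBuffer.position(0).put(requestData, 0, requestData.length);
        // ... hand inputBuffer to Triton here ...
    }

    public static void main(String[] args) {
        byte[] request = new byte[1024];
        for (int i = 0; i < 1000; i++) {
            runInference(request);
        }
        inputBuffer.close();  // deallocate when done
    }
}
```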

saudet avatar Feb 05 '22 14:02 saudet

Here are the default JVM parameters:

root@4a42d065cf6e:/workspace/javacpp_presets_upstream/javacpp-presets/tritonserver# java -XX:+PrintCommandLineFlags -version
-XX:G1ConcRefinementThreads=10 -XX:GCDrainStackTargetSize=64 -XX:InitialHeapSize=524877248 -XX:MaxHeapSize=8398035968 -XX:+PrintCommandLineFlags -XX:ReservedCodeCacheSize=251658240 -XX:+SegmentedCodeCache -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)

Which parameters do you think probably need tuning?

jackyh avatar Feb 06 '22 10:02 jackyh

That kind of thing, yes, but if the requests that you get don't require allocating different kinds of buffers all the time, it's more efficient to just reuse those buffers. That's probably what your users are asking about.

So, since I want to make the largest allocated memory static/preallocated, how can I know which buffer/object is the largest one?

jackyh avatar Feb 06 '22 10:02 jackyh

Not just the largest one, all of them, if possible. I'm guessing that ideally your users want this to be "garbage free" to get the lowest latency possible, for real time applications, but I'm just guessing. You should try to find out what the needs of your users are, and then we can figure out how to meet those needs.

saudet avatar Feb 06 '22 11:02 saudet

Not just the largest one, all of them, if possible. I'm guessing that ideally your users want this to be "garbage free" to get the lowest latency possible, for real time applications, but I'm just guessing. You should try to find out what the needs of your users are, and then we can figure out how to meet those needs.

For now, this test is just internal. Will users likely have that sort of requirement? I'm not sure what Java users will care about most.

jackyh avatar Feb 06 '22 13:02 jackyh

For now, this test is just internal. Will users likely have that sort of requirement? I'm not sure what Java users will care about most.

Well, if what you care most about is money, HFT is where it's at for low-latency Java applications: https://www.efinancialcareers.com/news/2020/11/low-latency-java-trading-systems https://medium.com/@jadsarmo/why-we-chose-java-for-our-high-frequency-trading-application-600f7c04da94 https://www.azul.com/use-cases/trading-risk/ https://github.com/OpenHFT

But personally I prefer working on embedded systems such as the ones from aicas: https://www.aicas.com/wp/use-cases/ @jjh-aicas Do you have use cases where machine learning and GPUs could be of help?

saudet avatar Feb 07 '22 00:02 saudet

Samuel:

Today I did more tests on GC and Heap:

  1. Command line arg is: -DargLine=-Xmx1000m
  2. While the test is running, memory allocated (heap) grows gradually from about 60M to 4000M. (Here, memory allocated is calculated as: Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory().) Details are attached as client.log
  3. GC info is collected with the command: jstat -gc 10524 500. Looks like "OU" grows fast! Details are attached as gc.log

Why "OU" grows fast here? @saudet client.log gc.log

jackyh avatar Feb 07 '22 13:02 jackyh

That's the "old space" apparently: https://docs.oracle.com/javase/7/docs/technotes/tools/share/jstat.html Here's some doc about that: https://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/garbage_collect.html So it just looks like there are buffers that can't be freed because they are still referenced somewhere.

saudet avatar Feb 07 '22 14:02 saudet

That's the "old space" apparently: https://docs.oracle.com/javase/7/docs/technotes/tools/share/jstat.html Here's some doc about that: https://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/garbage_collect.html So it just looks like there are buffers that can't be freed because they are still referenced somewhere.

Then how can I quickly locate which APIs/calls still reference these buffers?

jackyh avatar Feb 08 '22 03:02 jackyh

Flight Recorder can usually help with that: https://docs.oracle.com/javase/9/troubleshoot/troubleshoot-memory-leaks.htm#JSTGD271 https://developers.redhat.com/blog/2020/08/25/get-started-with-jdk-flight-recorder-in-openjdk-8u
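For example, with OpenJDK 11 (where JFR is built in; the pid, recording name, and file names below are placeholders), a recording can be started at launch or attached to the running process, and the resulting .jfr file opened in JDK Mission Control to see which objects keep the old generation alive:

```
# record for two minutes starting at JVM launch
java -XX:StartFlightRecording=duration=120s,filename=growth.jfr ...

# or attach to an already running process
jcmd <pid> JFR.start name=growth settings=profile
jcmd <pid> JFR.dump name=growth filename=growth.jfr
jcmd <pid> JFR.stop name=growth
```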

saudet avatar Feb 08 '22 03:02 saudet