dd-trace-java
Continuously increasing off-heap 'Tracing' memory
We use the latest Datadog Java Tracer (fetched from https://dtdg.co/latest-java-tracer) in our Java/Spring Boot application, included with the following JVM parameters:

```
-javaagent:/opt/dd-java-agent.jar
-Ddd.profiling.enabled=true
-XX:FlightRecorderOptions=stackdepth=256
-Ddd.logs.injection=true
-Ddd.trace.sample.rate=1
```
```
$ java -version
openjdk version "18.0.2" 2022-07-19
OpenJDK Runtime Environment (build 18.0.2+9-61)
OpenJDK 64-Bit Server VM (build 18.0.2+9-61, mixed mode, sharing)
```
The application is running within a Docker container in AWS Elastic Beanstalk.
We observed that `docker.mem.rss` is continuously increasing over time, whereas `jvm.heap_memory` and `jvm.non_heap_memory` stay constant (after a ~1 day 'warm-up' period). After ~10-15 days, the container RSS reaches the configured memory limit and the container is killed and restarted.
Further investigation (using Java Native Memory Tracking) revealed that it is the off-heap memory area called 'Tracing' that grows over time. We observed up to ~130 MB of allocated memory in that area after ~10 days.
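For reference, this is how Native Memory Tracking can be enabled and queried to watch the 'Tracing' area (the PID and application entry point are placeholders; NMT adds some runtime overhead):

```shell
# Start the JVM with native memory tracking enabled, e.g.:
#   java -XX:NativeMemoryTracking=summary -javaagent:/opt/dd-java-agent.jar ... MyApp
#
# Then query the running JVM; the 'Tracing' area appears in the summary:
jcmd <pid> VM.native_memory summary

# Optionally record a baseline and diff against it later to see which
# areas have grown:
jcmd <pid> VM.native_memory baseline
jcmd <pid> VM.native_memory summary.diff
```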
With `-Ddd.profiling.enabled=false` the problem does not occur ('Tracing' memory stays constant at 32 KB).
In the Datadog Agent's (v7.38.2, Docker) logs we see no obvious problems (apart from many `CPU threshold exceeded` warnings).
What can we do to prevent this 'Tracing' memory leak with activated profiling?
Hello, thank you for reporting this. The Datadog continuous Java profiler uses JFR behind the scenes, and that seems to be where the leak happens.
I have filed an OpenJDK ticket tracking this problem.
The underlying issue was fixed in OpenJDK 20. Since JDK 18 is not an LTS version, it is very likely that the fix won't be backported, though :(
Thank you for the information and for your support. We will see how we can handle the problem until the release of OpenJDK 20.
Hi @torstenmandry, we have run into the same issue. Did you find a way to make it work, or are you stuck on JDK 17?
@OleBilleAtBS the memory leak is not present unless you override the default stack depth with `-XX:FlightRecorderOptions=stackdepth=<depth>`, though using the default may result in an unacceptable number of truncated stack frames.
We are working on releasing a new CPU profiler which does not rely on JFR and so bypasses this bug. If you want to try it, you can enable it with `-Ddd.profiling.async.enabled=true`, but be sure to test it thoroughly in a staging environment, as it is not GA yet.
@richardstartin Nice! 👍
Hey team, could you add the profiling label to this issue?
It seems that async profiling is enabled by default since 1.3.0 and switched to https://github.com/DataDog/java-profiler since 1.5.0. Does that mean this is effectively "fixed"?
We are seeing this same behavior in version `1.10.0`. Disabling profiling via `-Ddd.profiling.enabled=false` prevents the container from steadily increasing memory consumption. We are planning on rolling back to `1.9.0` so that we can re-enable profiling. Is this a known issue in the latest release?
Hi @donholly, could you let us know which JVM version you are on, please?
Hi @richardstartin - we're on 17.0.5
@donholly then unfortunately what you have encountered is unrelated to this issue (which is a memory leak in JFR in JDK 18 and JDK 19, which we can't do anything about; the best mitigation is to upgrade to JDK 20 or go back to JDK 17 for now).
I am investigating your report as a priority and will create a new issue.
This issue is caused by JDK-8293167, which affects JFR (built into the JDK). It can only affect you if both of the following apply:
- you are running JDK 18 or JDK 19
- you are using a non-default JFR stackdepth
There is nothing we can do to resolve this in dd-trace-java, as it is a bug in OpenJDK. The end-user mitigations are any of the following:
- downgrade to JDK17
- use the default JFR stackdepth
- wait for the fix in JDK20
If you are not using JDK18 or JDK19 and encounter any symptoms of a memory leak, then please report this in a new issue.
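The two conditions above can be checked mechanically. The sketch below is a hypothetical helper (not part of dd-trace-java) that, given a JDK feature version and the configured JFR stackdepth, reports whether a deployment matches the conditions under which JDK-8293167 can leak:

```shell
# Hypothetical helper: report whether a JDK feature version plus a JFR
# stackdepth setting match the conditions for the JDK-8293167 leak.
# Affected only on JDK 18/19 AND with a non-default stackdepth.
jfr_leak_affected() {
  local jdk_feature="$1"   # e.g. 17, 18, 19, 20
  local stackdepth="$2"    # "default", or a custom value like 256
  if [ "$jdk_feature" -ge 18 ] && [ "$jdk_feature" -le 19 ] \
     && [ "$stackdepth" != "default" ]; then
    echo "affected"
  else
    echo "not affected"
  fi
}

jfr_leak_affected 18 256      # affected
jfr_leak_affected 17 256      # not affected (JDK 17 is unaffected)
jfr_leak_affected 19 default  # not affected (default stackdepth)
```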