1 vCPU always stuck at 100%
Current Behavior
I'm currently running a dockerized version of Dependency-Track (apiserver + frontend + PostgreSQL) on a homelab, and one vCPU is permanently stuck at 100% due to the Dependency-Track apiserver Java runtime.
I've used strace to attach to the PID and observed that the DT thread seems to be in an infinite loop of sched_yield() calls, with apparently no delay.
Maybe there's an event loop in the DT apiserver code that, for some reason, starts polling for events non-stop, saturating one CPU at 100%?
I've recreated my Dependency-Track instance (data/configs) from scratch to confirm whether the issue was on my side. Other people describe the same symptoms in https://github.com/DependencyTrack/dependency-track/issues/1370, although the original post there is not the same behavior I experienced.
My instance of Dependency-Track is only loaded with ~45 projects, and I've reproduced this with both the PostgreSQL and the embedded database modes (to rule out database issues).
Steps to Reproduce
- Create a dockerized setup of Dependency-Track (mine uses named local volumes; not sure if it matters).
- Check Dependency-Track CPU usage.
Expected Behavior
Dependency-Track should not keep a CPU pegged at 100% permanently.
Dependency-Track Version
4.13.2
Dependency-Track Distribution
Container Image
Database Server
PostgreSQL
Database Server Version
16.9
Browser
Mozilla Firefox
Checklist
- [x] I have read and understand the contributing guidelines
- [x] I have checked the existing issues for whether this defect was already reported
Quick Google search surfaced this: https://support.hpe.com/hpesc/public/docDisplay?docId=c01937191&docLocale=en_US
This can be caused by too many parallel GC threads. [...] Whenever a parallel GC occurs, all these threads compete for the CPUs, but not all can get onto a CPU because there are other processes and threads running which also want CPU time.
As a result, some of the GC threads cannot run, so they go back to sleep for a short time, try again to get CPU etc. In this situation they are producing a lot of CPU load without actually getting much work done.
Reducing the number of GC threads should solve the issue.
DT uses ParallelGC by default. Your system has 8 CPU cores. The ParallelGC docs say:
On a machine with <N> hardware threads where <N> is greater than 8, the parallel collector uses a fixed fraction of <N> as the number of garbage collector threads. The fraction is approximately 5/8 for large values of <N>. At values of <N> below 8, the number used is <N>.
Following this, try forcing fewer GC threads using the EXTRA_JAVA_OPTIONS env var:
EXTRA_JAVA_OPTIONS="-XX:ParallelGCThreads=4"
Also consider trying even fewer threads than 4.
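To see what that rule works out to, here is a small, hypothetical Java sketch approximating the documented default (not HotSpot's actual code): with 8 hardware threads the default is 8 parallel GC threads, which is why capping them explicitly can matter on a small host.

public class GcThreadEstimate {
    public static void main(String[] args) {
        // Hardware threads visible to the JVM inside the container.
        int n = Runtime.getRuntime().availableProcessors();
        // Approximation of the documented ParallelGC default:
        // N threads for N <= 8, otherwise roughly 8 + 5/8 of the threads above 8.
        int gcThreads = (n <= 8) ? n : 8 + (n - 8) * 5 / 8;
        System.out.println(n + " hardware threads -> ~" + gcThreads + " parallel GC threads by default");
    }
}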
For completeness, I am not observing this behavior on my machine.
When you say:
[...] permanently have 1 vCPU stuck at 100%
do you really mean like 24/7, or just "for a long time"?
There are certain tasks, such as mirroring the NVD, which are CPU-intensive due to all the decompression and JSON parsing, and which may run on a single core until completion. Depending on how fast your machine is, that operation might take a while to complete.
Following this, try forcing fewer GC threads using the EXTRA_JAVA_OPTIONS env var:
EXTRA_JAVA_OPTIONS="-XX:ParallelGCThreads=4"
Also consider trying even fewer threads than 4.
Thanks for the help. I've tried this, but the issue is still there (1 CPU stuck at 100%), both with -XX:ParallelGCThreads=1 and -XX:ParallelGCThreads=2.
When you say:
[...] permanently have 1 vCPU stuck at 100%
do you really mean like 24/7, or just "for a long time"?
I mean 24/7: as long as Dependency-Track is running, 1 CPU is always stuck, which is weird. I also have OIDC auth enabled; I disabled it to see if it had any impact, but no, the issue is still there.
The process is being launched with these args:
java -XX:+UseParallelGC -XX:+UseStringDeduplication -XX:MaxRAMPercentage=90.0 -XX:ParallelGCThreads=2 -Dalpine.application.properties=/config/application.properties --add-opens java.base/java.util.concurrent=ALL-UNNAMED -Dlogback.configurationFile=/dev/stdout -DdependencyTrack.logging.level=WARN -jar dependency-track-apiserver.jar -context
My config is pretty standard, but I'll try to create a proof-of-concept docker-compose stack + config where I can see the problem (probably during the weekend), to check whether this is reproducible in other environments.
A reproducer would be great! I assume you already checked the container logs for any errors or otherwise suspicious behavior?
A reproducer would be great! I assume you already checked the container logs for any errors or otherwise suspicious behavior?
Yes, I haven't seen anything that would explain this endless looping; everything seems normal, although I'm not sure exactly what to look for.
I'll try to put together a minimal reproducible setup; that's probably the best way forward, since I have no logs or any useful output other than the strace sched_yield() calls.
You can also try pulling a thread dump so we can see what the Java threads are doing. This is a bit more involved since our images only ship with the JRE, and the tools needed for thread dumps only come with the full-blown JDK.
docker exec -it -u 0 <apiserver-container-id> bash
cd /opt
wget -O jdk.tar.gz --no-check-certificate 'https://api.adoptium.net/v3/binary/version/jdk-21.0.7%2B6/linux/x64/jdk/hotspot/normal/eclipse?project=jdk'
tar -xzf jdk.tar.gz
cd jdk-21.0.7+6
apt install -y sudo
sudo -u dtrack ./bin/jhsdb jstack --pid 1
This should yield a stack trace for each thread in the application.
Quick update: I've tracked the problem down to the alpine.watchdog.logging.interval option.
I had set alpine.watchdog.logging.interval = 300 in my config, and this seems to be what causes the issue.
Once I disabled it (alpine.watchdog.logging.interval = 0), I no longer have any issues.
I've tried with other values (e.g. 1 and 5), and any value above 0 triggers the issue.
Got the stack trace for the problematic thread; it seems to point to org.owasp.security.logging.util.DefaultIntervalLoggerController.run():
"Thread-22" #66 prio=5 tid=0x00007cffa9df8ca0 nid=89 runnable [0x00007cff298fb000]
java.lang.Thread.State: RUNNABLE
JavaThread state: _thread_in_native
- java.lang.Thread.yield0() @bci=0 (Compiled frame; information may be imprecise)
- java.lang.Thread.yield() @bci=23 (Compiled frame)
- org.owasp.security.logging.util.DefaultIntervalLoggerController.run() @bci=25, line=91 (Compiled frame)
- java.lang.Thread.runWith(java.lang.Object, java.lang.Runnable) @bci=5 (Interpreted frame)
- java.lang.Thread.run() @bci=19 (Interpreted frame)
Interesting, I never used the watchdog logger, and it defaults to 0 (disabled). I'll see if I can reproduce this by assigning a value >0.
But yeah, looking at the code, your observation makes sense. It is busy-looping...
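For illustration only, a simplified Java sketch of that busy-wait pattern, based on the stack trace above (not the actual org.owasp.security.logging source): the thread spins on Thread.yield() until the next logging interval is due, so it never blocks and keeps one core at 100% even though it does almost no work.

// Simplified, hypothetical sketch of the busy-wait pattern seen in the stack trace.
class IntervalLoggerSketch implements Runnable {
    private final long intervalMillis;
    private volatile boolean stopped;

    IntervalLoggerSketch(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    @Override
    public void run() {
        long nextLog = System.currentTimeMillis() + intervalMillis;
        while (!stopped) {
            if (System.currentTimeMillis() < nextLog) {
                // Busy-wait: yield() returns immediately when nothing else wants
                // the core, so this loop spins and pegs one vCPU at 100%.
                Thread.yield();
                continue;
            }
            System.out.println("watchdog status message");
            nextLog = System.currentTimeMillis() + intervalMillis;
        }
    }

    void stop() {
        stopped = true;
    }
}

A blocking wait between messages, e.g. Thread.sleep(intervalMillis) or a ScheduledExecutorService, would leave the thread idle instead. This also matches the observation that setting the interval to 0 (watchdog disabled) makes the load disappear.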
I am inclined to just remove the watchdog altogether. There are better means such as the /metrics and /health endpoints to monitor the system status now. Seems to be kind of a footgun to leave it in like this.
I only had this enabled to increase the reliability of the application, but yes, the /health and /metrics endpoints are a more modern approach to this and easier to integrate with external monitoring systems.
If it makes sense, maybe /health could be extended in the future to check logging health, since a user can configure logging paths via LOGGING_CONFIG_PATH?