
1 vCPU always stuck at 100%

Open · lmrts opened this issue 6 months ago · 12 comments

Current Behavior

I'm currently running a dockerized version of Dependency-Track (apiserver+frontend+postgresql) on a homelab and permanently have 1 vCPU stuck at 100% due to the Dependency-Track apiserver java runtime.

I've used strace to attach to the pid and observed that the DT thread seems to be in an infinite loop of sched_yield() calls, apparently with no delay. Maybe there's an event loop in the DT apiserver code that, for some reason, starts polling for events non-stop, saturating one CPU at 100% usage?
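For reference, this is the kind of invocation I used to observe it (the PID is a placeholder for the apiserver JVM's host PID):

# Attach to the JVM and all its threads and tally syscalls; stop with Ctrl-C.
# A busy-looping thread shows up as an enormous sched_yield count.
strace -f -c -p <apiserver-pid>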

I've recreated my Dependency-Track instance (data/configs) from scratch to confirm whether the issue was on my side. Other people have described similar issues in https://github.com/DependencyTrack/dependency-track/issues/1370, although the original post there does not match the behavior I experienced.

My instance of Dependency-Track is only loaded with ~45 projects, and I've reproduced this with both the PostgreSQL and the embedded database modes (to rule out database issues).


Steps to Reproduce

  1. Create a dockerized setup of Dependency-Track (a minimal sketch follows below) - I have mine with named (local) volumes, not sure if it matters.
  2. Check Dependency-Track CPU usage.
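
Roughly, my setup boils down to the sketch below. This is a minimal illustration, not my exact stack; credentials, tags, ports and volume names are placeholders:

docker network create dtrack

docker run -d --name dtrack-postgres --network dtrack \
  -e POSTGRES_DB=dtrack -e POSTGRES_USER=dtrack -e POSTGRES_PASSWORD=changeme \
  -v dtrack-postgres:/var/lib/postgresql/data \
  postgres:16.9

docker run -d --name dtrack-apiserver --network dtrack -p 8081:8080 \
  -e ALPINE_DATABASE_MODE=external \
  -e ALPINE_DATABASE_URL="jdbc:postgresql://dtrack-postgres:5432/dtrack" \
  -e ALPINE_DATABASE_DRIVER=org.postgresql.Driver \
  -e ALPINE_DATABASE_USERNAME=dtrack \
  -e ALPINE_DATABASE_PASSWORD=changeme \
  -v dtrack-apiserver:/data \
  dependencytrack/apiserver:4.13.2

# Watch the per-container CPU usage; one core stays pegged:
docker stats dtrack-apiserver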

Expected Behavior

Dependency-Track should not pin any CPU at 100% usage permanently.

Dependency-Track Version

4.13.2

Dependency-Track Distribution

Container Image

Database Server

PostgreSQL

Database Server Version

16.9

Browser

Mozilla Firefox

lmrts avatar Jun 03 '25 20:06 lmrts

Quick Google search surfaced this: https://support.hpe.com/hpesc/public/docDisplay?docId=c01937191&docLocale=en_US

This can be caused by too many parallel GC threads. [...] Whenever a parallel GC occurs, all these threads compete for the CPUs, but not all can get onto a CPU because there are other processes and threads running which also want CPU time.

As a result, some of the GC threads cannot run, so they go back to sleep for a short time, try again to get CPU etc. In this situation they are producing a lot of CPU load without actually getting much work done.

Reducing the number of GC threads should solve the issue.

DT uses ParallelGC per default. Your system has 8 CPU cores. The ParallelGC docs say:

On a machine with <N> hardware threads where <N> is greater than 8, the parallel collector uses a fixed fraction of <N> as the number of garbage collector threads.

The fraction is approximately 5/8 for large values of <N>. At values of <N> below 8, the number used is <N>.

Following this, try forcing fewer GC threads using the EXTRA_JAVA_OPTIONS env var:

EXTRA_JAVA_OPTIONS="-XX:ParallelGCThreads=4"

Also consider trying even fewer threads than 4.
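
If you run the container via docker run, that would look roughly like this (container name and tag are placeholders; the second command just prints PID 1's actual command line so you can confirm the flag reached the JVM):

docker run -d --name dtrack-apiserver \
  -e "EXTRA_JAVA_OPTIONS=-XX:ParallelGCThreads=4" \
  dependencytrack/apiserver:4.13.2

# Verify the JVM was actually launched with the flag:
docker exec dtrack-apiserver sh -c "tr '\0' ' ' < /proc/1/cmdline; echo"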

nscuro avatar Jun 03 '25 21:06 nscuro

For completeness, I am not observing this behavior on my machine.

When you say:

[...] permanently have 1 vCPU stuck at 100%

do you really mean like 24/7, or just "for a long time"?

There are certain tasks, such as mirroring the NVD, which are CPU-intensive due to all the decompression and JSON parsing, and which may run on a single core until completion. Depending on how fast your machine is, that operation might take a while to complete.

nscuro avatar Jun 04 '25 12:06 nscuro

Following this, try forcing fewer GC threads using the EXTRA_JAVA_OPTIONS env var:

EXTRA_JAVA_OPTIONS="-XX:ParallelGCThreads=4"

Also consider trying even fewer threads than 4.

Thanks for the help. I've tried this, but the issue is still there (1 CPU stuck at 100%), with both -XX:ParallelGCThreads=1 and -XX:ParallelGCThreads=2.

When you say:

[...] permanently have 1 vCPU stuck at 100%

do you really mean like 24/7, or just "for a long time"?

I mean 24/7: as long as Dependency-Track is running, 1 CPU is always stuck, which is weird. I also have OIDC auth enabled; I disabled it to see if it had any impact, but no, the issue is still there.

The process is being launched with these args:

java -XX:+UseParallelGC -XX:+UseStringDeduplication -XX:MaxRAMPercentage=90.0 -XX:ParallelGCThreads=2 -Dalpine.application.properties=/config/application.properties --add-opens java.base/java.util.concurrent=ALL-UNNAMED -Dlogback.configurationFile=/dev/stdout -DdependencyTrack.logging.level=WARN -jar dependency-track-apiserver.jar -context

My config is pretty standard, but I'll try to create a proof-of-concept docker-compose stack + config that exhibits the problem (probably during the weekend), to see if this is reproducible in other environments.

lmrts avatar Jun 04 '25 14:06 lmrts

A reproducer would be great! I assume you already checked the container logs for any errors or otherwise suspicious behavior?

nscuro avatar Jun 04 '25 15:06 nscuro

A reproducer would be great! I assume you already checked the container logs for any errors or otherwise suspicious behavior?

Yes, and I haven't seen anything that would explain this endless looping; everything seems normal, although I'm not sure exactly what to look for. I'll try to put together a minimal reproducible setup; that's probably the best way forward, since I get no logs or any useful output other than the strace sched_yield() calls.

lmrts avatar Jun 04 '25 15:06 lmrts

You can also try pulling a thread dump so we can see what the Java threads are doing. This is a bit more involved, since our images only ship the JRE, and the tools needed for thread dumps only come with the full-blown JDK.

docker exec -it -u 0 <apiserver-container-id> /bin/bash
cd /opt
wget -O jdk.tar.gz --no-check-certificate 'https://api.adoptium.net/v3/binary/version/jdk-21.0.7%2B6/linux/x64/jdk/hotspot/normal/eclipse?project=jdk'
tar -xzf jdk.tar.gz
cd jdk-21.0.7+6
apt update && apt install -y sudo
sudo -u dtrack ./bin/jhsdb jstack --pid 1

This should yield a stack trace for each thread in the application.
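
A lighter-weight alternative that needs no JDK: HotSpot prints a full thread dump to stdout when it receives SIGQUIT, and since the official image runs the JVM as PID 1, the dump ends up in the container logs:

docker exec <apiserver-container-id> sh -c 'kill -QUIT 1'
docker logs --tail 300 <apiserver-container-id>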

nscuro avatar Jun 04 '25 15:06 nscuro

Quick update: I've tracked the problem down to the alpine.watchdog.logging.interval option. I had set alpine.watchdog.logging.interval = 300 in my config, and this is what causes the issue. Once I disabled it (alpine.watchdog.logging.interval = 0) I no longer have any issues.

I've tried other values (e.g. 1 and 5); any value above 0 triggers the issue.
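
For reference, the relevant lines in my mounted config (the path matches the -Dalpine.application.properties flag shown earlier; comments are mine):

# /config/application.properties
# This reproduces the spin (any value above 0 does):
#alpine.watchdog.logging.interval = 300
# Workaround - disable the watchdog logger:
alpine.watchdog.logging.interval = 0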

lmrts avatar Jun 04 '25 21:06 lmrts

Got the stack trace for the problematic thread; it seems to point to org.owasp.security.logging.util.DefaultIntervalLoggerController.run():

"Thread-22" #66 prio=5 tid=0x00007cffa9df8ca0 nid=89 runnable [0x00007cff298fb000]
   java.lang.Thread.State: RUNNABLE
   JavaThread state: _thread_in_native
 - java.lang.Thread.yield0() @bci=0 (Compiled frame; information may be imprecise)
 - java.lang.Thread.yield() @bci=23 (Compiled frame)
 - org.owasp.security.logging.util.DefaultIntervalLoggerController.run() @bci=25, line=91 (Compiled frame)
 - java.lang.Thread.runWith(java.lang.Object, java.lang.Runnable) @bci=5 (Interpreted frame)
 - java.lang.Thread.run() @bci=19 (Interpreted frame)
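
In case it helps anyone: I matched the hot thread to this dump by listing per-thread CPU inside the container; the TID of the thread pegged at ~100% corresponds to the nid (89) in the header above.

docker exec <apiserver-container-id> top -H -b -n 1 | head -n 25
# If top is missing in the image, install procps as root first:
# docker exec -it -u 0 <apiserver-container-id> sh -c 'apt update && apt install -y procps'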

lmrts avatar Jun 04 '25 21:06 lmrts

Interesting, I've never used the watchdog logger, and it defaults to 0 (disabled). I'll see if I can reproduce this by setting a value > 0.

nscuro avatar Jun 05 '25 07:06 nscuro

But yeah, looking at the code, your observation makes sense. It is busy-looping...

nscuro avatar Jun 05 '25 07:06 nscuro

I am inclined to just remove the watchdog altogether. There are better means to monitor system status now, such as the /metrics and /health endpoints. Leaving it in like this seems like kind of a footgun.

nscuro avatar Jun 05 '25 10:06 nscuro

I only had this enabled to increase the reliability of the application, but yes, /health and /metrics are more modern approaches and easier to integrate with external monitoring systems. If it makes sense, maybe /health can be extended in the future to check logging health, since a user can configure logging paths via LOGGING_CONFIG_PATH?
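
For anyone finding this later, both endpoints are plain HTTP GETs against the apiserver (the host port depends on your mapping; 8081 -> 8080 is the common default from the docs):

curl -s http://localhost:8081/health
curl -s http://localhost:8081/metrics | head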

lmrts avatar Jun 05 '25 13:06 lmrts

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions[bot] avatar Oct 29 '25 10:10 github-actions[bot]