Defunct java processes after upgrading to 23.10.1
Environment
self-hosted (https://develop.sentry.dev/self-hosted/)
Steps to Reproduce
- Upgraded from 23.6.2 to 23.10.1.
- Ran "ps awwlx | grep defunct".
- Several java processes are now showing as defunct.
Expected Result
No defunct java processes.
Actual Result
We have two self-hosted Sentry installations running with Docker Compose, and after upgrading to 23.10.1 the host monitoring on both started reporting defunct processes. They appear to be Kafka-related processes.
One system:
# ps awwlx | grep 1079277
0 0 1000210 999907 20 0 9032 720 pipe_w S+ pts/0 0:00 grep --color=auto 1079277
4 0 1079277 1078583 20 0 5880368 641220 futex_ Ssl ? 2060:08 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../share/java/kafka/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/* io.confluent.support.metrics.SupportedKafka /etc/kafka/kafka.properties
4 0 1092665 1079277 20 0 0 0 - Z ? 0:03 [java] <defunct>
0 0 1096955 1079277 20 0 0 0 - Z ? 0:03 [java] <defunct>
0 0 1098576 1079277 20 0 0 0 - Z ? 0:03 [java] <defunct>
0 0 1100465 1079277 20 0 0 0 - Z ? 0:03 [java] <defunct>
0 0 1102716 1079277 20 0 0 0 - Z ? 0:03 [java] <defunct>
0 0 1104822 1079277 20 0 0 0 - Z ? 0:03 [java] <defunct>
0 0 1106544 1079277 20 0 0 0 - Z ? 0:03 [java] <defunct>
0 0 1110312 1079277 20 0 0 0 - Z ? 0:03 [java] <defunct>
The other system:
# ps -awwlx | grep 2554
4 0 2554 2466 20 0 7902972 1255820 futex_ Ssl ? 1706:40 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../share/java/kafka/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/* io.confluent.support.metrics.SupportedKafka /etc/kafka/kafka.properties
4 0 6881 2554 20 0 0 0 - Z ? 0:05 [java] <defunct>
0 0 19774 2554 20 0 0 0 - Z ? 0:05 [java] <defunct>
0 0 20656 2554 20 0 0 0 - Z ? 0:04 [java] <defunct>
0 0 21318 2554 20 0 0 0 - Z ? 0:05 [java] <defunct>
0 0 2900197 2900049 20 0 9032 720 pipe_w S+ pts/0 0:00 grep --color=auto 2554
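A quick way to list only the zombies together with their parent PID, so they can be matched against the Kafka JVM PIDs above, is something like:
# list zombie processes with their PID, parent PID, state, and command name
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'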
I have a 30MB log file from /var/log/kafka in the kafka container but don't see anything particularly interesting in it. Here's a snippet:
OpenJDK 64-Bit Server VM (25.222-b10) for linux-amd64 JRE (Zulu 8.40.0.25-CA-linux64) (1.8.0_222-b10), built on Jul 11 2019 11:36:39 by "zulu_re" with gcc 4.4.7 20120313 (Red Hat 4.4.7-3)
Memory: 4k page, physical 16380436k(7895772k free), swap 1046504k(1046504k free)
CommandLine flags: -XX:+ExplicitGCInvokesConcurrent -XX:GCLogFileSize=104857600 -XX:InitialHeapSize=1073741824 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ManagementServer -XX:MaxGCPauseMillis=20 -XX:MaxHeapSize=1073741824 -XX:MaxInlineLevel=15 -XX:NumberOfGCLogFiles=10 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseGCLogFileRotation
2023-10-27T21:01:25.667+0000: 3.328: [GC pause (G1 Evacuation Pause) (young), 0.0408966 secs]
[Parallel Time: 16.8 ms, GC Workers: 8]
[GC Worker Start (ms): Min: 3330.2, Avg: 3334.9, Max: 3342.1, Diff: 11.9]
[Ext Root Scanning (ms): Min: 0.0, Avg: 0.8, Max: 3.6, Diff: 3.6, Sum: 6.3]
[Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Processed Buffers: Min: 0, Avg: 0.0, Max: 0, Diff: 0, Sum: 0]
[Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.5, Diff: 0.5, Sum: 1.0]
[Object Copy (ms): Min: 0.0, Avg: 6.2, Max: 9.9, Diff: 9.9, Sum: 49.7]
[Termination (ms): Min: 0.0, Avg: 1.3, Max: 4.6, Diff: 4.6, Sum: 10.2]
[Termination Attempts: Min: 1, Avg: 7.8, Max: 16, Diff: 15, Sum: 62]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
[GC Worker Total (ms): Min: 0.9, Avg: 8.4, Max: 12.7, Diff: 11.8, Sum: 67.6]
[GC Worker End (ms): Min: 3342.9, Avg: 3343.3, Max: 3346.7, Diff: 3.8]
[Code Root Fixup: 0.1 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.4 ms]
[Other: 23.6 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 20.4 ms]
[Ref Enq: 0.0 ms]
[Redirty Cards: 0.3 ms]
[Humongous Register: 2.2 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 0.2 ms]
[Eden: 52224.0K(52224.0K)->0.0B(45056.0K) Survivors: 0.0B->7168.0K Heap: 52224.0K(1024.0M)->9408.0K(1024.0M)]
[Times: user=0.07 sys=0.02, real=0.04 secs]
2023-10-27T21:01:25.963+0000: 3.623: [GC pause (Metadata GC Threshold) (young) (initial-mark), 0.0404999 secs]
[Parallel Time: 23.6 ms, GC Workers: 8]
[GC Worker Start (ms): Min: 3623.8, Avg: 3624.8, Max: 3625.4, Diff: 1.7]
[Ext Root Scanning (ms): Min: 0.2, Avg: 4.1, Max: 9.3, Diff: 9.2, Sum: 32.9]
[Update RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 0.5]
[Processed Buffers: Min: 0, Avg: 0.2, Max: 1, Diff: 1, Sum: 2]
[Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.9, Diff: 0.9, Sum: 0.9]
[Object Copy (ms): Min: 0.0, Avg: 2.6, Max: 5.6, Diff: 5.6, Sum: 20.6]
[Termination (ms): Min: 0.0, Avg: 4.6, Max: 16.2, Diff: 16.2, Sum: 36.5]
[Termination Attempts: Min: 1, Avg: 29.9, Max: 91, Diff: 90, Sum: 239]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[GC Worker Total (ms): Min: 9.3, Avg: 11.4, Max: 22.0, Diff: 12.7, Sum: 91.6]
[GC Worker End (ms): Min: 3634.7, Avg: 3636.2, Max: 3646.8, Diff: 12.0]
[Code Root Fixup: 0.1 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 4.6 ms]
[Other: 12.3 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 11.0 ms]
Everything seems to be working fine; there are just a few defunct processes. It has been running for two weeks since the update with no problems and no additional defunct processes.
Product Area
Other
Link
No response
DSN
No response
Version
23.10.1
Assigning to @getsentry/support for routing ⏲️
Thanks for bringing this up. We seem to be seeing this as well, but will need to take time to investigate. Going to backlog this.
I've seen this on 23.11.2 also. Seems to be triggered by too many processes fighting for resources during startup. Throwing more CPU cores at the VM results in fewer zombies after startup.
$ ps aux | grep 'Z'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 89379 1.0 0.0 0 0 ? Z 20:35 0:03 [java] <defunct>
root 100867 1.1 0.0 0 0 ? Z 20:35 0:03 [java] <defunct>
root 101220 1.2 0.0 0 0 ? Z 20:35 0:03 [java] <defunct>
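To track whether the zombie count keeps growing after startup, something like this can be run periodically:
# count processes currently in the zombie (Z) state
ps -eo stat | grep -c '^Z'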
Same issue after updating to 23.11.2. Restarting or rebooting does not help.
Same issue here. The zombie processes seem to be related to io.confluent.support.metrics.SupportedKafka, and their number grows over time.
ps axf output:
2332779 ? Ssl 1:50 \_ java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokes
Concurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -X
X:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxr
emote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp
/usr/bin/../share/java/kafka/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java
/support-metrics-client/* io.confluent.support.metrics.SupportedKafka /etc/kafka/kafka.properties
2333049 ? Z 0:08 | \_ [java] <defunct>
2339805 ? Z 0:03 | \_ [java] <defunct>
2363937 ? Z 0:03 | \_ [java] <defunct>
2366252 ? Z 0:13 | \_ [java] <defunct>
2366672 ? Z 0:13 | \_ [java] <defunct>
2367186 ? Z 0:12 | \_ [java] <defunct>
2367759 ? Z 0:10 | \_ [java] <defunct>
2369610 ? Z 0:09 | \_ [java] <defunct>
2374115 ? Z 0:07 | \_ [java] <defunct>
2374789 ? Z 0:05 | \_ [java] <defunct>
2375282 ? Z 0:05 | \_ [java] <defunct>
2375845 ? Z 0:03 | \_ [java] <defunct>
2376271 ? Z 0:03 | \_ [java] <defunct>
2377949 ? Z 0:02 | \_ [java] <defunct>
2383236 ? Z 0:03 | \_ [java] <defunct>
2388878 ? Z 0:13 | \_ [java] <defunct>
2390177 ? Z 0:15 | \_ [java] <defunct>
2392281 ? Z 0:12 | \_ [java] <defunct>
2395285 ? Z 0:13 | \_ [java] <defunct>
2396333 ? Z 0:11 | \_ [java] <defunct>
2396946 ? Z 0:09 | \_ [java] <defunct>
2397393 ? Z 0:07 | \_ [java] <defunct>
2398811 ? Z 0:05 | \_ [java] <defunct>
2400486 ? Z 0:04 | \_ [java] <defunct>
2403333 ? Z 0:03 | \_ [java] <defunct>
2404875 ? Z 0:02 | \_ [java] <defunct>
2405871 ? Z 0:02 | \_ [java] <defunct>
2416303 ? Z 0:14 | \_ [java] <defunct>
2418205 ? Z 0:15 | \_ [java] <defunct>
Yes, this week Sentry stopped accepting errors and transactions. On the server I see:
There are 37 zombie processes.
So this is a very serious issue. Sentry is on the latest version, 24.2.0.
Same here on Sentry 24.2.0.
OK, it looks like I found where this issue comes from. In docker-compose.yml you need to replace:
test: ["CMD-SHELL", "/usr/bin/kafka-topics --bootstrap-server kafka:9092 --list"]
with the old one:
test: ["CMD-SHELL", "nc -z localhost 9092"]
After this change, starting Docker Compose no longer generates defunct java processes.
Change introduced by https://github.com/getsentry/self-hosted/commit/69f973a30a3153ac73731377eabf469e298fd562
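For anyone applying this by hand, here is roughly where that line lives in the kafka service of docker-compose.yml; the interval/timeout/retries values are only illustrative assumptions, not copied from the shipped file:
  kafka:
    # ...
    healthcheck:
      # revert to the old port check instead of spawning kafka-topics (a JVM) on every probe
      test: ["CMD-SHELL", "nc -z localhost 9092"]
      # interval/timeout/retries below are illustrative, not from the original compose file
      interval: 10s
      timeout: 10s
      retries: 30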
Super good find @le0pard; if that's the case, we should revert the change introduced above.
@chris-laack @mkhon @linsomniac Could you try changing your docker compose healthcheck for kafka to:
test: ["CMD-SHELL", "nc -z localhost 9092"]
@hubertdeng123 I've just come back from vacation. I'll try it out today and let you know.
Works for us.
I've put up a PR to address this.