
Defunct java processes after upgrading to 23.10.1

Open linsomniac opened this issue 8 months ago • 4 comments

Environment

self-hosted (https://develop.sentry.dev/self-hosted/)

Steps to Reproduce

  1. Upgraded from 23.6.2 to 23.10.1.
  2. Ran "ps awwlx | grep defunct".
  3. Several java processes are now showing as defunct.
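The check above can be made a bit more targeted. This generic sketch (not from the original report) lists zombie processes together with their parent PID, which identifies the process that is failing to reap them, and counts them for a monitoring check:

```shell
# List zombie (defunct) processes with their PID, parent PID, and name.
# The PPID column points at the process not reaping its children.
ps -eo stat=,pid=,ppid=,comm= | awk '$1 ~ /^Z/ { print $2, $3, $4 }'

# Count zombies (grep -c exits nonzero on zero matches, hence the || true):
zombie_count=$(ps -eo stat= | grep -c '^Z' || true)
echo "zombies: ${zombie_count}"
```

In the output quoted below, the parent of every zombie is the Kafka broker JVM itself.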

Expected Result

No defunct java processes.

Actual Result

We have two self-hosted Sentry instances running via docker compose, and after our upgrade to 23.10.1 our monitoring on the host started reporting defunct processes. These appear to be Kafka-related processes.

One system:

# ps awwlx | grep 1079277
0     0 1000210  999907  20   0   9032   720 pipe_w S+   pts/0      0:00 grep --color=auto 1079277
4     0 1079277 1078583  20   0 5880368 641220 futex_ Ssl ?       2060:08 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../share/java/kafka/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/* io.confluent.support.metrics.SupportedKafka /etc/kafka/kafka.properties
4     0 1092665 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1096955 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1098576 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1100465 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1102716 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1104822 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1106544 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1110312 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>

The other system:

# ps -awwlx | grep 2554
4     0    2554    2466  20   0 7902972 1255820 futex_ Ssl ?      1706:40 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../share/java/kafka/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/* io.confluent.support.metrics.SupportedKafka /etc/kafka/kafka.properties
4     0    6881    2554  20   0      0     0 -      Z    ?          0:05 [java] <defunct>
0     0   19774    2554  20   0      0     0 -      Z    ?          0:05 [java] <defunct>
0     0   20656    2554  20   0      0     0 -      Z    ?          0:04 [java] <defunct>
0     0   21318    2554  20   0      0     0 -      Z    ?          0:05 [java] <defunct>
0     0 2900197 2900049  20   0   9032   720 pipe_w S+   pts/0      0:00 grep --color=auto 2554

I have a 30MB log file from /var/log/kafka in the kafka container but don't see anything particularly interesting in it. Here's a snippet:

OpenJDK 64-Bit Server VM (25.222-b10) for linux-amd64 JRE (Zulu 8.40.0.25-CA-linux64) (1.8.0_222-b10), built on Jul 11 2019 11:36:39 by "zulu_re" with gcc 4.4.7 20120313 (Red Hat 4.4.7-3)
Memory: 4k page, physical 16380436k(7895772k free), swap 1046504k(1046504k free)
CommandLine flags: -XX:+ExplicitGCInvokesConcurrent -XX:GCLogFileSize=104857600 -XX:InitialHeapSize=1073741824 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ManagementServer -XX:MaxGCPauseMillis=20 -XX:MaxHeapSize=1073741824 -XX:MaxInlineLevel=15 -XX:NumberOfGCLogFiles=10 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseGCLogFileRotation
2023-10-27T21:01:25.667+0000: 3.328: [GC pause (G1 Evacuation Pause) (young), 0.0408966 secs]
   [Parallel Time: 16.8 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 3330.2, Avg: 3334.9, Max: 3342.1, Diff: 11.9]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.8, Max: 3.6, Diff: 3.6, Sum: 6.3]
      [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Processed Buffers: Min: 0, Avg: 0.0, Max: 0, Diff: 0, Sum: 0]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.5, Diff: 0.5, Sum: 1.0]
      [Object Copy (ms): Min: 0.0, Avg: 6.2, Max: 9.9, Diff: 9.9, Sum: 49.7]
      [Termination (ms): Min: 0.0, Avg: 1.3, Max: 4.6, Diff: 4.6, Sum: 10.2]
         [Termination Attempts: Min: 1, Avg: 7.8, Max: 16, Diff: 15, Sum: 62]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
      [GC Worker Total (ms): Min: 0.9, Avg: 8.4, Max: 12.7, Diff: 11.8, Sum: 67.6]
      [GC Worker End (ms): Min: 3342.9, Avg: 3343.3, Max: 3346.7, Diff: 3.8]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.4 ms]
   [Other: 23.6 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 20.4 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.3 ms]
      [Humongous Register: 2.2 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.2 ms]
   [Eden: 52224.0K(52224.0K)->0.0B(45056.0K) Survivors: 0.0B->7168.0K Heap: 52224.0K(1024.0M)->9408.0K(1024.0M)]
 [Times: user=0.07 sys=0.02, real=0.04 secs]
2023-10-27T21:01:25.963+0000: 3.623: [GC pause (Metadata GC Threshold) (young) (initial-mark), 0.0404999 secs]
   [Parallel Time: 23.6 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 3623.8, Avg: 3624.8, Max: 3625.4, Diff: 1.7]
      [Ext Root Scanning (ms): Min: 0.2, Avg: 4.1, Max: 9.3, Diff: 9.2, Sum: 32.9]
      [Update RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 0.5]
         [Processed Buffers: Min: 0, Avg: 0.2, Max: 1, Diff: 1, Sum: 2]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.9, Diff: 0.9, Sum: 0.9]
      [Object Copy (ms): Min: 0.0, Avg: 2.6, Max: 5.6, Diff: 5.6, Sum: 20.6]
      [Termination (ms): Min: 0.0, Avg: 4.6, Max: 16.2, Diff: 16.2, Sum: 36.5]
         [Termination Attempts: Min: 1, Avg: 29.9, Max: 91, Diff: 90, Sum: 239]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [GC Worker Total (ms): Min: 9.3, Avg: 11.4, Max: 22.0, Diff: 12.7, Sum: 91.6]
      [GC Worker End (ms): Min: 3634.7, Avg: 3636.2, Max: 3646.8, Diff: 12.0]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 4.6 ms]
   [Other: 12.3 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 11.0 ms]

Everything seems to be working fine; there are just a few defunct processes. It has been running for two weeks since the update with no problems and no additional defunct processes.

Product Area

Other

Link

No response

DSN

No response

Version

23.10.1

linsomniac avatar Nov 13 '23 12:11 linsomniac

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] avatar Nov 13 '23 12:11 getsantry[bot]

Thanks for bringing this up. We seem to be seeing this as well, but will need to take time to investigate. Going to backlog this.

hubertdeng123 avatar Nov 14 '23 21:11 hubertdeng123

I've seen this on 23.11.2 as well. It seems to be triggered by too many processes fighting for resources during startup; throwing more CPU cores at the VM results in fewer zombies after startup.

e100 avatar Dec 08 '23 21:12 e100

$ ps aux | grep 'Z'
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       89379  1.0  0.0      0     0 ?        Z    20:35   0:03 [java] <defunct>
root      100867  1.1  0.0      0     0 ?        Z    20:35   0:03 [java] <defunct>
root      101220  1.2  0.0      0     0 ?        Z    20:35   0:03 [java] <defunct>

Same issue after updating to 23.11.2. Restarting or rebooting does not help.

le0pard avatar Jan 17 '24 20:01 le0pard

Same issue here. It seems that the zombie processes are related to io.confluent.support.metrics.SupportedKafka, and the number of such processes grows over time.

ps axf output:

2332779 ?        Ssl    1:50  \_ java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../share/java/kafka/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/* io.confluent.support.metrics.SupportedKafka /etc/kafka/kafka.properties
2333049 ?        Z      0:08  |   \_ [java] <defunct>
2339805 ?        Z      0:03  |   \_ [java] <defunct>
2363937 ?        Z      0:03  |   \_ [java] <defunct>
2366252 ?        Z      0:13  |   \_ [java] <defunct>
2366672 ?        Z      0:13  |   \_ [java] <defunct>
2367186 ?        Z      0:12  |   \_ [java] <defunct>
2367759 ?        Z      0:10  |   \_ [java] <defunct>
2369610 ?        Z      0:09  |   \_ [java] <defunct>
2374115 ?        Z      0:07  |   \_ [java] <defunct>
2374789 ?        Z      0:05  |   \_ [java] <defunct>
2375282 ?        Z      0:05  |   \_ [java] <defunct>
2375845 ?        Z      0:03  |   \_ [java] <defunct>
2376271 ?        Z      0:03  |   \_ [java] <defunct>
2377949 ?        Z      0:02  |   \_ [java] <defunct>
2383236 ?        Z      0:03  |   \_ [java] <defunct>
2388878 ?        Z      0:13  |   \_ [java] <defunct>
2390177 ?        Z      0:15  |   \_ [java] <defunct>
2392281 ?        Z      0:12  |   \_ [java] <defunct>
2395285 ?        Z      0:13  |   \_ [java] <defunct>
2396333 ?        Z      0:11  |   \_ [java] <defunct>
2396946 ?        Z      0:09  |   \_ [java] <defunct>
2397393 ?        Z      0:07  |   \_ [java] <defunct>
2398811 ?        Z      0:05  |   \_ [java] <defunct>
2400486 ?        Z      0:04  |   \_ [java] <defunct>
2403333 ?        Z      0:03  |   \_ [java] <defunct>
2404875 ?        Z      0:02  |   \_ [java] <defunct>
2405871 ?        Z      0:02  |   \_ [java] <defunct>
2416303 ?        Z      0:14  |   \_ [java] <defunct>
2418205 ?        Z      0:15  |   \_ [java] <defunct>

mkhon avatar Feb 26 '24 15:02 mkhon

Yes, this week Sentry stopped accepting errors and transactions. On the server I see:

There are 37 zombie processes.

So this is a very serious issue. Sentry is the latest version, 24.2.0.

le0pard avatar Mar 11 '24 10:03 le0pard

Same here on Sentry 24.2.0.

chris-laack avatar Mar 11 '24 10:03 chris-laack

OK, it looks like I found where this issue comes from. In docker-compose.yml you need to replace:

test: ["CMD-SHELL", "/usr/bin/kafka-topics --bootstrap-server kafka:9092 --list"]

with the old one:

test: ["CMD-SHELL", "nc -z localhost 9092"]

After this change, starting docker compose no longer generates defunct java processes.

Change introduced by https://github.com/getsentry/self-hosted/commit/69f973a30a3153ac73731377eabf469e298fd562
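For context, the healthcheck lives under the kafka service in docker-compose.yml. A sketch of the reverted block (the surrounding keys and the interval/timeout/retries values are assumptions, not copied from the repository):

```yaml
  kafka:
    # ...
    healthcheck:
      # The check introduced in 23.10.x spawned a full JVM on every probe:
      #   test: ["CMD-SHELL", "/usr/bin/kafka-topics --bootstrap-server kafka:9092 --list"]
      # Reverting to the lightweight TCP probe avoids the defunct java children:
      test: ["CMD-SHELL", "nc -z localhost 9092"]
      interval: 10s
      timeout: 10s
      retries: 30
```

A plausible explanation (not confirmed in this thread): in the ps output above, each zombie's parent is the broker JVM, which runs as PID 1 inside the kafka container. A java child spawned by the kafka-topics healthcheck that is orphaned on every probe would be reparented to that JVM, which never calls wait() on children it did not create.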

le0pard avatar Mar 14 '24 16:03 le0pard

Great find @le0pard; if that's the case, we should revert the change introduced above.

hubertdeng123 avatar Mar 15 '24 21:03 hubertdeng123

@chris-laack @mkhon @linsomniac Could you try changing your docker compose healthcheck for kafka to

test: ["CMD-SHELL", "nc -z localhost 9092"]

hubertdeng123 avatar Mar 15 '24 21:03 hubertdeng123

@hubertdeng123 I've just come back from vacation. I'll try it out today and let you know.

chris-laack avatar Mar 19 '24 07:03 chris-laack

Works for us.

chris-laack avatar Mar 20 '24 12:03 chris-laack

I've put up a PR to address this.

hubertdeng123 avatar Mar 20 '24 22:03 hubertdeng123