dd-trace-java icon indicating copy to clipboard operation
dd-trace-java copied to clipboard

Live-heap profiler (alpha) in dd-trace-java v1.33.0 crashes the JVM. (After updated from v1.32.0)

Open stevenshim opened this issue 1 year ago • 6 comments

Agent Version dd-java-agent:1.33.0

Describe what happened The JVM intermittently crashes and logs the following error.

Additional environment details Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.6.10.1 AWS EKS 1.25

Log

2024-04-25T16:52:49+09:00 #
2024-04-25T16:52:49+09:00 # A fatal error has been detected by the Java Runtime Environment:
2024-04-25T16:52:49+09:00 #
2024-04-25T16:52:49+09:00 #  SIGSEGV (0xb) at pc=0x00007f9fba87a340, pid=1, tid=50
2024-04-25T16:52:49+09:00 #
2024-04-25T16:52:49+09:00 # JRE version: OpenJDK Runtime Environment Corretto-17.0.6.10.1 (17.0.6+10) (build 17.0.6+10-LTS)
2024-04-25T16:52:49+09:00 # Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.6.10.1 (17.0.6+10-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
2024-04-25T16:52:49+09:00 # Problematic frame:
2024-04-25T16:52:49+09:00 # C  [libjavaProfiler8298632684405129428.so+0x40340]  Profiler::recordExternalSample(unsigned long long, int, _jvmtiFrameInfo*, int, bool, int, Event*)+0xb0
2024-04-25T16:52:49+09:00 #
2024-04-25T16:52:49+09:00 # Core dump will be written. Default location: /opt/goods-detail-api/core.1
2024-04-25T16:52:49+09:00 #
2024-04-25T16:52:49+09:00 # JFR recording file will be written. Location: /opt/goods-detail-api/hs_err_pid1.jfr
2024-04-25T16:52:49+09:00 #
2024-04-25T16:52:49+09:00 # An error report file with more information is saved as:
2024-04-25T16:52:49+09:00 # /opt/goods-detail-api/hs_err_pid1.log
2024-04-25T16:52:50+09:00 #
2024-04-25T16:52:50+09:00 # If you would like to submit a bug report, please visit:
2024-04-25T16:52:50+09:00 #   https://github.com/corretto/corretto-17/issues/
2024-04-25T16:52:50+09:00 # The crash happened outside the Java Virtual Machine in native code.
2024-04-25T16:52:50+09:00 # See problematic frame for where to report the bug.
2024-04-25T16:52:50+09:00 #
2024-04-25T16:52:50+09:00 
2024-04-25T16:52:50+09:00 [error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007fa02d41723b]
2024-04-25T16:52:50+09:00 

Hello, DataDog teams. I have been facing java process crashing after update dd-agent from 1.32.0 to 1.33.0.

I guess this prevention (https://github.com/DataDog/java-profiler/pull/84) might cause SIGSEGV. I already opened a issue on java-profiler project. (https://github.com/DataDog/java-profiler/issues/87)

I'm always grateful.

stevenshim avatar Apr 25 '24 14:04 stevenshim

Hi @stevenshim - apologies for the inconvenience caused here and thanks for reporting the error. Could you give a little more information please?

  1. the stack trace from /opt/goods-detail-api/hs_err_pid1.log and the siginfo line just below it
  2. your Datadog profiling config flags

richardstartin avatar Apr 25 '24 15:04 richardstartin

@stevenshim I suspect that you must be running the live heap profiler (DD_PROFILING_DDPROF_LIVEHEAP_ENABLED: true or -Ddd.profiling.ddprof.liveheap.enabled=true) - disabling this should be enough until we have a fix.

richardstartin avatar Apr 26 '24 09:04 richardstartin

Hello @stevenshim - the issue was identified and a fix is in progress (https://github.com/DataDog/java-profiler/pull/88)

Thanks again for reporting this!

jbachorik avatar Apr 26 '24 17:04 jbachorik

Is that issue happening only when Ddd.profiling.ddprof.liveheap.enabled=true? I've got 2 JVM crashes a few hours after upgrading to 1.33.0 that also say The crash happened outside the Java Virtual Machine in native code. (there's not much more detail though), but we didn't enable this flag.

ghostdogpr avatar May 09 '24 00:05 ghostdogpr

I was able to replicate and save a crash dump. Maybe it's a different issue.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f5d72c0a43b, pid=1, tid=165464
#
# JRE version: OpenJDK Runtime Environment Temurin-21.0.3+9 (21.0.3+9) (build 21.0.3+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM Temurin-21.0.3+9 (21.0.3+9-LTS, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# C  [ld-musl-x86_64.so.1+0x5943b]  strlen+0xd
#
# Core dump will be written. Default location: //core.1
#
# JFR recording file will be written. Location: //hs_err_pid1.jfr
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  S U M M A R Y ------------

Command Line: -XX:MaxRAMPercentage=70.0 -XX:InitialRAMPercentage=70.0 -XX:-UseCompressedOops -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/coredump -XX:OnError=/app/scripts-deploy/jvm-crash-handler.sh -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=9900 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.rmi.port=9900 -Djava.rmi.server.hostname=172.23.81.106 -javaagent:/app/external/dd-java-agent.jar -Ddd.profiling.enabled=true -Ddd.integration.zio.experimental.enabled=true -XX:+PrintCodeCache -XX:ProfiledCodeHeapSize=512m -XX:NonProfiledCodeHeapSize=1024m -XX:NonNMethodCodeHeapSize=32m -XX:ReservedCodeCacheSize=1568m -XX:+UseZGC -XX:+ZGenerational -Duser.timezone=Asia/Seoul -Dio.netty.allocator.numHeapArenas=0 -Ddd.integration.throwables.enabled=false --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.sql/java.sql=ALL-UNNAMED com.devsisters.ck.CkMain

Host: Intel(R) Xeon(R) Platinum 8488C, 16 cores, 110G, Alpine Linux v3.19
Time: Thu May  9 13:38:28 2024 KST elapsed time: 76054.516656 seconds (0d 21h 7m 34s)

---------------  T H R E A D  ---------------

Current thread (0x00007f571bf42040):  JavaThread "SIGTERM handler" daemon [_thread_in_native, id=165464, stack(0x00007f56bd317000,0x00007f56bd417a78) (1026K)]

Stack: [0x00007f56bd317000,0x00007f56bd417a78],  sp=0x00007f56bd416748,  free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [ld-musl-x86_64.so.1+0x5943b]  strlen+0xd
C  [libjavaProfiler11068430143537826157.so+0x29217]  Recording::writeSettings(Buffer*, Arguments&)+0xd7
C  [libjavaProfiler11068430143537826157.so+0x35a49]  Recording::switchChunk(int)+0xf9
C  [libjavaProfiler11068430143537826157.so+0x35b2f]  FlightRecorder::dump(char const*, int)+0x6f
C  [libjavaProfiler11068430143537826157.so+0x457d0]  Profiler::dump(char const*, int)+0x200
C  [libjavaProfiler11068430143537826157.so+0x245c7]  Java_com_datadoghq_profiler_JavaProfiler_dump0+0x47
J 323292  com.datadoghq.profiler.JavaProfiler.dump0(Ljava/lang/String;)V (0 bytes) @ 0x00007f5d31c718ff [0x00007f5d31c71820+0x00000000000000df]
J 323289 c1 com.datadog.profiling.ddprof.DatadogProfilerRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Ldatadog/trace/api/profiling/RecordingData; (44 bytes) @ 0x00007f5d0b178be4 [0x00007f5d0b1788a0+0x0000000000000344]
J 323287 c1 com.datadog.profiling.controller.ddprof.DatadogProfilerOngoingRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Ldatadog/trace/api/profiling/RecordingData; (16 bytes) @ 0x00007f5d0a72d8d4 [0x00007f5d0a72d6e0+0x00000000000001f4]
J 260563 c1 com.datadog.profiling.agent.CompositeController$CompositeOngoingRecording$$Lambda+0x00007f58c81d9338.apply(Ljava/lang/Object;)Ljava/lang/Object; (16 bytes) @ 0x00007f5d1469d84c [0x00007f5d1469d5e0+0x000000000000026c]
J 4274 c2 java.util.stream.ReferencePipeline$3$1.accept(Ljava/lang/Object;)V [email protected] (23 bytes) @ 0x00007f5d28ec3d18 [0x00007f5d28ec3ca0+0x0000000000000078]
J 47514 c2 java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Ljava/util/function/Consumer;)V [email protected] (127 bytes) @ 0x00007f5d2b963e38 [0x00007f5d2b963d40+0x00000000000000f8]
J 32061 c2 java.util.stream.ReferencePipeline.collect(Ljava/util/stream/Collector;)Ljava/lang/Object; [email protected] (124 bytes) @ 0x00007f5d2a9254ec [0x00007f5d2a925180+0x000000000000036c]
J 323261 c1 com.datadog.profiling.agent.CompositeController$CompositeOngoingRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Ldatadog/trace/api/profiling/RecordingData; (12 bytes) @ 0x00007f5d0a4c4754 [0x00007f5d0a4c4000+0x0000000000000754]
J 323260 c1 com.datadog.profiling.controller.ProfilingSystem$SnapshotRecording.snapshot(Z)V (144 bytes) @ 0x00007f5d07f3b784 [0x00007f5d07f3b540+0x0000000000000244]
j  com.datadog.profiling.controller.ProfilingSystem.shutdown(Z)V+29
j  com.datadog.profiling.agent.ProfilingAgent.shutdown(Lcom/datadog/profiling/controller/ProfilingSystem;Lcom/datadog/profiling/uploader/ProfileUploader;Z)V+17
j  com.datadog.profiling.agent.ProfilingAgent.shutdown(Z)V+7
j  java.lang.invoke.LambdaForm$DMH+0x00007f58cc09e800.invokeStatic(Ljava/lang/Object;I)V+10 [email protected]
j  java.lang.invoke.LambdaForm$MH+0x00007f58cc09f800.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 [email protected]
J 45127 c2 java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; [email protected] (108 bytes) @ 0x00007f5d2b6454b8 [0x00007f5d2b644f80+0x0000000000000538]
j  datadog.trace.bootstrap.Agent.shutdownProfilingAgent(Z)V+64
j  datadog.trace.bootstrap.Agent.shutdown(Z)V+15
j  datadog.trace.bootstrap.instrumentation.shutdown.ShutdownHelper.shutdownAgent()V+11
j  java.lang.Shutdown.runHooks()V+0 [email protected]
j  java.lang.Shutdown.exit(I)V+12 [email protected]
j  java.lang.Terminator$1.handle(Ljdk/internal/misc/Signal;)V+8 [email protected]
j  jdk.internal.misc.Signal$1.run()V+8 [email protected]
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 [email protected]
j  java.lang.Thread.run()V+19 [email protected]
v  ~StubRoutines::call_stub 0x00007f5d26ccbcbf
V  [libjvm.so+0x912765]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x2e5
V  [libjvm.so+0x9140ab]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x1cb
V  [libjvm.so+0x9ee64a]  thread_entry(JavaThread*, JavaThread*)+0x8a
V  [libjvm.so+0x928ccf]  JavaThread::thread_main_inner() [clone .part.0]+0xaf
V  [libjvm.so+0xf82c98]  Thread::call_run()+0xa8
V  [libjvm.so+0xd0b23a]  thread_native_entry(Thread*)+0x12a
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 323292  com.datadoghq.profiler.JavaProfiler.dump0(Ljava/lang/String;)V (0 bytes) @ 0x00007f5d31c71887 [0x00007f5d31c71820+0x0000000000000067]
J 323289 c1 com.datadog.profiling.ddprof.DatadogProfilerRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Ldatadog/trace/api/profiling/RecordingData; (44 bytes) @ 0x00007f5d0b178be4 [0x00007f5d0b1788a0+0x0000000000000344]
J 323287 c1 com.datadog.profiling.controller.ddprof.DatadogProfilerOngoingRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Ldatadog/trace/api/profiling/RecordingData; (16 bytes) @ 0x00007f5d0a72d8d4 [0x00007f5d0a72d6e0+0x00000000000001f4]
J 260563 c1 com.datadog.profiling.agent.CompositeController$CompositeOngoingRecording$$Lambda+0x00007f58c81d9338.apply(Ljava/lang/Object;)Ljava/lang/Object; (16 bytes) @ 0x00007f5d1469d84c [0x00007f5d1469d5e0+0x000000000000026c]
J 4274 c2 java.util.stream.ReferencePipeline$3$1.accept(Ljava/lang/Object;)V [email protected] (23 bytes) @ 0x00007f5d28ec3d18 [0x00007f5d28ec3ca0+0x0000000000000078]
J 47514 c2 java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Ljava/util/function/Consumer;)V [email protected] (127 bytes) @ 0x00007f5d2b963e38 [0x00007f5d2b963d40+0x00000000000000f8]
J 32061 c2 java.util.stream.ReferencePipeline.collect(Ljava/util/stream/Collector;)Ljava/lang/Object; [email protected] (124 bytes) @ 0x00007f5d2a9254ec [0x00007f5d2a925180+0x000000000000036c]
J 323261 c1 com.datadog.profiling.agent.CompositeController$CompositeOngoingRecording.snapshot(Ljava/time/Instant;Ldatadog/trace/api/profiling/ProfilingSnapshot$Kind;)Ldatadog/trace/api/profiling/RecordingData; (12 bytes) @ 0x00007f5d0a4c4754 [0x00007f5d0a4c4000+0x0000000000000754]
J 323260 c1 com.datadog.profiling.controller.ProfilingSystem$SnapshotRecording.snapshot(Z)V (144 bytes) @ 0x00007f5d07f3b784 [0x00007f5d07f3b540+0x0000000000000244]
j  com.datadog.profiling.controller.ProfilingSystem.shutdown(Z)V+29
j  com.datadog.profiling.agent.ProfilingAgent.shutdown(Lcom/datadog/profiling/controller/ProfilingSystem;Lcom/datadog/profiling/uploader/ProfileUploader;Z)V+17
j  com.datadog.profiling.agent.ProfilingAgent.shutdown(Z)V+7
j  java.lang.invoke.LambdaForm$DMH+0x00007f58cc09e800.invokeStatic(Ljava/lang/Object;I)V+10 [email protected]
j  java.lang.invoke.LambdaForm$MH+0x00007f58cc09f800.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 [email protected]
J 45127 c2 java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; [email protected] (108 bytes) @ 0x00007f5d2b6454b8 [0x00007f5d2b644f80+0x0000000000000538]
j  datadog.trace.bootstrap.Agent.shutdownProfilingAgent(Z)V+64
j  datadog.trace.bootstrap.Agent.shutdown(Z)V+15
j  datadog.trace.bootstrap.instrumentation.shutdown.ShutdownHelper.shutdownAgent()V+11
j  java.lang.Shutdown.runHooks()V+0 [email protected]
j  java.lang.Shutdown.exit(I)V+12 [email protected]
j  java.lang.Terminator$1.handle(Ljdk/internal/misc/Signal;)V+8 [email protected]
j  jdk.internal.misc.Signal$1.run()V+8 [email protected]
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 [email protected]
j  java.lang.Thread.run()V+19 [email protected]
v  ~StubRoutines::call_stub 0x00007f5d26ccbcbf

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00007f58030185e4

Registers:
RAX=0x00007f58030185e4, RBX=0x000000000000218a, RCX=0x000000007265746c, RDX=0x0000000000000006
RSP=0x00007f56bd416748, RBP=0x00007f56bd416790, RSI=0x0000000000000006, RDI=0x00007f58030185e4
R8 =0x0000000000002182, R9 =0x0000000000002171, R10=0x0000000000000009, R11=0x0000000000000246
R12=0x00007f57fcd6a010, R13=0x00007f57fcd6a010, R14=0x0000000000002171, R15=0x00007f58030185e4
RIP=0x00007f5d72c0a43b, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

ghostdogpr avatar May 09 '24 05:05 ghostdogpr

Hi @ghostdogpr that looks like another issue, apologies for the inconvenience and thanks for the backtrace. I recommend downgrading until we've figured out the cause (we'll update you).

richardstartin avatar May 09 '24 09:05 richardstartin