
v1.34.0 JVM crash

Open zBart opened this issue 1 year ago • 21 comments

We recently upgraded dd-trace-java from 1.31.2 to 1.34.0 and have gotten a crash. We've only seen this happen once so far.

Not completely sure if this is a Datadog issue or a JVM issue.

Here is the log:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (codeCache.cpp:654), pid=14, tid=430
#  guarantee(is_result_safe || is_in_asgct()) failed: unsafe access to zombie method
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.11.9.1 (17.0.11+9) (build 17.0.11+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.11.9.1 (17.0.11+9-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x5b8d61]  CodeCache::find_blob(void*)+0xc1
#
# The JFR repository may contain useful JFR files. Location: /dumps/2024_05_30_12_10_17_14
#

[...]

---------------  T H R E A D  ---------------

Current thread (0x00007f36d80e94f0):  JavaThread "pool-18-thread-63" [_thread_in_Java, id=430, stack(0x00007f369c4b5000,0x00007f369c5b6000)]

Stack: [0x00007f369c4b5000,0x00007f369c5b6000],  sp=0x00007f369c5b3400,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x5b8d61]  CodeCache::find_blob(void*)+0xc1
V  [libjvm.so+0xeee39b]  JavaThread::pd_get_top_frame_for_signal_handler(frame*, void*, bool)+0x1ab
V  [libjvm.so+0x6ef823]  AsyncGetCallTrace+0x1c3
C  [libjavaProfiler9259657795778516739.so+0x26926]  Profiler::getJavaTraceAsync(void*, ASGCT_CallFrame*, int, StackContext*, bool*)+0x176
C  [libjavaProfiler9259657795778516739.so+0x27811]  Profiler::recordSample(void*, unsigned long long, int, int, Event*)+0x1e1
C  [libjavaProfiler9259657795778516739.so+0x429d2]  WallClock::sharedSignalHandler(int, siginfo_t*, void*)+0x1a2
C  [libpthread.so.0+0x118e0]
C  [linux-vdso.so.1+0xc62]  clock_gettime+0x242
C  [libc.so.6+0xfc426]  __clock_gettime+0x26

Registers:
[...]

zBart avatar Jun 10 '24 09:06 zBart

Hi @zBart did you change anything else when you did the upgrade? Could you open a support ticket so I can get access to the hs_err file please?

richardstartin avatar Jun 10 '24 09:06 richardstartin

We have identified the cause of this bug and are working on releasing a fix

richardstartin avatar Jun 12 '24 14:06 richardstartin

I have similar symptoms. Is it the same cause?

I recently upgraded jdk from 17 to 23.

jdk used: openjdk:23-jdk-slim

dd-java-agent information: /datadog/dd-java-agent.jar 'https://dtdg.co/latest-java-tracer'

Error message:

[thread 52 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
# JRE version: OpenJDK Runtime Environment (23.0+26) (build 23-ea+26-2269)
# Java VM: OpenJDK 64-Bit Server VM (23-ea+26-2269, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
#
# Core dump will be written. Default location: /app/core.7
#
[65.819s][warning][jfr] Unable to create an emergency dump file at the location set by dumppath=/app
# The JFR repository may contain useful JFR files. Location: /tmp/2024_06_14_02_02_54_7
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid7.log
[65.908s][warning][os ] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
Aborted

foameraserblue avatar Jun 14 '24 06:06 foameraserblue

@foameraserblue there's nothing to suggest this was the same issue nor that it is related to the profiler. If you have the backtrace we would be able to determine if it's related or not. Since you're running an early access JDK23 build, it could always be a JVM problem.

richardstartin avatar Jun 14 '24 07:06 richardstartin

We have identified the cause of this bug and are working on releasing a fix

Hey Richard. Great! Do you still need me to send the hs_err file?

zBart avatar Jun 18 '24 07:06 zBart

We had a similar issue recently that affected some of our services running Java 21. The same services, which had been running for months on Java 21 and for years on Java 17 before that, suddenly started crashing repeatedly during startup. We have staging versions of these services which were also crashing, though less reliably.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffffa336da70, pid=1, tid=52
#
# JRE version: OpenJDK Runtime Environment (21.0.3+10) (build 21.0.3+10-LTS)
# Java VM: OpenJDK 64-Bit Server VM (21.0.3+10-LTS, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C  [libpthread.so.0+0xba70]	
#	
# Core dump will be written. Default location: /workdir/core.1
#
# JFR recording file will be written. Location: /workdir/hs_err_pid1.jfr
#
# An error report file with more information is saved as:
# /workdir/hs_err_pid1.log	
[17.998s][warning][os] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
#   https://bell-sw.com/support
#
[error occurred during error reporting (), id 0x5, SIGTRAP (0x5) at pc=0x0000ffffa31fc9c0]

We could not pinpoint a cause. We tried downgrading to Java 17 and had the same issue. Because our services run in AWS Fargate, we were not able to access the error log files.

Luckily we found these past issues, which suggested dd-trace-java as a possible cause:

  • https://github.com/DataDog/dd-trace-java/issues/677
  • https://github.com/DataDog/dd-trace-java/issues/2997
  • https://github.com/DataDog/dd-trace-java/issues/4978
  • https://github.com/DataDog/dd-trace-java/issues/5077
  • https://github.com/DataDog/dd-trace-java/issues/5449
  • https://github.com/DataDog/dd-trace-java/issues/6382
  • https://github.com/DataDog/dd-trace-java/issues/6899
  • https://github.com/DataDog/dd-trace-java/issues/6948

particularly this comment from @richardstartin :

For transparency's sake, since GA, this is now the second time we have had crashes reported related to the serialisation of JFR events in our native profiler after adding new event types, and preventing it from crashing the profiled process again is going to be our top priority in the short term. We have a change in the pipeline to change the behaviour when there is a buffer overflow, which would result in a truncated recording (which we have metrics for so can react to) rather than risk crashing the process or writing to arbitrary memory locations. In the longer term, we will completely rewrite the event serialisation to prioritise safety.
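
(To illustrate the approach described in that comment: a bounds-checked event buffer that truncates the recording on overflow, rather than writing out of bounds, might look like the following minimal C++ sketch. All names here are hypothetical, not the java-profiler API.)

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative sketch only: a bounds-checked event buffer that truncates the
// recording instead of writing past its end.
class EventBuffer {
public:
    explicit EventBuffer(std::size_t capacity)
        : data_(capacity), size_(0), truncated_(false) {}

    // Returns false and marks the recording as truncated if the payload would
    // not fit, rather than overflowing into arbitrary memory.
    bool write(const void* payload, std::size_t len) {
        if (len > data_.size() - size_) {
            truncated_ = true;  // can be surfaced as a "truncated recording" metric
            return false;
        }
        std::memcpy(data_.data() + size_, payload, len);
        size_ += len;
        return true;
    }

    bool truncated() const { return truncated_; }

private:
    std::vector<std::uint8_t> data_;
    std::size_t size_;
    bool truncated_;
};
```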

Our build pulls the latest dd-trace-java into our images at build time, so it seemed like this could have been the cause, since no other dependencies were updated. It would have been running 1.34.0 at the time of the crashes.

We then noticed that services which did not have profiling enabled were not crashing. Also, our staging services receive only 1% of the traffic of our prod services, and DD_TRACE_SAMPLE_RATE was 0.001, which helps explain why the staging services were sometimes able to start successfully (by chance, no traffic hit the profiler).

Setting DD_PROFILING_ENABLED=false resolved the issue immediately.

Going forward, our workaround is to not use the profiling feature.

We have identified the cause of this bug and are working on releasing a fix

This is great to hear, would appreciate any updates to give us confidence on re-enabling the feature

credpath-seek avatar Jun 20 '24 00:06 credpath-seek

Hi @credpath-seek, this is unexpected. We test the profiler in lots of environments, and it runs in many other environments without these issues, so there is probably an incompatibility with something in your environment which we'd like to get to the bottom of to avoid this happening again. If we can get the backtrace and the siginfo sections from the hs_err files we would be able to pinpoint the cause. Just based on the error message, the cause is different to the crash reported (which is within AsyncGetCallTrace). If you do have this information available, please either reply here or open a support ticket so we can get the underlying issue fixed (and tested for).

You could always re-enable profiling but set DD_PROFILING_DDPROF_ENABLED: false - which would make the profiler fall back to JFR. The profiles will be less detailed, but both crashes reported in this issue (one in AsyncGetCallTrace, as well as yours) would have been avoided.

richardstartin avatar Jun 20 '24 09:06 richardstartin

@zBart hs_err files are welcome. They help us identify how we can further protect the usage of AsyncGetCallTrace. Reproducers are even better, though I can see how that is hard to come by.

Just to give more context on the profiling direction and what @richardstartin mentioned: we continue to maintain two flavours of profiling.

  • One based on this library (DD_PROFILING_DDPROF_ENABLED: true)
  • One based on JFR events (DD_PROFILING_DDPROF_ENABLED: false)

So by switching to the JFR events you will have a stable profiling experience. If you feel you are missing some features, we would be happy to get that feedback. There are also longer-term initiatives ongoing to improve the stability of observing Java.

r1viollet avatar Jun 20 '24 11:06 r1viollet

Hi @zBart did you change anything else when you did the upgrade? Could you open a support ticket so I can get access to the hs_err file please?

@zBart hs_err files are welcome. They help us identify how we can further protect the usage of AsyncGetCallTrace. Reproducers are even better, though I can see how that is hard to come by.

Done, ticket ID is: #1749443

Any idea for an ETA for a fix?

zBart avatar Jun 24 '24 12:06 zBart

Hi @zBart - the fix is in https://github.com/DataDog/java-profiler/pull/107 which will be coming to dd-trace-java in the upcoming release (beginning of July 2024)

jbachorik avatar Jun 24 '24 13:06 jbachorik

Is this fixed by https://github.com/DataDog/dd-trace-java/pull/7229 ?

oriy avatar Jun 30 '24 08:06 oriy

@oriy If you mean the one reported @zBart then yes, the fix is there and available in dd-trace-java 1.36.0

jbachorik avatar Jul 01 '24 12:07 jbachorik

@oriy If you mean the one reported @zBart then yes, the fix is there and available in dd-trace-java 1.36.0

I meant to ask whether #7144 is fixed by #7229. We have tried deploying 1.36.0 (profiler 1.9 included) and still crash with SIGSEGV.

Running the latest Temurin 11 with the G1 GC on linux-aarch64.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffff843c8c98, pid=1, tid=117
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.23+9 (11.0.23+9) (build 11.0.23+9)
# Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.23+9 (11.0.23+9, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0x608c98]  frame::safe_for_sender(JavaThread*)+0x278
#
# Core dump will be written. Default location: /opt/app/core.1
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

oriy avatar Jul 01 '24 14:07 oriy

@oriy This is a different failure mode, unfortunately :( Would you be able to file a support ticket and include the whole hs_err file there?

jbachorik avatar Jul 01 '24 15:07 jbachorik

Got the same errors on the latest AWS Graviton CPU (7g) with Datadog profiler enabled.

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffff96d828a8, pid=1, tid=298
#
# JRE version: OpenJDK Runtime Environment (Red_Hat-17.0.10.0.7-1) (17.0.10+7) (build 17.0.10+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM (Red_Hat-17.0.10.0.7-1) (17.0.10+7-LTS, mixed mode, tiered, compressed oops, compressed class ptrs, parallel gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0x64f8a8]  Dictionary::find_class(unsigned int, Symbol*)+0x0
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffffa372e798, pid=1, tid=251
#
# JRE version: OpenJDK Runtime Environment (Red_Hat-17.0.10.0.7-1) (17.0.10+7) (build 17.0.10+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM (Red_Hat-17.0.10.0.7-1) (17.0.10+7-LTS, mixed mode, tiered, compressed oops, compressed class ptrs, parallel gc, linux-aarch64)
# Problematic frame:
# C  [libc.so.6+0x28798]  __GI___memcpy_simd+0xd8

And there are no errors with the DD profiler disabled by setting DD_PROFILING_DDPROF_ENABLED=false.

ivan-sukhomlyn avatar Jul 02 '24 15:07 ivan-sukhomlyn

@ivan-sukhomlyn Hi, thanks for the report. This is a different failure mode - V [libjvm.so+0x64f8a8] Dictionary::find_class(unsigned int, Symbol*)+0x0 vs. V [libjvm.so+0x5b8d61] CodeCache::find_blob(void*)+0xc1

Would you be able to open a support ticket and submit the full hs_err.log file there so we can properly analyze the full crash? Thanks!

jbachorik avatar Jul 02 '24 16:07 jbachorik

@oriy This is a different failure mode, unfortunately :( Would you be able to file a support ticket and include the whole hs_err file there?

Sure @jbachorik, submitted ticket 1760710 Thanks 🙏

oriy avatar Jul 03 '24 23:07 oriy

Hi @jbachorik Thank you for the suggestion.

This is the answer from the support team regarding the requests mentioned in this issue:

Thanks to this, our team took a look and it seems to be unrelated to the tracer and the profiler. In error-log-4 we see this is related to class transformation, not profiling. Specifically, the failure is in the JVMTI ClassFileLoadHook internals, and this is outside of anything the tracer or profiler can do. Considering that this issue started happening after upgrading the Graviton processor to v3, it is very likely a bug in the JVMTI implementation manifesting on Graviton 3. The reason the issue is resolved by setting DD_PROFILING_DDPROF_ENABLED=false is that the ClassFileLoadHook is no longer set up, so the buggy JVMTI code is not executed.

The recommended step here is to open a support ticket with the JDK vendor and let us know how it goes. In the ticket you can add that the Datadog Engineering team looked at this and that it might be related to https://bugs.openjdk.org/browse/JDK-8307315, though that is uncertain. Adding this in case it might help find the root cause. Let me know if you have any questions on this.
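
(For context on the mechanism named above: a native agent registers a JVMTI ClassFileLoadHook roughly as in the following minimal C++ sketch. This is illustrative only, not the Datadog profiler's actual code, but it shows the JVMTI entry points involved. The crash described above would be inside the JVM's handling of this hook, not inside the hook itself.)

```cpp
#include <jvmti.h>
#include <cstring>

// Invoked by the JVM for every class being loaded or retransformed.
static void JNICALL OnClassFileLoad(jvmtiEnv* jvmti, JNIEnv* jni,
                                    jclass class_being_redefined, jobject loader,
                                    const char* name, jobject protection_domain,
                                    jint class_data_len, const unsigned char* class_data,
                                    jint* new_class_data_len, unsigned char** new_class_data) {
    // Pass-through hook: leave *new_class_data untouched so the JVM keeps the
    // original bytecode. A real agent could inspect or rewrite it here.
}

JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM* vm, char* options, void* reserved) {
    jvmtiEnv* jvmti = nullptr;
    if (vm->GetEnv(reinterpret_cast<void**>(&jvmti), JVMTI_VERSION_1_2) != JNI_OK) {
        return JNI_ERR;
    }

    jvmtiCapabilities caps;
    std::memset(&caps, 0, sizeof(caps));
    caps.can_generate_all_class_hook_events = 1;
    jvmti->AddCapabilities(&caps);

    jvmtiEventCallbacks callbacks;
    std::memset(&callbacks, 0, sizeof(callbacks));
    callbacks.ClassFileLoadHook = &OnClassFileLoad;
    jvmti->SetEventCallbacks(&callbacks, static_cast<jint>(sizeof(callbacks)));
    jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_CLASS_FILE_LOAD_HOOK, nullptr);
    return JNI_OK;
}
```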

ivan-sukhomlyn avatar Jul 05 '24 16:07 ivan-sukhomlyn

We had a similar (or the same?) crash happen today, running on plugin version v1.37.1:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (codeCache.cpp:654), pid=14, tid=80
#  guarantee(is_result_safe || is_in_asgct()) failed: unsafe access to zombie method
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.12.7.1 (17.0.12+7) (build 17.0.12+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.12.7.1 (17.0.12+7-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x5b9d21]  CodeCache::find_blob(void*)+0xc1
#

[...]


---------------  T H R E A D  ---------------

Current thread (0x00007f71b649b7a0):  JavaThread "SFSWorker:Ext:2" [_thread_in_Java, id=80, stack(0x00007f71034f5000,0x00007f71035f6000)]

Stack: [0x00007f71034f5000,0x00007f71035f6000],  sp=0x00007f71035f2d10,  free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x5b9d21]  CodeCache::find_blob(void*)+0xc1
V  [libjvm.so+0xef423b]  JavaThread::pd_get_top_frame_for_signal_handler(frame*, void*, bool)+0x1ab
V  [libjvm.so+0x6f0963]  AsyncGetCallTrace+0x1c3
C  [libjavaProfiler2894210155550895396.so+0x1df86]  Profiler::getJavaTraceAsync(void*, ASGCT_CallFrame*, int, StackContext*, bool*)+0x176
C  [libjavaProfiler2894210155550895396.so+0x1ee7f]  Profiler::recordSample(void*, unsigned long long, int, int, unsigned int, Event*)+0x1ef
C  [libjavaProfiler2894210155550895396.so+0x1253b]  CTimer::signalHandler(int, siginfo_t*, void*)+0x13b
C  [libpthread.so.0+0x118e0]
C  [linux-vdso.so.1+0xc50]  clock_gettime+0x230
C  [libc.so.6+0xfc426]  __clock_gettime+0x26

Registers:
[...]

I sent the hs_err file in ticket ID 1848726

zBart avatar Sep 18 '24 09:09 zBart

Hi @zBart - this is a different failure mode. The root cause is that AsyncGetCallTrace is not reentrant: if, for example, the CPU sampler gets interrupted in the middle of a call to AsyncGetCallTrace by the wall-clock sampler, which will then try to execute ASGCT again, things can go wrong and you end up crashing. We have a workaround for that situation in https://github.com/DataDog/java-profiler/pull/107, but it was integrated only in the latest version of dd-trace-java (1.39.0).
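
(Illustration only, not the actual java-profiler change from PR #107: one common way a shared sampler signal handler can avoid re-entering ASGCT on the same thread is a per-thread guard flag, as in this minimal C++ sketch with illustrative names. The point is simply that the nested signal becomes a dropped sample instead of a second ASGCT call.)

```cpp
#include <signal.h>

// Set while this thread is in the middle of taking a sample.
static thread_local volatile sig_atomic_t in_sample = 0;

static void sharedSignalHandler(int signo, siginfo_t* info, void* ucontext) {
    // If the CPU sampler on this thread was interrupted mid-sample by the
    // wall-clock signal (or vice versa), drop this sample instead of calling
    // the non-reentrant ASGCT a second time.
    if (in_sample) {
        return;
    }
    in_sample = 1;

    // ... build the ASGCT trace here, e.g.
    // AsyncGetCallTrace(&trace, kMaxFrames, ucontext);

    in_sample = 0;
}

int main() {
    struct sigaction sa = {};
    sa.sa_sigaction = sharedSignalHandler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, nullptr);  // e.g. the CPU sampler's signal
    return 0;
}
```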

jbachorik avatar Sep 18 '24 11:09 jbachorik

Turns out to be a ridiculous JVM bug, JDK-8352649: the Thread class had two _in_asgct fields; AsyncGetCallTrace set one, but the guarantee read the other.
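
(A toy C++ illustration of that mismatch, not HotSpot code: if the flag AsyncGetCallTrace sets is not the flag the guarantee reads, the check can fail even though the thread really is inside ASGCT.)

```cpp
#include <cassert>

// Two flags that were supposed to be one and the same.
struct Thread {
    bool in_asgct_set_by_asgct = false;  // the field ASGCT actually sets
    bool in_asgct_read_by_check = false; // the field the guarantee reads
};

bool is_in_asgct(const Thread& t) {
    return t.in_asgct_read_by_check;     // reads the wrong field
}

int main() {
    Thread t;
    t.in_asgct_set_by_asgct = true;      // ASGCT marks the thread...
    assert(!is_in_asgct(t));             // ...but the check still says "no",
                                         // so the guarantee can fire spuriously
    return 0;
}
```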

apangin avatar Mar 23 '25 23:03 apangin