Corretto 21 JVM crash within EKS
Describe the bug
Our Java application running on Corretto 21.0.7.6.1 experienced a fatal JVM crash caused by a segmentation fault (SIGSEGV). It seems to occur randomly. The app is a short-lived microservice, and many instances are created and destroyed each day. It crashes very rarely, maybe a few times a week.
Problematic frame: V [libjvm.so+0x9249d0] JavaThread::is_interrupted(bool)+0x0
The crash happened in the context of a background thread named StatsD-Sender-1, part of the dogstatsd Java client.
To Reproduce
We have not been able to reliably reproduce the issue, but it occurred during normal operation of our service.
Expected behavior
The JVM should remain stable and not crash.
Platform information
OS: Red Hat Enterprise Linux 9.6 (Plow)
Version Corretto-21.0.7.6.1
Additional context
The application was running as a containerized Java process inside a Kubernetes cluster. Datadog’s Java agent was in use.
Hi @metodski, thank you for reporting the issue. The Datadog agent has a history of crashes: https://github.com/search?q=repo%3ADataDog%2Fdd-trace-java+jvm+crash&type=issues
Can you provide other hs_err logs so we can check whether the other crashes also happen in JavaThread::is_interrupted?
We recommend reporting the crash to DataDog: https://github.com/DataDog/dd-trace-java/issues
@jbachorik could you please take a quick look at this issue? Have you seen such crashes before and do you think it is related to Datadog's Java client or do you think it is Corretto specific?
Hm, the crash stack does not contain anything obviously DD related. But reporting the crash to DD and opening an escalation will allow us to gather more information in a controlled manner.
Below is the disassembly at the PC where the crash occurs, i.e. 0x00007f37b403e9d0 ([libjvm.so+0x9249d0], or JavaThread::is_interrupted(bool)+0x0). Register rdi is 0, so the very first instruction, which loads from [rdi + 0x378], faults and we get a SIGSEGV with si_code: 1 (SEGV_MAPERR) and si_addr: 0x0000000000000378. rdi is the first argument register, so for JavaThread::is_interrupted(bool) it should contain the this pointer (rsi, which carries the first method parameter, is 0, i.e. false).
0x00007f37b403e9d0: 48 8B 87 78 03 00 00 mov rax, qword ptr [rdi + 0x378]
0x00007f37b403e9d7: 48 85 C0 test rax, rax
0x00007f37b403e9da: 0F 84 80 00 00 00 je 0x7f37b403ea60
0x00007f37b403e9e0: 55 push rbp
0x00007f37b403e9e1: 48 89 E5 mov rbp, rsp
0x00007f37b403e9e4: 41 54 push r12
0x00007f37b403e9e6: 49 89 FC mov r12, rdi
0x00007f37b403e9e9: 53 push rbx
0x00007f37b403e9ea: 48 89 C7 mov rdi, rax
0x00007f37b403e9ed: 89 F3 mov ebx, esi
0x00007f37b403e9ef: FF 15 3B A3 E0 00 call qword ptr [rip + 0xe0a33b]
0x00007f37b403e9f5: 48 85 C0 test rax, rax
0x00007f37b403e9f8: 74 46 je 0x7f37b403ea40
0x00007f37b403e9fa: 49 8B BC 24 78 03 00 00 mov rdi, qword ptr [r12 + 0x378]
0x00007f37b403ea02: 48 85 FF test rdi, rdi
0x00007f37b403ea05: 74 49 je 0x7f37b403ea50
0x00007f37b403ea07: FF 15 FB 55 E0 00 call qword ptr [rip + 0xe055fb]
0x00007f37b403ea0d: 48 89 C7 mov rdi, rax
0x00007f37b403ea10: E8 8B 1C FF FF call 0x7f37b40306a0
0x00007f37b403ea15: 20 C3 and bl, al
0x00007f37b403ea17: 74 1F je 0x7f37b403ea38
0x00007f37b403ea19: 49 8B BC 24 78 03 00 00 mov rdi, qword ptr [r12 + 0x378]
0x00007f37b403ea21: 48 85 FF test rdi, rdi
0x00007f37b403ea24: 74 32 je 0x7f37b403ea58
0x00007f37b403ea26: FF 15 DC 55 E0 00 call qword ptr [rip + 0xe055dc]
0x00007f37b403ea2c: 48 89 C7 mov rdi, rax
0x00007f37b403ea2f: 31 F6 xor esi, esi
0x00007f37b403ea31: E8 8A 1C FF FF call 0x7f37b40306c0
0x00007f37b403ea36: 89 D8 mov eax, ebx
0x00007f37b403ea38: 5B pop rbx
0x00007f37b403ea39: 41 5C pop r12
0x00007f37b403ea3b: 5D pop rbp
0x00007f37b403ea3c: C3 ret
0x00007f37b403ea3d: 0F 1F 00 nop dword ptr [rax]
0x00007f37b403ea40: 5B pop rbx
0x00007f37b403ea41: 31 C0 xor eax, eax
0x00007f37b403ea43: 41 5C pop r12
0x00007f37b403ea45: 5D pop rbp
0x00007f37b403ea46: C3 ret
And here's the disassembly of Unsafe_Park(), which calls Parker::park(bool, long) at ...fc1496 (notice that Parker::park(bool, long) is missing from the native stack trace in the hs_err file):
0000000000fc13c0 <Unsafe_Park>:
fc13c0: 55 push %rbp
fc13c1: 48 89 e5 mov %rsp,%rbp
fc13c4: 41 57 push %r15
fc13c6: 41 56 push %r14
fc13c8: 41 55 push %r13
fc13ca: 41 54 push %r12
fc13cc: 49 89 fd mov %rdi,%r13
fc13cf: 53 push %rbx
fc13d0: 48 8d 9f 48 fc ff ff lea -0x3b8(%rdi),%rbx
fc13d7: 41 89 d6 mov %edx,%r14d
fc13da: 49 89 cc mov %rcx,%r12
fc13dd: 48 83 ec 78 sub $0x78,%rsp
fc13e1: 8b 83 68 04 00 00 mov 0x468(%rbx),%eax
fc13e7: 2d ad de 00 00 sub $0xdead,%eax
fc13ec: 83 f8 01 cmp $0x1,%eax
fc13ef: 0f 86 f3 01 00 00 jbe fc15e8 <Unsafe_Park+0x228>
fc13f5: 48 8d 05 a4 eb 7a 00 lea 0x7aeba4(%rip),%rax # 176ffa0 <UseSystemMemoryBarrier>
fc13fc: 80 38 00 cmpb $0x0,(%rax)
fc13ff: c7 83 44 04 00 00 06 movl $0x6,0x444(%rbx)
fc1406: 00 00 00
fc1409: 75 05 jne fc1410 <Unsafe_Park+0x50>
fc140b: f0 83 04 24 00 lock addl $0x0,(%rsp)
fc1410: 48 8b 83 48 04 00 00 mov 0x448(%rbx),%rax
fc1417: a8 01 test $0x1,%al
fc1419: 0f 85 b1 01 00 00 jne fc15d0 <Unsafe_Park+0x210>
fc141f: 8b 83 40 04 00 00 mov 0x440(%rbx),%eax
fc1425: a8 0c test $0xc,%al
fc1427: 74 08 je fc1431 <Unsafe_Park+0x71>
fc1429: 48 89 df mov %rbx,%rdi
fc142c: e8 2f 7a 96 ff call 928e60 <JavaThread::handle_special_runtime_exit_condition()>
fc1431: 4c 8d 3d a8 37 7b 00 lea 0x7b37a8(%rip),%r15 # 1774be0 <JfrEventSetting::_jvm_event_settings>
fc1438: c7 83 44 04 00 00 06 movl $0x6,0x444(%rbx)
fc143f: 00 00 00
fc1442: 48 c7 45 90 00 00 00 movq $0x0,-0x70(%rbp)
fc1449: 00
fc144a: 48 c7 45 98 00 00 00 movq $0x0,-0x68(%rbp)
fc1451: 00
fc1452: c6 45 a0 00 movb $0x0,-0x60(%rbp)
fc1456: c6 45 a1 00 movb $0x0,-0x5f(%rbp)
fc145a: 41 80 bf 01 01 00 00 cmpb $0x0,0x101(%r15)
fc1461: 00
fc1462: c6 45 a2 00 movb $0x0,-0x5e(%rbp)
fc1466: 0f 85 8c 01 00 00 jne fc15f8 <Unsafe_Park+0x238>
fc146c: 31 d2 xor %edx,%edx
fc146e: 48 8d bd 60 ff ff ff lea -0xa0(%rbp),%rdi
fc1475: 4d 85 e4 test %r12,%r12
fc1478: 0f 95 c2 setne %dl
fc147b: 48 89 de mov %rbx,%rsi
fc147e: e8 3d 16 00 00 call fc2ac0 <JavaThreadParkedState::JavaThreadParkedState(JavaThread*, bool)>
fc1483: 31 f6 xor %esi,%esi
fc1485: 48 8d bb 10 06 00 00 lea 0x610(%rbx),%rdi
fc148c: 45 84 f6 test %r14b,%r14b
fc148f: 40 0f 95 c6 setne %sil
fc1493: 4c 89 e2 mov %r12,%rdx
fc1496: e8 f5 19 d4 ff call d02e90 <Parker::park(bool, long)>
fc149b: 41 80 bf 01 01 00 00 cmpb $0x0,0x101(%r15)
And finally Parker::park(bool, long), which calls JavaThread::is_interrupted(bool):
0000000000d02e90 <Parker::park(bool, long)>:
d02e90: 31 c9 xor %ecx,%ecx
d02e92: 87 0f xchg %ecx,(%rdi)
d02e94: 85 c9 test %ecx,%ecx
d02e96: 7e 08 jle d02ea0 <Parker::park(bool, long)+0x10>
d02e98: f3 c3 repz ret
d02e9a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
d02ea0: 55 push %rbp
d02ea1: 48 89 e5 mov %rsp,%rbp
d02ea4: 41 57 push %r15
d02ea6: 41 56 push %r14
d02ea8: 41 55 push %r13
d02eaa: 41 54 push %r12
d02eac: 49 89 d6 mov %rdx,%r14
d02eaf: 53 push %rbx
d02eb0: 41 89 f7 mov %esi,%r15d
d02eb3: 41 89 f4 mov %esi,%r12d
d02eb6: 48 89 fb mov %rdi,%rbx
d02eb9: 48 83 ec 58 sub $0x58,%rsp
d02ebd: 66 48 8d 3d bb 70 a2 data16 lea 0xa270bb(%rip),%rdi # 1729f80 <_GLOBAL_OFFSET_TABLE_+0xb90>
d02ec4: 00
d02ec5: 66 66 48 e8 f3 fc 59 data16 data16 rex.W call 2a2bc0 <__tls_get_addr@plt>
d02ecc: ff
d02ecd: 31 f6 xor %esi,%esi
d02ecf: 4c 8b 28 mov (%rax),%r13
d02ed2: 4c 89 ef mov %r13,%rdi
d02ed5: e8 f6 1a c2 ff call 9249d0 <JavaThread::is_interrupted(bool)>
d02eda: 4d 85 f6 test %r14,%r14
It copies r13 into rdi before the call, which is plausible because r13 is 0 as well in the hs_err file when the crash happens. It also initializes the first method parameter (i.e. esi) to 0, which is also plausible because Parker::park() calls JavaThread::is_interrupted() with a false argument:
void Parker::park(bool isAbsolute, jlong time) {
  // Optional fast-path check:
  // Return immediately if a permit is available.
  // We depend on Atomic::xchg() having full barrier semantics
  // since we are doing a lock-free update to _counter.
  if (Atomic::xchg(&_counter, 0) > 0) return;
  JavaThread *jt = JavaThread::current();
  // Optional optimization -- avoid state transitions if there's
  // an interrupt pending.
  if (jt->is_interrupted(false)) {
    return;
  }
So it looks like JavaThread::current(), which reads the current JavaThread from a thread-local variable, returns 0 for the current thread. I'm not sure how this can happen. According to the message of the assertion in Thread::current(), this can happen if a thread has already been detached:
class JavaThread: public Thread {
  ...
  static JavaThread* current() {
    return JavaThread::cast(Thread::current());
  }
  ...
}
...
inline Thread* Thread::current() {
  Thread* current = current_or_null();
  assert(current != nullptr, "Thread::current() called on detached thread");
  return current;
}
...
inline Thread* Thread::current_or_null() {
#ifndef USE_LIBRARY_BASED_TLS_ONLY
  return _thr_current;
#else
  if (ThreadLocalStorage::is_initialized()) {
    return ThreadLocalStorage::thread();
  }
  return nullptr;
#endif
}
...
#ifndef USE_LIBRARY_BASED_TLS_ONLY
// Current thread is maintained as a thread-local variable
THREAD_LOCAL Thread* Thread::_thr_current = nullptr;
#endif
The hs_err file also doesn't contain a list of active Java threads because of:
[error occurred during error reporting (printing all threads), id 0xb, SIGSEGV (0xb) at pc=0x00007f37b46ad69d]
so it seems that indeed the HotSpot internal bookkeeping of Java threads was corrupted.
@simonis Thanks for the explanation. I submitted a ticket in dd-trace-java as well.