Corretto 21 JVM crash within EKS

Open metodski opened this issue 8 months ago • 5 comments

Describe the bug

Our Java application running on Corretto 21.0.7.6.1 experienced a fatal JVM crash caused by a segmentation fault (SIGSEGV). It seems to occur randomly. The app is a short-lived microservice, and many instances are created and destroyed each day. It crashes very rarely, maybe a few times a week.

Problematic frame: V [libjvm.so+0x9249d0] JavaThread::is_interrupted(bool)+0x0

The crash happened in the context of a background thread named StatsD-Sender-1, part of the dogstatsd Java client.

To Reproduce

We have not been able to reliably reproduce the issue, but it occurred during normal operation of our service.

Expected behavior

The JVM should remain stable and not crash.

Platform information

OS: Red Hat Enterprise Linux 9.6 (Plow)
Version: Corretto-21.0.7.6.1

Additional context

The application was running as a containerized Java process inside a Kubernetes cluster. Datadog’s Java agent was in use.

hs_error_pid1.log

metodski avatar Jun 23 '25 06:06 metodski

Hi @metodski, thank you for reporting the issue. The Datadog agent has a history of crashes: https://github.com/search?q=repo%3ADataDog%2Fdd-trace-java+jvm+crash&type=issues

Can you provide other hs_error logs to check crashes happening in JavaThread::is_interrupted?

We recommend reporting the crash to DataDog: https://github.com/DataDog/dd-trace-java/issues

eastig avatar Jun 25 '25 12:06 eastig

@jbachorik could you please take a quick look at this issue? Have you seen such crashes before and do you think it is related to Datadog's Java client or do you think it is Corretto specific?

simonis avatar Jun 25 '25 13:06 simonis

Hm, the crash stack does not contain anything obviously DD related. But reporting the crash to DD and opening an escalation will allow us to gather more information in a controlled manner.

jbachorik avatar Jun 25 '25 14:06 jbachorik

The following is the disassembly at the PC where the crash occurs (i.e. 0x00007f37b403e9d0, [libjvm.so+0x9249d0] or JavaThread::is_interrupted(bool)+0x0). Register rdi is 0, so we get a SIGSEGV with si_code: 1 (SEGV_MAPERR) and si_addr: 0x0000000000000378. rdi is the first argument register, so for JavaThread::is_interrupted(bool) it should contain the this pointer (rsi, which holds the first method parameter, is 0, i.e. false).

0x00007f37b403e9d0:  48 8B 87 78 03 00 00             mov  rax, qword ptr [rdi + 0x378]
0x00007f37b403e9d7:  48 85 C0                         test rax, rax
0x00007f37b403e9da:  0F 84 80 00 00 00                je   0x7f37b403ea60
0x00007f37b403e9e0:  55                               push rbp
0x00007f37b403e9e1:  48 89 E5                         mov  rbp, rsp
0x00007f37b403e9e4:  41 54                            push r12
0x00007f37b403e9e6:  49 89 FC                         mov  r12, rdi
0x00007f37b403e9e9:  53                               push rbx
0x00007f37b403e9ea:  48 89 C7                         mov  rdi, rax
0x00007f37b403e9ed:  89 F3                            mov  ebx, esi
0x00007f37b403e9ef:  FF 15 3B A3 E0 00                call qword ptr [rip + 0xe0a33b]
0x00007f37b403e9f5:  48 85 C0                         test rax, rax
0x00007f37b403e9f8:  74 46                            je   0x7f37b403ea40
0x00007f37b403e9fa:  49 8B BC 24 78 03 00 00          mov  rdi, qword ptr [r12 + 0x378]
0x00007f37b403ea02:  48 85 FF                         test rdi, rdi
0x00007f37b403ea05:  74 49                            je   0x7f37b403ea50
0x00007f37b403ea07:  FF 15 FB 55 E0 00                call qword ptr [rip + 0xe055fb]
0x00007f37b403ea0d:  48 89 C7                         mov  rdi, rax
0x00007f37b403ea10:  E8 8B 1C FF FF                   call 0x7f37b40306a0
0x00007f37b403ea15:  20 C3                            and  bl, al
0x00007f37b403ea17:  74 1F                            je   0x7f37b403ea38
0x00007f37b403ea19:  49 8B BC 24 78 03 00 00          mov  rdi, qword ptr [r12 + 0x378]
0x00007f37b403ea21:  48 85 FF                         test rdi, rdi
0x00007f37b403ea24:  74 32                            je   0x7f37b403ea58
0x00007f37b403ea26:  FF 15 DC 55 E0 00                call qword ptr [rip + 0xe055dc]
0x00007f37b403ea2c:  48 89 C7                         mov  rdi, rax
0x00007f37b403ea2f:  31 F6                            xor  esi, esi
0x00007f37b403ea31:  E8 8A 1C FF FF                   call 0x7f37b40306c0
0x00007f37b403ea36:  89 D8                            mov  eax, ebx
0x00007f37b403ea38:  5B                               pop  rbx
0x00007f37b403ea39:  41 5C                            pop  r12
0x00007f37b403ea3b:  5D                               pop  rbp
0x00007f37b403ea3c:  C3                               ret  
0x00007f37b403ea3d:  0F 1F 00                         nop  dword ptr [rax]
0x00007f37b403ea40:  5B                               pop  rbx
0x00007f37b403ea41:  31 C0                            xor  eax, eax
0x00007f37b403ea43:  41 5C                            pop  r12
0x00007f37b403ea45:  5D                               pop  rbp
0x00007f37b403ea46:  C3                               ret  
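
To make the fault address concrete: the first instruction loads a field at offset 0x378 from the this pointer, so a null this touches address 0x378, which matches si_addr. Below is a minimal sketch of that arithmetic in C++. FakeThread, field_at_0x378, load_address and null_this_fault_address are hypothetical names; only the 0x378 offset comes from the disassembly above, and which JavaThread field lives at that offset is not established here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for JavaThread; only the layout matters here.
struct FakeThread {
    unsigned char preceding_fields[0x378];  // everything laid out before the field
    void*         field_at_0x378;           // field read by `mov rax, [rdi + 0x378]`
};

// The displacement in the instruction is simply the field's offset.
static_assert(offsetof(FakeThread, field_at_0x378) == 0x378,
              "field must sit at the offset seen in the disassembly");

// Address the CPU accesses when the field is loaded through `this_ptr`.
inline std::uintptr_t load_address(std::uintptr_t this_ptr) {
    return this_ptr + offsetof(FakeThread, field_at_0x378);
}

// With a null `this`, the access address equals the bare field offset.
inline std::uintptr_t null_this_fault_address() {
    std::uintptr_t addr = load_address(0);
    assert(addr == offsetof(FakeThread, field_at_0x378));
    return addr;
}
```

Because the faulting access is the very first instruction of the function (+0x0), the null pointer must already have been passed in by the caller.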

And here's the disassembly of Unsafe_Park(), which calls Parker::park(bool, long) at ...fc1496 (notice that Parker::park(bool, long) is missing from the native stack trace in the hs_err file):

0000000000fc13c0 <Unsafe_Park>:
  fc13c0:	55                   	push   %rbp
  fc13c1:	48 89 e5             	mov    %rsp,%rbp
  fc13c4:	41 57                	push   %r15
  fc13c6:	41 56                	push   %r14
  fc13c8:	41 55                	push   %r13
  fc13ca:	41 54                	push   %r12
  fc13cc:	49 89 fd             	mov    %rdi,%r13
  fc13cf:	53                   	push   %rbx
  fc13d0:	48 8d 9f 48 fc ff ff 	lea    -0x3b8(%rdi),%rbx
  fc13d7:	41 89 d6             	mov    %edx,%r14d
  fc13da:	49 89 cc             	mov    %rcx,%r12
  fc13dd:	48 83 ec 78          	sub    $0x78,%rsp
  fc13e1:	8b 83 68 04 00 00    	mov    0x468(%rbx),%eax
  fc13e7:	2d ad de 00 00       	sub    $0xdead,%eax
  fc13ec:	83 f8 01             	cmp    $0x1,%eax
  fc13ef:	0f 86 f3 01 00 00    	jbe    fc15e8 <Unsafe_Park+0x228>
  fc13f5:	48 8d 05 a4 eb 7a 00 	lea    0x7aeba4(%rip),%rax        # 176ffa0 <UseSystemMemoryBarrier>
  fc13fc:	80 38 00             	cmpb   $0x0,(%rax)
  fc13ff:	c7 83 44 04 00 00 06 	movl   $0x6,0x444(%rbx)
  fc1406:	00 00 00 
  fc1409:	75 05                	jne    fc1410 <Unsafe_Park+0x50>
  fc140b:	f0 83 04 24 00       	lock addl $0x0,(%rsp)
  fc1410:	48 8b 83 48 04 00 00 	mov    0x448(%rbx),%rax
  fc1417:	a8 01                	test   $0x1,%al
  fc1419:	0f 85 b1 01 00 00    	jne    fc15d0 <Unsafe_Park+0x210>
  fc141f:	8b 83 40 04 00 00    	mov    0x440(%rbx),%eax
  fc1425:	a8 0c                	test   $0xc,%al
  fc1427:	74 08                	je     fc1431 <Unsafe_Park+0x71>
  fc1429:	48 89 df             	mov    %rbx,%rdi
  fc142c:	e8 2f 7a 96 ff       	call   928e60 <JavaThread::handle_special_runtime_exit_condition()>
  fc1431:	4c 8d 3d a8 37 7b 00 	lea    0x7b37a8(%rip),%r15        # 1774be0 <JfrEventSetting::_jvm_event_settings>
  fc1438:	c7 83 44 04 00 00 06 	movl   $0x6,0x444(%rbx)
  fc143f:	00 00 00 
  fc1442:	48 c7 45 90 00 00 00 	movq   $0x0,-0x70(%rbp)
  fc1449:	00 
  fc144a:	48 c7 45 98 00 00 00 	movq   $0x0,-0x68(%rbp)
  fc1451:	00 
  fc1452:	c6 45 a0 00          	movb   $0x0,-0x60(%rbp)
  fc1456:	c6 45 a1 00          	movb   $0x0,-0x5f(%rbp)
  fc145a:	41 80 bf 01 01 00 00 	cmpb   $0x0,0x101(%r15)
  fc1461:	00 
  fc1462:	c6 45 a2 00          	movb   $0x0,-0x5e(%rbp)
  fc1466:	0f 85 8c 01 00 00    	jne    fc15f8 <Unsafe_Park+0x238>
  fc146c:	31 d2                	xor    %edx,%edx
  fc146e:	48 8d bd 60 ff ff ff 	lea    -0xa0(%rbp),%rdi
  fc1475:	4d 85 e4             	test   %r12,%r12
  fc1478:	0f 95 c2             	setne  %dl
  fc147b:	48 89 de             	mov    %rbx,%rsi
  fc147e:	e8 3d 16 00 00       	call   fc2ac0 <JavaThreadParkedState::JavaThreadParkedState(JavaThread*, bool)>
  fc1483:	31 f6                	xor    %esi,%esi
  fc1485:	48 8d bb 10 06 00 00 	lea    0x610(%rbx),%rdi
  fc148c:	45 84 f6             	test   %r14b,%r14b
  fc148f:	40 0f 95 c6          	setne  %sil
  fc1493:	4c 89 e2             	mov    %r12,%rdx
  fc1496:	e8 f5 19 d4 ff       	call   d02e90 <Parker::park(bool, long)>
  fc149b:	41 80 bf 01 01 00 00 	cmpb   $0x0,0x101(%r15)

And finally Parker::park(bool, long), which calls JavaThread::is_interrupted(bool):

0000000000d02e90 <Parker::park(bool, long)>:
  d02e90:       31 c9                   xor    %ecx,%ecx
  d02e92:       87 0f                   xchg   %ecx,(%rdi)
  d02e94:       85 c9                   test   %ecx,%ecx
  d02e96:       7e 08                   jle    d02ea0 <Parker::park(bool, long)+0x10>
  d02e98:       f3 c3                   repz ret 
  d02e9a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  d02ea0:       55                      push   %rbp
  d02ea1:       48 89 e5                mov    %rsp,%rbp
  d02ea4:       41 57                   push   %r15
  d02ea6:       41 56                   push   %r14
  d02ea8:       41 55                   push   %r13
  d02eaa:       41 54                   push   %r12
  d02eac:       49 89 d6                mov    %rdx,%r14
  d02eaf:       53                      push   %rbx
  d02eb0:       41 89 f7                mov    %esi,%r15d
  d02eb3:       41 89 f4                mov    %esi,%r12d
  d02eb6:       48 89 fb                mov    %rdi,%rbx
  d02eb9:       48 83 ec 58             sub    $0x58,%rsp
  d02ebd:       66 48 8d 3d bb 70 a2    data16 lea 0xa270bb(%rip),%rdi        # 1729f80 <_GLOBAL_OFFSET_TABLE_+0xb90>
  d02ec4:       00 
  d02ec5:       66 66 48 e8 f3 fc 59    data16 data16 rex.W call 2a2bc0 <__tls_get_addr@plt>
  d02ecc:       ff 
  d02ecd:       31 f6                   xor    %esi,%esi
  d02ecf:       4c 8b 28                mov    (%rax),%r13
  d02ed2:       4c 89 ef                mov    %r13,%rdi
  d02ed5:       e8 f6 1a c2 ff          call   9249d0 <JavaThread::is_interrupted(bool)>
  d02eda:       4d 85 f6                test   %r14,%r14

It copies r13 into rdi before the call, which is plausible because r13 is 0 as well in the hs_err file when the crash happens. It also initializes the first method parameter (i.e. esi) to 0, which is also plausible because Parker::park() calls JavaThread::is_interrupted() with a false argument:

void Parker::park(bool isAbsolute, jlong time) {

  // Optional fast-path check:
  // Return immediately if a permit is available.
  // We depend on Atomic::xchg() having full barrier semantics
  // since we are doing a lock-free update to _counter.
  if (Atomic::xchg(&_counter, 0) > 0) return;

  JavaThread *jt = JavaThread::current();

  // Optional optimization -- avoid state transitions if there's
  // an interrupt pending.
  if (jt->is_interrupted(false)) {
    return;
  }

So it looks like JavaThread::current(), which reads the current JavaThread from a thread-local variable, returns 0 for the current thread. I'm not sure how this can happen. According to the message of the assertion in Thread::current(), this can happen if a thread has already been detached:

class JavaThread: public Thread {
  ...
  static JavaThread* current() {
    return JavaThread::cast(Thread::current());
  }
  ...
}
...
inline Thread* Thread::current() {
  Thread* current = current_or_null();
  assert(current != nullptr, "Thread::current() called on detached thread");
  return current;
}
...
inline Thread* Thread::current_or_null() {
#ifndef USE_LIBRARY_BASED_TLS_ONLY
  return _thr_current;
#else
  if (ThreadLocalStorage::is_initialized()) {
    return ThreadLocalStorage::thread();
  }
  return nullptr;
#endif
}
...
#ifndef USE_LIBRARY_BASED_TLS_ONLY
// Current thread is maintained as a thread-local variable
THREAD_LOCAL Thread* Thread::_thr_current = nullptr;
#endif
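
A small C++ sketch of the suspected failure mode, using hypothetical stand-ins (Thread, thr_current, attach, detach and current below are illustrative and not HotSpot code): once the thread-local pointer has been cleared, current() hands back nullptr. As far as I know the assert in Thread::current() is only active in debug builds, so the sketch leaves that check out to mimic a product build.

```cpp
#include <cassert>

// Hypothetical stand-in for HotSpot's Thread; only its identity matters.
struct Thread {};

// Per-thread pointer, playing the role of Thread::_thr_current.
thread_local Thread* thr_current = nullptr;

// Attach: record the Thread object for the calling OS thread.
inline void attach(Thread* t) {
    assert(t != nullptr);
    thr_current = t;
}

// Detach: clear the pointer again (assumed here to model a thread
// that is no longer registered with the VM).
inline void detach() { thr_current = nullptr; }

// Mirrors Thread::current() without the debug-only assert:
// a detached thread silently gets nullptr back.
inline Thread* current() { return thr_current; }
```

Calling current() after detach() yields nullptr; a member function invoked on that result then reads through a null this, which is the pattern seen in the crash.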

The hs_err file also doesn't contain a list of active Java threads because of:

[error occurred during error reporting (printing all threads), id 0xb, SIGSEGV (0xb) at pc=0x00007f37b46ad69d]

so it seems that indeed the HotSpot internal bookkeeping of Java threads was corrupted.

simonis avatar Jun 26 '25 12:06 simonis

@simonis Thanks for the explanation. I submitted a ticket in dd-trace-java as well.

metodski avatar Jul 14 '25 10:07 metodski