openj9
openj9 copied to clipboard
Skynet crash vmState=0x00020002
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_x86-64_windows_Release_testList_0/5/ jdk_lang_j9_0 java/lang/Thread/virtual/stress/Skynet.java#default
https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk21_j9_sanity.openjdk_x86-64_windows_Release_testList_0/5/openjdk_test_output.tar.gz
09:51:03 ACTION: main -- Failed. Unexpected exit from test [exit code: -1]
09:51:03 REASON: User specified action: run main/othervm/timeout=300 -Xmx1g -Xssi2K -Xnocompressedrefs Skynet
09:51:03 TIME: 827.062 seconds
09:51:03 messages:
09:51:03 command: main -Xmx1g -Xssi2K -Xnocompressedrefs Skynet
09:51:03 reason: User specified action: run main/othervm/timeout=300 -Xmx1g -Xssi2K -Xnocompressedrefs Skynet
09:51:03 started: Sun Jan 07 09:37:05 EST 2024
09:51:03 Mode: othervm [/othervm specified]
09:51:03 finished: Sun Jan 07 09:50:52 EST 2024
09:51:03 elapsed time (seconds): 827.062
09:51:03 configuration:
09:51:03 STDOUT:
09:51:03 Result: 499999500000 in 293703 ms
09:51:03 Result: 499999500000 in 265969 ms
09:51:03 Result: 499999500000 in 53109 ms
09:51:03 Result: 499999500000 in 38422 ms
09:51:03 STDERR:
09:51:03 Unhandled exception
09:51:03 Type=Segmentation error vmState=0x00020002
09:51:03 Windows_ExceptionCode=c0000005 J9Generic_Signal=00000004 ExceptionAddress=00007FFF3FAB64EA ContextFlags=0010005f
09:51:03 Handler1=00007FFF43EFB5B0 Handler2=00007FFF43DDABA0 InaccessibleReadAddress=0000000000000000
09:51:03 RDI=0000021CBD4F11E8 RSI=0000021CBD4F11E8 RAX=0000000099669966 RBX=00007FF4EBF4EE98
09:51:03 RCX=0000000000000003 RDX=0000021CBD4F11E8 R8=00007FF4EBF4EE98 R9=0000000000000001
09:51:03 R10=0000000800000000 R11=0000000000000246 R12=00007FFF3FD03790 R13=0000000000000000
09:51:03 R14=0000000000000001 R15=000004D5C0D73EC0
09:51:03 RIP=00007FFF3FAB64EA RSP=000000C32B57F350 RBP=0000021CB21C5320 EFLAGS=0000000000010206
09:51:03 FS=0053 ES=002B DS=002B
09:51:03 XMM0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM6 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
09:51:03 Module=C:\Users\jenkins\workspace\Test_openjdk21_j9_sanity.openjdk_x86-64_windows_Release_testList_0\jdkbinary\j2sdk-image\bin\default\j9gc_full29.dll
09:51:03 Module_base_address=00007FFF3F9E0000 Offset_in_DLL=00000000000d64ea
09:51:03 Target=2_90_20240107_9 (Windows Server 2019 10.0 build 17763)
09:51:03 CPU=amd64 (4 logical CPUs) (0x3fff77000 RAM)
09:51:03 ----------- Stack Backtrace -----------
09:51:03 J9VMDllMain+0xd54da (0x00007FFF3FAB64EA [j9gc_full29+0xd64ea])
09:51:03 J9VMDllMain+0xdb3b4 (0x00007FFF3FABC3C4 [j9gc_full29+0xdc3c4])
09:51:03 J9VMDllMain+0xd1419 (0x00007FFF3FAB2429 [j9gc_full29+0xd2429])
09:51:03 J9VMDllMain+0x14ec5d (0x00007FFF3FB2FC6D [j9gc_full29+0x14fc6d])
09:51:03 J9VMDllMain+0x14fdf1 (0x00007FFF3FB30E01 [j9gc_full29+0x150e01])
09:51:03 j9port_isCompatible+0x1a1bb (0x00007FFF43DDCD9B [j9prt29+0x1cd9b])
09:51:03 J9VMDllMain+0x14ff4b (0x00007FFF3FB30F5B [j9gc_full29+0x150f5b])
09:51:03 omrthread_get_category+0xa42 (0x00007FFF46FF4242 [j9thr29+0x4242])
09:51:03 _o_exp+0x5a (0x00007FFF5282268A [ucrtbase+0x2268a])
09:51:03 BaseThreadInitThunk+0x14 (0x00007FFF557B7AC4 [KERNEL32+0x17ac4])
09:51:03 RtlUserThreadStart+0x21 (0x00007FFF55BEA4E1 [ntdll+0x5a4e1])
@fengxue-IS @babsingh @amicic @dmitripivkine @LinHu2016 @TobiAjila fyi
09:51:03 Module=...\j9gc_full29.dll: The crash occurred in the GC library. In the above native stack, symbols are not resolved. The corefile can be opened in windbg, and merging the debug-image.tar.gz files in OpenJ9-JDK21-x86-64_windows-20240107-070702.tar.gz might resolve the symbols in the native stack.
This crash occur in GC Worker thread in Balanced STW GMP increment at attempt to scan bogus object pointer 0x7FF4EBF4EE98 (mid-object !j9object 0x7FF4EBF4EE48), taken from the work packet at 0x21D78472940:
0x7FF4EBF4EE40 : 000000000000000a 0000021cbd8fe810 [ ................ ]
0x7FF4EBF4EE50 : 0000000000000000 0000000000000000 [ ................ ]
0x7FF4EBF4EE60 : 00000000004fcd36 0000000000000000 [ 6.O............. ]
0x7FF4EBF4EE70 : 0000000000000000 00007ff4bc21a000 [ ..........!..... ]
0x7FF4EBF4EE80 : 00007ff4bc248e80 00007ff4bc915a28 [ ..$.....(Z...... ]
0x7FF4EBF4EE90 : 0000000000000000 0000000000000000 [ ................ ] <---- mid object
0x7FF4EBF4EEA0 : 0000000000000000 00007ff4bc20e868 [ ........h. ..... ]
0x7FF4EBF4EEB0 : 00007ff4ebf4ef38 0000000000000000 [ 8............... ]
!mm_packet 0x0000021D7846FDC0
MM_Packet at 0x21d7846fdc0 {
Fields for MM_Base:
Fields for MM_BaseNonVirtual:
0x0: const U8* _typeId = !j9x 0x0000000000000000
Fields for MM_Packet:
0x8: U64* _baseAddress = !j9x 0x0000021D78471B00
0x10: U64* _basePtr = !j9x 0x0000021D78471B00
0x18: U64* _topPtr = !j9x 0x0000021D78472B00
0x20: U64* _currentPtr = !j9x 0x0000021D78472940 <------ current pointer
0x28: U64 _sublistIndex = 0x0000000000000000 (0)
0x30: class MM_EnvironmentBase* _owner = !mm_environmentvlhgc 0x0000021CBD4F11E8
0x38: class MM_Packet* _next = !mm_packet 0x0000021D784707E0
0x40: class MM_Packet* _previous = !mm_packet 0x0000000000000000
}
0x21D78472930 : 00007ff4f06a39d8 00007ff4f06a3be8 [ .9j......;j..... ]
0x21D78472940 : 00007ff4ebf4ee98 00007ff4f06a3df8 [ .........=j..... ] <------- bogus pointer in WP
0x21D78472950 : 00007ff4f06a4008 00007ff4f06a82f8 [ [email protected]..... ]
So, now there is a question how this bogus (most likely stale) pointer was added to the WP. Searching across system core:
> !findall u64 0x7FF4EBF4EE98
Scanning memory for 98 ee f4 eb f4 7f 00 00 aligned to 8 starting from 0x0
Match found at 0xc32b57ea00
Match found at 0xc32b57ec50
Match found at 0xc32b57ecd0
Match found at 0xc32b57ecf8
Match found at 0xc32b57f3d0 <--- current RSP=000000C32B57F350, so in the c-stack
Match found at 0x21d78472940 <-- value in WP
Match found at 0x7ff4cef7bf98 <-- unmarked (dead) object in the heap
No more matches
First five values seems came from c-stack of crashed thread, next one from WP itself and last one from unmarked object in the heap (btw it expects Ljava/util/concurrent/SynchronousQueue$TransferStack$SNode type). So, none of these location might be attributed as a source of such pointer in the WP (except I have recognized one of them wrong). It is possible that bogus address was added on concurrent stage of marking (so, some time ago) from scanning of native structure which memory was released later on and does not exist at the moment.
Adding VM label as well, this problem is not GC internal most likely.
My statement that object
> !j9object 0x7FF4CEF7BF80
!J9Object 0x00007FF4CEF7BF80 {
struct J9Class* clazz = !j9class 0x21CD163DC00 // java/util/concurrent/SynchronousQueue$TransferStack$SNode
Object flags = 0x00000010;
J lockword = 0x0000000000000000 (offset = 0) (java/lang/Object) <hidden>
Ljava/util/concurrent/SynchronousQueue$TransferStack$SNode; next = !fj9object 0x0 (offset = 8) (java/util/concurrent/SynchronousQueue$TransferStack$SNode)
Ljava/util/concurrent/SynchronousQueue$TransferStack$SNode; match = !fj9object 0x7ff4ebf4ee98 (offset = 16) (java/util/concurrent/SynchronousQueue$TransferStack$SNode)
Ljava/lang/Thread; waiter = !fj9object 0x0 (offset = 24) (java/util/concurrent/SynchronousQueue$TransferStack$SNode)
Ljava/lang/Object; item = !fj9object 0x7ff4cef7bf68 (offset = 32) (java/util/concurrent/SynchronousQueue$TransferStack$SNode)
I mode = 0x00000001 (offset = 40) (java/util/concurrent/SynchronousQueue$TransferStack$SNode)
}
is dead might be not correct. Balanced GC uses two mark maps and DDR might look to the wrong one by default. And this object is marked in another one. Assuming we got bad object pointer from this object (it must be heap object scanned) trying to search it across the core:
> !findall u64 0x7FF4CEF7BF80
Scanning memory for 80 bf f7 ce f4 7f 00 00 aligned to 8 starting from 0x0
Match found at 0x21d273025c0
Match found at 0x21d27302670
No more matches
> !findheader 0x21d273025c0
Searching memory allocation header for 0x0000021D273025C0
Found memory allocation header, !j9x 0x21d272ff950,0x3008
J9MemTag at 0x21d272ff930 {
U_32 eyeCatcher = 0xb1234567;
U_32 sumCheck = 0x9141062a;
UDATA allocSize = 0x3008;
char* callSite = f:\users\jenkins\workspace\build_jdk21_x86-64_windows_release\openj9\runtime\vm\vmthread.cpp:1434;
struct OMRMemCategory* category = !omrmemcategory 0x7fff44b35728 (Java Stack);
}
I can not find this stack address for any regular thread. Also there are a lot of virtual threads around, !vthreads dumps many of them but crash eventually. And I can't associate this address with any virtual thread stack either.
I think most likely GC has discovered bad object pointer in one of virtual threads stacks.
We also cache Java stacks once a virtual thread completes so that it can be reused with a future virtual thread. @fengxue-IS Is it possible for the GC to scan one of the cached/unused stacks?
https://github.com/eclipse-openj9/openj9/blob/e28d098f29278683b8c5208cdd6d62613325a0a7/runtime/vm/ContinuationHelpers.cpp#L39
https://github.com/eclipse-openj9/openj9/blob/e28d098f29278683b8c5208cdd6d62613325a0a7/runtime/vm/ContinuationHelpers.cpp#L362
This problem is not reproducible, moving to 0.45 release.
Was not seen since end of January. Closing for now, please reopen if it occur again.