openj9 icon indicating copy to clipboard operation
openj9 copied to clipboard

jdk25 openjdk java/foreign/TestBufferStackStress timeout

Open pshipton opened this issue 5 months ago • 90 comments

https://openj9-jenkins.osuosl.org/job/Test_openjdk25_j9_sanity.openjdk_ppc64le_linux_Nightly_testList_0/1 jdk_foreign_1 java/foreign/TestBufferStackStress.java

23:14:21  Unhandled exception
23:14:21  Type=Segmentation error vmState=0x00000000
23:14:21  J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
23:14:21  Handler1=00006165E9044320 Handler2=00006165E93A8220
23:14:21  R0=0000000044884454 R1=00006165C3E4B550 R2=00006165E3DEE800 R3=00000000005B9500
23:14:21  R4=00006165C8A10EB8 R5=00006165C8A10EA0 R6=00000000FFF4EE40 R7=0000000000000018
23:14:21  R8=00006165C3E4B6F0 R9=00006165C8A10EA0 R10=00006165C8A10EB8 R11=00006165C8A10EA0
23:14:21  R12=00006165C8A10ED8 R13=00006165C3E568E0 R14=000000000061EB00 R15=00000000005B9500
23:14:21  R16=0000000000000000 R17=FFFFFFFFFFFFFFFF R18=0000000000000000 R19=0000000000000000
23:14:21  R20=00000000FFEF1140 R21=0000000000033490 R22=00000000E03ECD90 R23=0000000000000029
23:14:21  R24=00000000FFEF1118 R25=00000000FFF30218 R26=00000000FFEF1118 R27=0000000000000008
23:14:21  R28=00000000E04231A8 R29=0000000000000008 R30=0000000000000000 R31=00000000005B9500
23:14:21  NIP=00006165E3AA4BA0 MSR=800000000280D033 ORIG_GPR3=00006165E3AB8504 CTR=00006165E3AB8498
23:14:21  LINK=00006165E3AB8508 XER=00000000000000DD CCR=0000000044884454 SOFTE=0000000000000001
23:14:21  TRAP=0000000000000300 DAR=0000000000000060 dsisr=0000000040000000 RESULT=0000000000000000
23:14:21  FPR0=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR1=40527e8140000000 (f: 1073741824.000000, d: 7.397664e+01)
23:14:21  FPR2=4125fc0000000000 (f: 0.000000, d: 7.203840e+05)
23:14:21  FPR3=3fee666660000000 (f: 1610612736.000000, d: 9.500000e-01)
23:14:21  FPR4=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR5=3fe62e42fefa39ef (f: 4277811712.000000, d: 6.931472e-01)
23:14:21  FPR6=3fd4043057b6ee09 (f: 1471606272.000000, d: 3.127557e-01)
23:14:21  FPR7=bff0000000000000 (f: 0.000000, d: -1.000000e+00)
23:14:21  FPR8=bfdffffef20a4123 (f: 4060758272.000000, d: -4.999997e-01)
23:14:21  FPR9=3fd5575b0be00b6a (f: 199232368.000000, d: 3.334568e-01)
23:14:21  FPR10=bfd00ea348b88334 (f: 1220051712.000000, d: -2.508934e-01)
23:14:21  FPR11=000000000000001e (f: 30.000000, d: 1.482197e-322)
23:14:21  FPR12=3ff0000000000000 (f: 0.000000, d: 1.000000e+00)
23:14:21  FPR13=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR14=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR15=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR16=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR17=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR18=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR19=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR20=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR21=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR22=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR23=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR24=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR25=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR26=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR27=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR28=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR29=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR30=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR31=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  Module=/home/jenkins/workspace/Grinder_iteration_0/jdkbinary/j2sdk-image/lib/default/libj9jit29.so
23:14:21  Module_base_address=00006165E2E00000
23:14:21  Target=2_90_20250813_14 (Linux 6.8.0-45-generic)
23:14:21  CPU=ppc64le (4 logical CPUs) (0x1ea0e0000 RAM)
23:14:21  ----------- Stack Backtrace -----------
23:14:21  old_fast_jitLookupInterfaceMethod+0x60 (0x00006165E3AA4BA0 [libj9jit29.so+0xca4ba0])
23:14:21   (0x00006165E3AB8508 [libj9jit29.so+0xcb8508])
23:14:21  runJavaThread+0x250 (0x00006165E9018BF0 [libj9vm29.so+0x18bf0])
23:14:21  javaProtectedThreadProc+0xc8 (0x00006165E90C0488 [libj9vm29.so+0xc0488])
23:14:21  omrsig_protect+0x3e4 (0x00006165E93A96D4 [libj9prt29.so+0x396d4])
23:14:21  javaThreadProc+0x60 (0x00006165E90BB900 [libj9vm29.so+0xbb900])
23:14:21  thread_wrapper+0x190 (0x00006165E933CE80 [libj9thr29.so+0xce80])
23:14:21   (0x00006165E9AB2A9C [libc.so.6+0xb2a9c])
23:14:21  ---------------------------------------
23:14:21  JVMDUMP039I Processing dump event "gpf", detail "" at 2025/08/14 02:52:06 - please wait.
23:14:21  Unhandled exception
23:14:21  Type=Segmentation error vmState=0x0002000f
23:14:21  J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
23:14:21  Handler1=00006165E9044780 Handler2=00006165E93A8220
23:14:21  R0=00006165E90A31A4 R1=00006165C1AAB6D0 R2=00006165E92E6B00 R3=0000616554004E00
23:14:21  R4=00006165E92E3A40 R5=00006165E90A23B0 R6=0000000000000006 R7=0000000024600002
23:14:21  R8=000000000061EB20 R9=0000000000000000 R10=0000000000000000 R11=00006165E8AA1850
23:14:21  R12=0000000022884828 R13=00006165C1AB68E0 R14=00006165E9CFD2C8 R15=00000000191116E2
23:14:21  R16=00006165C1A20000 R17=0000000000000001 R18=00006165C1AAF8E0 R19=000000004A39564D
23:14:21  R20=00006165E9ED10D0 R21=0000000000000000 R22=00006165E9ED0000 R23=00006165E99F1F88
23:14:21  R24=00006165E926C540 R25=0000000000000000 R26=00006165E92E3718 R27=0000000004600002
23:14:21  R28=000000000061EB00 R29=0000000000000006 R30=0000000000293600 R31=00006165C1AAB7F8
23:14:21  NIP=00006165E90A3A50 MSR=800000000280D033 ORIG_GPR3=00006165E90A33E0 CTR=00006165E90A3A20
23:14:21  LINK=00006165E90A31A4 XER=00000000200400DD CCR=0000000042884828 SOFTE=0000000000000001
23:14:21  TRAP=0000000000000300 DAR=0000000000000008 dsisr=0000000040000000 RESULT=0000000000000000
23:14:21  FPR0=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR1=40539df400000000 (f: 0.000000, d: 7.846802e+01)
23:14:21  FPR2=4138ee0000000000 (f: 0.000000, d: 1.633792e+06)
23:14:21  FPR3=3fce840b4ac4e4d2 (f: 1254417664.000000, d: 2.384047e-01)
23:14:21  FPR4=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR5=3fe62e42fefa39ef (f: 4277811712.000000, d: 6.931472e-01)
23:14:21  FPR6=bfd01eae7f513a67 (f: 2136029824.000000, d: -2.518727e-01)
23:14:21  FPR7=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR8=bfdffffef20a4123 (f: 4060758272.000000, d: -4.999997e-01)
23:14:21  FPR9=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR10=0000000200000000 (f: 0.000000, d: 4.243992e-314)
23:14:21  FPR11=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR12=0000000200000002 (f: 2.000000, d: 4.243992e-314)
23:14:21  FPR13=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR14=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR15=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR16=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR17=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR18=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR19=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR20=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR21=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR22=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR23=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR24=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR25=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR26=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR27=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR28=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR29=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR30=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  FPR31=0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:14:21  Module=/home/jenkins/workspace/Grinder_iteration_0/jdkbinary/j2sdk-image/lib/default/libj9vm29.so
23:14:21  Module_base_address=00006165E9000000
23:14:21  Target=2_90_20250813_14 (Linux 6.8.0-45-generic)
23:14:21  CPU=ppc64le (4 logical CPUs) (0x1ea0e0000 RAM)
23:14:21  ----------- Stack Backtrace -----------
23:14:21  walkStackFrames+0x970 (0x00006165E90A3A50 [libj9vm29.so+0xa3a50])
23:14:21  _ZN28GC_VMThreadStackSlotIterator9scanSlotsEP10J9VMThreadS1_PvPFvP8J9JavaVMPP8J9ObjectS2_P16J9StackWalkStatePKvEbb+0x5c (0x00006165E865F8FC [libj9gc29.so+0x5f8fc])
23:14:21  _ZN14MM_RootScanner13scanOneThreadEP18MM_EnvironmentBaseP10J9VMThreadPv+0x14c (0x00006165E865099C [libj9gc29.so+0x5099c])
23:14:21  _ZN14MM_RootScanner11scanThreadsEP18MM_EnvironmentBase+0x114 (0x00006165E864E9D4 [libj9gc29.so+0x4e9d4])
23:14:21  _ZN14MM_RootScanner9scanRootsEP18MM_EnvironmentBase+0x80 (0x00006165E8652980 [libj9gc29.so+0x52980])
23:14:21  _ZN12MM_Scavenger24workThreadGarbageCollectEP22MM_EnvironmentStandard+0x2a8 (0x00006165E87F3648 [libj9gc29.so+0x1f3648])
23:14:21  _ZN23MM_ParallelScavengeTask3runEP18MM_EnvironmentBase+0x1c (0x00006165E886FC5C [libj9gc29.so+0x26fc5c])
23:14:21  _ZN21MM_ParallelDispatcher16workerEntryPointEP18MM_EnvironmentBase+0x278 (0x00006165E878B638 [libj9gc29.so+0x18b638])
23:14:21  _Z23dispatcher_thread_proc2P14OMRPortLibraryPv+0x160 (0x00006165E8789C60 [libj9gc29.so+0x189c60])
23:14:21  omrsig_protect+0x3e4 (0x00006165E93A96D4 [libj9prt29.so+0x396d4])
23:14:21  dispatcher_thread_proc+0x50 (0x00006165E8789750 [libj9gc29.so+0x189750])
23:14:21  thread_wrapper+0x190 (0x00006165E933CE80 [libj9thr29.so+0xce80])
23:14:21   (0x00006165E9AB2A9C [libc.so.6+0xb2a9c])
23:14:21  ---------------------------------------
23:14:21  JVMDUMP039I Processing dump event "gpf", detail "" at 2025/08/14 02:52:06 - please wait.
23:14:21  
23:14:21  TEST RESULT: Error. "junit" action timed out with a timeout of 960 seconds on agent 2

pshipton avatar Jul 14 '25 21:07 pshipton

The failure occurred once in a 250x grinder run: https://openj9-jenkins.osuosl.org/job/Grinder/4565

Although the logs mention a timeout, the actual issue appears to be a JIT crash, followed by a GC crash during stack walking. It is possible the JIT introduced a corrupt artifact on the stack. I have updated the issue description with the segfault and relevant native crash details.

Here is the sub-grinder job where the crash occurred: https://openj9-jenkins.osuosl.org/job/Grinder_iteration_0/722/consoleFull

Core files: https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Grinder_iteration_0/722/openjdk_test_output.tar.gz

fyi @hzongaro

babsingh avatar Aug 14 '25 18:08 babsingh

@babsingh This result file just 2,5M, so core is not stored. Interesting that crash stack shows walkStackFrames, not JIT Frame walker. So either it was Interpreted frame or code crashed on the way to/from JIT Frame walker.

dmitripivkine avatar Aug 14 '25 18:08 dmitripivkine

Ah, the core files weren’t generated; the test timed out while attempting to generate them after the GPF.

There’s a call to jitWalkStackFrames within walkStackFrames to handle transitions from the JIT to interpreter. Since the initial crash occurred in the JIT, it’s likely that the failure happens during the JIT to interpreter transition.

babsingh avatar Aug 14 '25 18:08 babsingh

I was able to reproduce the failure twice in an internal grinder run, but again, there were no core files.

the test timed out while attempting to generate them after the GPF

Is the test timing out while attempting to generate the core files or is the segfault happening after the test is in the process of timing out?

hzongaro avatar Aug 16 '25 01:08 hzongaro

@zl-wang FYI since this is on P

vij-singh avatar Aug 18 '25 13:08 vij-singh

@rmnattas please take a look from JIT perspective

zl-wang avatar Aug 18 '25 13:08 zl-wang

I did manage to get a core file from an internal grinder run, but I hadn't looked at it yet:

./aqa-tests/TKG/output_17553074571672/jdk_custom_0_ITER_17/work/java/foreign/TestBufferStackStress/core.20250816.013245.1716304.0001.dmp

hzongaro avatar Aug 18 '25 14:08 hzongaro

And two more core files from this iteration of that same grinder run.

hzongaro avatar Aug 18 '25 14:08 hzongaro

I took a quick look in kca at the core file that I mentioned above

./aqa-tests/TKG/output_17553074571672/jdk_custom_0_ITER_17/work/java/foreign/TestBufferStackStress/core.20250816.013245.1716304.0001.dmp

Signal 11 (SIGSEGV) at 0x000077e26b9797bc

Reading the JVM Crash Information...
   The register context for the failing thread (t64) was extracted from the J9RAS data. Use the "reg context 3" command to select this context.
   Fault information @ 0x000077e228b094f0
   Failing J9VMThread  0x0000000000654000
   Signal 11 (SIGSEGV)
   Signal Code 1


(kca) reg context 3
(kca) where
0x000077e24a99a9c0 {jdk/internal/foreign/SlicingAllocator.canAllocate} JIT [0x11a6ab0]
0x000077e24a9ac434 {jdk/internal/foreign/BufferStack.pushFrame} JIT [0x11a6b00]
0x000077e24a193de8 {TestBufferStackStress.lambda$stress$1} JIT [0x11a6ba0]

$iar = 0x000077e24a99a9c0 {jdk/internal/foreign/SlicingAllocator.canAllocate} +19

(kca) (0x000077e24a99a974)/184i
...
0x77e24a99a9b0 {jdk/.../SlicingAllocator.canAllocate} +15                10007e80 lwz       r3, 0x10(r30)
0x77e24a99a9b4 {jdk/.../SlicingAllocator.canAllocate} +16   -1:1         00008308 tdeqi     r3, 0 // getfield 6 jdk/.../SlicingAllocator.segment (Ljav
a/.../MemorySegment;)
0x77e24a99a9b8 {jdk/.../SlicingAllocator.canAllocate} +17                00004380 lwz       r2, 0(r3)
0x77e24a99a9bc {jdk/.../SlicingAllocator.canAllocate} +18                2e004254 rlwinm    r2, r2, 0, 0, 0x17
0x77e24a99a9c0 {jdk/.../SlicingAllocator.canAllocate} +19                780102e8 ld        r0, 0x178(r2)    <<<<<<

$r30 = 0x00000000e04b5f28 Obj - {jdk/internal/foreign/SlicingAllocator}T
$r3  = 0x00000000fe7bf6c0 Obj - {jdk/internal/foreign/NativeMemorySegmentImpl} +24


(kca) obj 0x00000000e04b5f28
 Object class type: {jdk/internal/foreign/SlicingAllocator} J9Class 0x00000000006b3300
     Hidden Fields: 1
>                       lockWord ( +4) = 0x0000000000000008 - <hidden>
Super class fields: {java/lang/Object}  Number of Fields: 0
      Class fields:   Number of Fields: 2
>                        segment (+16) = 0x00000000fe7bf6c0 (Ljava/lang/foreign/MemorySegment;) - 20012 Private Final  : Obj - {jdk/internal/foreign/NativeMemorySegmentImpl} +24 
>                             sp ( +8) = 0x0000000000000000 (J) - 3c0002 Private 

So it looks like the SlicingAllocator.segment field has somehow been corrupted - it has value that's 24 bytes from the start of a segment object.

hzongaro avatar Aug 20 '25 14:08 hzongaro

Looking at the core files in the other iteration I mentioned I see something very similar in ./aqa-tests/TKG/output_17553086823031/jdk_custom_0_ITER_7/work/java/foreign/TestBufferStackStress/core.20250815.214815.2893357.0001.dmp:

$iar = 0x0000728af2432140 {jdk/internal/foreign/SlicingAllocator.canAllocate} +19
...
0x728af2432130 {jdk/.../SlicingAllocator.canAllocate} +15                10007e80 lwz       r3, 0x10(r30)
0x728af2432134 {jdk/.../SlicingAllocator.canAllocate} +16   -1:1         00008308 tdeqi     r3, 0 // getfield 6 jdk/.../SlicingAllocator.segment (Ljav
a/.../MemorySegment;)
0x728af2432138 {jdk/.../SlicingAllocator.canAllocate} +17                00004380 lwz       r2, 0(r3)
0x728af243213c {jdk/.../SlicingAllocator.canAllocate} +18                2e004254 rlwinm    r2, r2, 0, 0, 0x17
0x728af2432140 {jdk/.../SlicingAllocator.canAllocate} +19                780102e8 ld        r0, 0x178(r2)
...
$r30 = 0x00000000e0452978 Obj - {jdk/internal/foreign/SlicingAllocator}T
$r3  = 0x00000000fc94d578 Obj - {jdk/internal/foreign/SlicingAllocator} +16
(kca) obj 0x00000000e0452978
 Object class type: {jdk/internal/foreign/SlicingAllocator} J9Class 0x00000000006b0500
     Hidden Fields: 1
>                       lockWord ( +4) = 0x0000000000000008 - <hidden>
Super class fields: {java/lang/Object}  Number of Fields: 0
      Class fields:   Number of Fields: 2
>                        segment (+16) = 0x00000000fc94d578 (Ljava/lang/foreign/MemorySegment;) - 20012 Private Final  : Obj - {jdk/internal/foreign/SlicingAllocator} +16 
>                             sp ( +8) = 0x0000000000000000 (J) - 3c0002 Private 

and in ./aqa-tests/TKG/output_1755313447470/jdk_custom_0_ITER_11/work/java/foreign/TestBufferStackStress/core.20250815.230956.2965167.0001.dmp

(kca) thread t52
Current thread: t52, TID 2965236, J9VMThread 0x61fe00
(kca) reg context 3
(kca) where
...
0x000070cbe8a19b40 {jdk/internal/foreign/SlicingAllocator.canAllocate} JIT [0x795d60]
0x000070cbe8a2412c {TestBufferStackStress.lambda$stress$1} JIT [0x795db0]
...
0x70cbe8a19b30 {jdk/.../SlicingAllocator.canAllocate} +15                10007e80 lwz       r3, 0x10(r30)
0x70cbe8a19b34 {jdk/.../SlicingAllocator.canAllocate} +16   -1:1         00008308 tdeqi     r3, 0 // getfield 6 jdk/.../SlicingAllocator.segment (Ljava/.../MemorySegment;)
0x70cbe8a19b38 {jdk/.../SlicingAllocator.canAllocate} +17                00004380 lwz       r2, 0(r3)
0x70cbe8a19b3c {jdk/.../SlicingAllocator.canAllocate} +18                2e004254 rlwinm    r2, r2, 0, 0, 0x17
0x70cbe8a19b40 {jdk/.../SlicingAllocator.canAllocate} +19                780102e8 ld        r0, 0x178(r2)

$r30 = 0x00000000e0174ce0 Obj - {jdk/internal/foreign/SlicingAllocator}T
$r3  = 0x00000000fad75490 Obj - {jdk/internal/foreign/NativeMemorySegmentImpl} +8 
(kca) obj 0x00000000e0174ce0
 Object class type: {jdk/internal/foreign/SlicingAllocator} J9Class 0x00000000006aa900
     Hidden Fields: 1
>                       lockWord ( +4) = 0x0000000000000008 - <hidden>
Super class fields: {java/lang/Object}  Number of Fields: 0
      Class fields:   Number of Fields: 2
>                        segment (+16) = 0x00000000fad75490 (Ljava/lang/foreign/MemorySegment;) - 20012 Private Final  : Obj - {jdk/internal/foreign/NativeMemorySegmentImpl} +8 
>                             sp ( +8) = 0x0000000000000000 (J) - 3c0002 Private 

hzongaro avatar Aug 20 '25 14:08 hzongaro

i suspected this fix here: https://github.com/eclipse-openj9/openj9/pull/22370 is applicable to this defect.

zl-wang avatar Aug 20 '25 15:08 zl-wang

Can we try this out with #22370?

vij-singh avatar Aug 21 '25 16:08 vij-singh

I am currently in the middle of making a build that includes https://github.com/eclipse-openj9/openj9/pull/22370 to see if the problem still occurs.

IBMJimmyk avatar Aug 21 '25 17:08 IBMJimmyk

Unfortunately, it did not help.

I ran an internal grinder to perform 1000 runs and got 1 failure which can be seen here: https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder_iteration_1/6098/

The error message is different from earlier ones posted in this issue but, it isn't a new error message. It also showed up in one of Henry's grinder runs. So it isn't a result of using https://github.com/eclipse-openj9/openj9/pull/22370 .

The error message looks like this.

[2025-08-21T21:32:50.736Z] Unhandled exception
[2025-08-21T21:32:50.736Z] Type=Segmentation error vmState=0x00000000
[2025-08-21T21:32:50.736Z] J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
[2025-08-21T21:32:50.736Z] Handler1=00007898A3D64B50 Handler2=00007898A81F90E0
[2025-08-21T21:32:50.736Z] R0=00007898877CAD40 R1=000078988609B570 R2=0000000000000000 R3=00000000FA2F9360
[2025-08-21T21:32:50.736Z] R4=0000000000000008 R5=0000000000000008 R6=00000000E0000000 R7=0000000000600000
[2025-08-21T21:32:50.736Z] R8=0000000000000000 R9=0000000000000020 R10=0000000000775788 R11=00000000007712F8
[2025-08-21T21:32:50.736Z] R12=0000000000000008 R13=00007898860A68E0 R14=0000000000775810 R15=000000000068F900
[2025-08-21T21:32:50.736Z] R16=00007898848B0038 R17=FFFFFFFFFFFFFFFF R18=0000000000000000 R19=0000000000000000
[2025-08-21T21:32:50.737Z] R20=0000000000000000 R21=0000000000000000 R22=00000000E03F5648 R23=00000000E03F7868
[2025-08-21T21:32:50.737Z] R24=00000000E03F7578 R25=00000000E007C3A8 R26=00000000E007C3A8 R27=00000000E0477448
[2025-08-21T21:32:50.737Z] R28=0000000000000001 R29=0000000000000008 R30=00000000E047EA60 R31=00000000E047EA38
[2025-08-21T21:32:50.737Z] NIP=00007898877C93C0 MSR=800000000280D033 ORIG_GPR3=00007898877CAD3C CTR=00007898A1BB3510
[2025-08-21T21:32:50.737Z] LINK=00007898877CAD40 XER=000000000000009E CCR=0000000042884824 SOFTE=0000000000000001
[2025-08-21T21:32:50.737Z] TRAP=0000000000000300 DAR=0000000000000178 dsisr=0000000040000000 RESULT=0000000000000000
[2025-08-21T21:32:50.737Z] FPR0=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR1=40590fdce0000000 (f: 3758096384.000000, d: 1.002479e+02)
[2025-08-21T21:32:50.737Z] FPR2=401ca5a5c0000000 (f: 3221225472.000000, d: 7.161765e+00)
[2025-08-21T21:32:50.737Z] FPR3=3fee666660000000 (f: 1610612736.000000, d: 9.500000e-01)
[2025-08-21T21:32:50.737Z] FPR4=40099451c0000000 (f: 3221225472.000000, d: 3.197422e+00)
[2025-08-21T21:32:50.737Z] FPR5=3fe62e42fefa39ef (f: 4277811712.000000, d: 6.931472e-01)
[2025-08-21T21:32:50.737Z] FPR6=3fd1058bc8a07ee1 (f: 3365961472.000000, d: 2.659635e-01)
[2025-08-21T21:32:50.737Z] FPR7=bff0000000000000 (f: 0.000000, d: -1.000000e+00)
[2025-08-21T21:32:50.737Z] FPR8=bfdffffef20a4123 (f: 4060758272.000000, d: -4.999997e-01)
[2025-08-21T21:32:50.737Z] FPR9=3fd5575b0be00b6a (f: 199232368.000000, d: 3.334568e-01)
[2025-08-21T21:32:50.737Z] FPR10=bfd00ea348b88334 (f: 1220051712.000000, d: -2.508934e-01)
[2025-08-21T21:32:50.737Z] FPR11=bfe01c5e5c70fa58 (f: 1550908032.000000, d: -5.034630e-01)
[2025-08-21T21:32:50.737Z] FPR12=3ff0000000000000 (f: 0.000000, d: 1.000000e+00)
[2025-08-21T21:32:50.737Z] FPR13=00000000000bf100 (f: 782592.000000, d: 3.866518e-318)
[2025-08-21T21:32:50.737Z] FPR14=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR15=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR16=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR17=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR18=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR19=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR20=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR21=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR22=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR23=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR24=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR25=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR26=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR27=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR28=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR29=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR30=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] FPR31=0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2025-08-21T21:32:50.737Z] 
[2025-08-21T21:32:50.737Z] Compiled_method=jdk/internal/foreign/SlicingAllocator.canAllocate(JJ)Z
[2025-08-21T21:32:50.737Z] Target=2_90_20250821_34 (Linux 5.15.0-151-generic)
[2025-08-21T21:32:50.737Z] CPU=ppc64le (32 logical CPUs) (0x1f92e0000 RAM)
[2025-08-21T21:32:50.737Z] ----------- Stack Backtrace -----------
[2025-08-21T21:32:50.737Z]  (0x00007898877C93C0 [<unknown>+0x0])
[2025-08-21T21:32:50.737Z] runJavaThread+0x260 (0x00007898A3D38DC0 [libj9vm29.so+0x18dc0])
[2025-08-21T21:32:50.737Z] javaProtectedThreadProc+0xc0 (0x00007898A3DE2840 [libj9vm29.so+0xc2840])
[2025-08-21T21:32:50.737Z] omrsig_protect+0x168 (0x00007898A81FA378 [libj9prt29.so+0x3a378])
[2025-08-21T21:32:50.737Z] javaThreadProc+0x60 (0x00007898A3DDDA10 [libj9vm29.so+0xbda10])
[2025-08-21T21:32:50.737Z] thread_wrapper+0x190 (0x00007898A818CC00 [libj9thr29.so+0xcc00])
[2025-08-21T21:32:50.737Z]  (0x00007898A89E5804 [libc.so.6+0xb5804])
[2025-08-21T21:32:50.737Z] ---------------------------------------
[2025-08-21T21:32:50.737Z] JVMDUMP039I Processing dump event "gpf", detail "" at 2025/08/21 17:32:45 - please wait.
[2025-08-21T21:32:50.737Z] 000000000068F900: Object neither in heap nor stack-allocated in thread ForkJoinPool-1-worker-25
[2025-08-21T21:32:50.737Z] 000000000068F900:	O-Slot=000078988609B780
[2025-08-21T21:32:50.737Z] 000000000068F900:	O-Slot value=00000000006C9200
[2025-08-21T21:32:50.737Z] 000000000068F900:	PC=00007898877C93C0
[2025-08-21T21:32:50.737Z] 000000000068F900:	framesWalked=1
[2025-08-21T21:32:50.737Z] 000000000068F900:	arg0EA=0000000000775880
[2025-08-21T21:32:50.737Z] 000000000068F900:	walkSP=00000000007757E8
[2025-08-21T21:32:50.737Z] 000000000068F900:	literals=0000000000000000
[2025-08-21T21:32:50.737Z] 000000000068F900:	jitInfo=00007898849BCFB8
[2025-08-21T21:32:50.737Z] 000000000068F900:	method=00000000006C3D40 (jdk/internal/foreign/SlicingAllocator.canAllocate(JJ)Z) (JIT)
[2025-08-21T21:32:50.737Z] 000000000068F900:	stack=0000000000770318-0000000000775B40

IBMJimmyk avatar Aug 22 '25 15:08 IBMJimmyk

could 0x00000000006C9200 be identified to be a J9Class pointer maybe?

zl-wang avatar Aug 22 '25 15:08 zl-wang

That looks to be the case:

(kca) j9c 0x6C9200
  Class Path/Name: {jdk/internal/foreign/BufferStack$PerThread$Frame} J9Class 0x00000000006c9200
      ClassObject: 0x00000000e03f7578
           Access: Final Synchronized  (30)
      ClassLoader: 0x00007898a4140768  Object: 0x00000000e0040330 {jdk/internal/loader/ClassLoaders$BootClassLoader}T
     SubClassLink: 0x0000000000634e00  {java/lang/invoke/LambdaForm$VH/0x00000000880049e0}
    Instance Size: 36 (with header: 52) bytes
        Hierarchy: (depth 1)
                   {java/lang/Object} J9Class 0x00000000000d7700
       Interfaces:
                   {java/lang/foreign/Arena} J9Class 0x00000000006b9b00
                   {java/lang/foreign/SegmentAllocator} J9Class 0x00000000006b9900
                   {java/lang/AutoCloseable} J9Class 0x0000000000164e00
        J9Methods: 0x00000000006c9948 (5 methods)

IBMJimmyk avatar Aug 22 '25 17:08 IBMJimmyk

now, happy chasing down in the complained JIT-ed method when/why obj->clazz is stored into an O-slot.

zl-wang avatar Aug 22 '25 18:08 zl-wang

From the javacore file:

2XHREGISTER      R2: 0000000000000000
2XHREGISTER      R3: 00000000FA2F9360
...
2XHREGISTER      R30: 00000000E047EA60

From the corefile:

0x7898877c93b0 {jdk/.../SlicingAllocator.canAllocate} +15                       10007e80 lwz       r3, 0x10(r30)       //<<---- r3 = 0xfa2f9360
0x7898877c93b4 {jdk/.../SlicingAllocator.canAllocate} +16          -1:1         00008308 tdeqi     r3, 0 // getfield 6 jdk/.../SlicingAllocator.segment (Ljava/.../MemorySegment;)
0x7898877c93b8 {jdk/.../SlicingAllocator.canAllocate} +17                       00004380 lwz       r2, 0(r3)           //<<---- r2 = 0x9c00a1e0
0x7898877c93bc {jdk/.../SlicingAllocator.canAllocate} +18                       2e004254 rlwinm    r2, r2, 0, 0, 0x17  //<<---- r2 = 0x9c00a100
0x7898877c93c0 {jdk/.../SlicingAllocator.canAllocate} +19                       780102e8 ld        r0, 0x178(r2)       //<<----CRASH, r2 = 0x0
(kca) (0xE047EA60+0x10)/X
%7 = 0x00000000e047ea70: fa2f9360

After +15, r3 is loaded with 0xfa2f9360. This matches the final value in the javacore file.

(kca) (0xfa2f9360)/X
%8 = 0x00000000fa2f9360: 9c00a1e0

After +17, r2 is loaded with to 0x9c00a1e0. +18 clears the upper 32 bits of the register and the lowest 8 bits. So r2 is now 0x9c00a100. This does not match the javacore file which thinks that r2 is 0x0. Although, 0xE047EA60 is supposed to be a SlicingAllocator object and it isn't.

(kca) obj 0xE047EA60
Error reading 0x00000000e047ea60 as an object.

So that's another problem.

IBMJimmyk avatar Aug 22 '25 18:08 IBMJimmyk

Looking at the same corefile, r3 is supposed to be an object of type jdk/internal/foreign/NativeMemorySegmentImpl. But it actually points in the middle of the object:

> whatis 0xfa2f9360
        heap #1 - name: Generational@7898a4136cf0
                0xfa2f9360 is within heap segment: f8000000 -- ff330000
                0xfa2f9360 is within an object on the heap:
                        offset 24 within jdk/internal/foreign/NativeMemorySegmentImpl instance @ 0xfa2f9348

This is the same problem Henry saw earlier.

This object is read out of the segment field of a jdk/internal/foreign/SlicingAllocator object. This field is private and final so something might have gone wrong when the object was constructed. Although, the constructor is very simple and segment is filled in based on input to the construtor. So the problem is more likely to be related the method that is calling the constructor. I am currently trying to look in that area to verify if segment was bad from the start or got corrupted later.

IBMJimmyk avatar Aug 26 '25 20:08 IBMJimmyk

Also, I'm looking at the result of !stackslots 0x68F900 when looking at a corefile and I'm trying to understand what I'm seeing:

<68f900> JIT frame: bp = 0x0000000000775858, pc = 0x00007898877C93C0, unwindSP = 0x0000000000775810, cp = 0x00000000006C3A90, arg0EA = 0x0000000000775880, jitInfo = 0x00007898849BCFB8
<68f900>        Method: jdk/internal/foreign/SlicingAllocator.canAllocate(JJ)Z !j9method 0x00000000006C3D40
<68f900>        Bytecode index = 1, inlineDepth = 0, PC offset = 0x000000000000004C
<68f900>        stackMap=0x00007898849BD0EC, slots=I16(0x0005) parmBaseOffset=I16(0x0028), parmSlots=U16(0x0001), localBaseOffset=I16(0xFFF8)
<68f900>        Described JIT args starting at 0x0000000000775880 for U16(0x0001) slots
<68f900>                I-Slot: : a0[0x0000000000775880] = 0x00000000E047D270
<68f900>        Described JIT temps starting at 0x0000000000775850 for IDATA(0x0000000000000001) slots
<68f900>                I-Slot: : t0[0x0000000000775850] = 0x00000000E0191C08
<68f900>        JIT-RegisterMap = UDATA(0x0000000010000002)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B788] = UDATA(0x00000000E0477448) (jit_r31)
<68f900>                JIT-RegisterMap-O-Slot[0x000078988609B780] = 0x00000000006C9200 (jit_r30)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B778] = UDATA(0x0000000000000008) (jit_r29)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B770] = UDATA(0x0000000000000001) (jit_r28)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B768] = UDATA(0x00000000E0477448) (jit_r27)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B760] = UDATA(0x00000000E007C3A8) (jit_r26)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B758] = UDATA(0x00000000E007C3A8) (jit_r25)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B750] = UDATA(0x00000000E03F7578) (jit_r24)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B748] = UDATA(0x00000000E03F7868) (jit_r23)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B740] = UDATA(0x00000000E03F5648) (jit_r22)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B738] = UDATA(0x0000000000000000) (jit_r21)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B730] = UDATA(0x0000000000000000) (jit_r20)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B728] = UDATA(0x0000000000000000) (jit_r19)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B720] = UDATA(0x0000000000000000) (jit_r18)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B718] = UDATA(0xFFFFFFFFFFFFFFFF) (jit_r17)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B710] = UDATA(0x00007898848B0038) (jit_r16)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B708] = UDATA(0x000000000068F900) (jit_r15)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B700] = UDATA(0x0000000000000000) (jit_r14)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B6F8] = UDATA(0x0000000000000000) (jit_r13)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B6F0] = UDATA(0x0000000000000078) (jit_r12)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B6E8] = UDATA(0x00007898A2F9D908) (jit_r11)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B6E0] = UDATA(0x0000000000775788) (jit_r10)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B6D8] = UDATA(0x0000000000000020) (jit_r9)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B6D0] = UDATA(0x00000000FF276100) (jit_r8)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B6C8] = UDATA(0x0000000000600000) (jit_r7)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B6C0] = UDATA(0x00000000E0000000) (jit_r6)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B6B8] = UDATA(0x00000000E03F7868) (jit_r5)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B6B0] = UDATA(0x00000000E0477448) (jit_r4)
<68f900>                JIT-RegisterMap-O-Slot[0x000078988609B6A8] = 0x00000000F8004A20 (jit_r3)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B6A0] = UDATA(0x000000000077B800) (jit_r2)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B698] = UDATA(0x0000000000000000) (jit_r1)
<68f900>                JIT-RegisterMap-I-Slot[0x000078988609B690] = UDATA(0x00007898877CADF0) (jit_r0)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B710] = UDATA(0x00007898848B0038) (jit_r16)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B718] = UDATA(0xFFFFFFFFFFFFFFFF) (jit_r17)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B720] = UDATA(0x0000000000000000) (jit_r18)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B728] = UDATA(0x0000000000000000) (jit_r19)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B730] = UDATA(0x0000000000000000) (jit_r20)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B738] = UDATA(0x0000000000000000) (jit_r21)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B740] = UDATA(0x00000000E03F5648) (jit_r22)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B748] = UDATA(0x00000000E03F7868) (jit_r23)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B750] = UDATA(0x00000000E03F7578) (jit_r24)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B758] = UDATA(0x00000000E007C3A8) (jit_r25)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B760] = UDATA(0x00000000E007C3A8) (jit_r26)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B768] = UDATA(0x00000000E0477448) (jit_r27)
<68f900>        JIT-Frame-RegisterMap[0x000078988609B770] = UDATA(0x0000000000000001) (jit_r28)
<68f900>        JIT-Frame-RegisterMap[0x0000000000775830] = UDATA(0x0000000000000008) (jit_r29)
<68f900>        JIT-Frame-RegisterMap[0x0000000000775838] = UDATA(0x0000000000000008) (jit_r30)
<68f900>        JIT-Frame-RegisterMap[0x0000000000775840] = UDATA(0x00000000E047EA38) (jit_r31)

I see two lines for jit_r30.

The first one is the bad O slot:

<68f900>                JIT-RegisterMap-O-Slot[0x000078988609B780] = 0x00000000006C9200 (jit_r30)

That's a J9 class not an object.

The second is the constant 8 which is the actual value in r30:

<68f900>        JIT-Frame-RegisterMap[0x0000000000775838] = UDATA(0x0000000000000008) (jit_r30)

What is the first one supposed to represent and when might it be written to?

I do see an area in the code where r30 is given the value of 0x6c9210:

0x7898877cadac {TestBufferStackStress.lambda$stress$1} +204                      90014ee8 ld        r2, 0x190(r14) <<< ^+426 ^+619 ^+694 // Try{
0x7898877cadb0 {TestBufferStackStress.lambda$stress$1} +205         -1:50        00008208 tdeqi     r2, 0 // invokeinterface 37 {java/.../Arena.allocate(Ljava/.../MemoryLayout;)Ljava/.../MemorySegment;}
0x7898877cadb4 {TestBufferStackStress.lambda$stress$1} +206                      00004360 ori       r3, r2, 0
0x7898877cadb8 {TestBufferStackStress.lambda$stress$1} +207                      0000803c lis       r4, 0 CONST* 0x627c30 = 0xe0477448 Ptr
0x7898877cadbc {TestBufferStackStress.lambda$stress$1} +208                      00008460 ori       r4, r4, 0
0x7898877cadc0 {TestBufferStackStress.lambda$stress$1} +209                      c6078478 sldi      r4, r4, 0x20
0x7898877cadc4 {TestBufferStackStress.lambda$stress$1} +210                      62008464 oris      r4, r4, 0x62
0x7898877cadc8 {TestBufferStackStress.lambda$stress$1} +211                      307c8460 ori       r4, r4, 0x7c30
0x7898877cadcc {TestBufferStackStress.lambda$stress$1} +212                      0000e4eb ld        r31, 0(r4)
0x7898877cadd0 {TestBufferStackStress.lambda$stress$1} +213                      0000e463 ori       r4, r31, 0
0x7898877cadd4 {TestBufferStackStress.lambda$stress$1} +214                      0000c283 lwz       r30, 0(r2)
0x7898877cadd8 {TestBufferStackStress.lambda$stress$1} +215                      2e00de57 rlwinm    r30, r30, 0, 0, 0x17
0x7898877caddc {TestBufferStackStress.lambda$stress$1} +216                      6c00603d lis       r11, 0x6c CONST 0x006c9200 J9Class - {jdk/internal/foreign/BufferStack$PerThread$Frame}
0x7898877cade0 {TestBufferStackStress.lambda$stress$1} +217                      00926b61 ori       r11, r11, 0x9200
0x7898877cade4 {TestBufferStackStress.lambda$stress$1} +218                      40583e7c cmpld     r30, r11
0x7898877cade8 {TestBufferStackStress.lambda$stress$1} +219                      98088240 bne       0x7898877cb680 C>> +769
0x7898877cadec {TestBufferStackStress.lambda$stress$1} +220         -1:50        95660048 bl        0x7898877d1480 {java/.../SegmentAllocator.allocate} +2    // invokeinterface 37 {java/.../Arena.allocate(Ljava/.../MemoryLayout;)Ljava/.../MemorySegment;}

At +204, r2 gets the value:

(kca) (0x775860+0x190)/llx
%3 = 0x00000000007759f0: 00000000f8004a68

At +214, r30 gets:

(kca) (0xf8004a68)/X
%4 = 0x00000000f8004a68: 006c9210

And the lower 8 bits are cleared at +215 to give r30 the value 0x6c9200.

IBMJimmyk avatar Aug 26 '25 20:08 IBMJimmyk

I think: the first section named as JIT-RegisterMap is for the current method in that particular frame, as described by the current GCStackMap's registerMap at that GCPoint. each register's value, if you went back to the GCPoint, is shown (whether it is an O or I slot is also flagged).

the second section named as JIT-Frame-RegisterMap is for the caller, if you walked back up the call chain. It only contains the preserved registers, and you can see this current method only touched r29 - r31 (and saved them in the current frame). So that, you definitely know the bad r30 value (in the first section) was established by this current method, instead of coming from the caller(s) ... upper chain.

zl-wang avatar Aug 26 '25 22:08 zl-wang

so, somehow J9Class pointer established into r30 in this method was misunderstood as a java reference (collectable). that is the bug here.

zl-wang avatar Aug 26 '25 22:08 zl-wang

now, this might be relatively straight-forward to pin down, hopefully traceFull of this method didn't change much from run to run. why and how is a J9Class considered collectable. I suspected it might intend to be the corresponding in-heap java/lang/Class object.

zl-wang avatar Aug 27 '25 10:08 zl-wang

as you walk back up the stack, the second section contents become the caller's first section, and a further second section is prepared for caller's caller. current I-slot could be an O-slot in the caller (differently described by caller's registerMap).

zl-wang avatar Aug 27 '25 11:08 zl-wang

Getting a tracefull log sounds like a good next step. I will try that and see what it says.

IBMJimmyk avatar Aug 27 '25 12:08 IBMJimmyk

And the lower 8 bits are cleared at +215 to give r30 the value 0x6c9200.

that is a frequent pattern of retrieving VFT symbol from an object. i don't expect it can be marked as collectable (otherwise, we probably crashed all over the places). you need to check back to the GCpoint (in this case, 0x00007898877C93C0) where r30 came from at that point.

zl-wang avatar Aug 27 '25 13:08 zl-wang

0x00007898877C93C0 is where the segmentation fault occurred. r30 is changed a little bit before that.

0x7898877c939c {jdk/.../SlicingAllocator.canAllocate} +10                       2800cefb std       r30, 0x28(r14)  //<<--r30 was 0x8 here
0x7898877c93a0 {jdk/.../SlicingAllocator.canAllocate} +11                       3000eefb std       r31, 0x30(r14)
0x7898877c93a4 {jdk/.../SlicingAllocator.canAllocate} +12                       00007e60 ori       r30, r3, 0   //<<--r30 set to 0xe047ea60 here
0x7898877c93a8 {jdk/.../SlicingAllocator.canAllocate} +13                       60008ef8 std       r4, 0x60(r14)
0x7898877c93ac {jdk/.../SlicingAllocator.canAllocate} +14                       0000bd60 ori       r29, r5, 0
0x7898877c93b0 {jdk/.../SlicingAllocator.canAllocate} +15                       10007e80 lwz       r3, 0x10(r30)
0x7898877c93b4 {jdk/.../SlicingAllocator.canAllocate} +16          -1:1         00008308 tdeqi     r3, 0 // getfield 6 jdk/.../SlicingAllocator.segment (Ljava/.../MemorySegment;)
0x7898877c93b8 {jdk/.../SlicingAllocator.canAllocate} +17                       00004380 lwz       r2, 0(r3)
0x7898877c93bc {jdk/.../SlicingAllocator.canAllocate} +18                       2e004254 rlwinm    r2, r2, 0, 0, 0x17
0x7898877c93c0 {jdk/.../SlicingAllocator.canAllocate} +19                       780102e8 ld        r0, 0x178(r2)  //<<--segfault here

IBMJimmyk avatar Aug 27 '25 19:08 IBMJimmyk

a little bit confusing: the r30 value at crash doesn't match with the stackWalking output. could you show the r3 MemorySegment object in full? also, what is r2 at crash (it is supposed to be MemorySegment's J9Class)?

I am wondering if all these are related to the fact that, in java FFM (Foreign Function and Memory) implementation, MemorySegment often times points to native memory (instead of java heap), such that they cannot be compressed for example. somewhere, things are screwed up. similar to DirectByteBuffer ... @TobiAjila

zl-wang avatar Aug 27 '25 20:08 zl-wang

r3 holds a bad value. It holds the value of 0xfa2f9360 and that address is in the middle of a NativeMemorySegmentImpl object.

> whatis 0xfa2f9360
        heap #1 - name: Generational@7898a4136cf0
                0xfa2f9360 is within heap segment: f8000000 -- ff330000
                0xfa2f9360 is within an object on the heap:
                        offset 24 within jdk/internal/foreign/NativeMemorySegmentImpl instance @ 0xfa2f9348

For r2, the javacore and corefile disagree on the value but both are bad. The javacore file says r2 is 0x0. But based on the value of r3, it shouldn't be 0.

0x7898877c93b8 {jdk/.../SlicingAllocator.canAllocate} +17                       00004380 lwz       r2, 0(r3)           //<<---- r2 = 0x9c00a1e0
0x7898877c93bc {jdk/.../SlicingAllocator.canAllocate} +18                       2e004254 rlwinm    r2, r2, 0, 0, 0x17  //<<---- r2 = 0x9c00a100

> whatis 0x9c00a100
        heap #1 - name: Generational@7898a4136cf0
                0x9c00a100 is not within this heap.

This failure is with compressedrefs but the shift value is 0.

Also, I have been trying to get a tracefile but am running into unexpected issues. The tracefile I get is empty and the verbose log says no method from the class TestBufferStackStress are compiled even under count=0,disableasynccompilation. This occurs both when running via grinder and when I run the test off a local machine. I am currently investigating why the option doesn't seem to work.

IBMJimmyk avatar Aug 27 '25 20:08 IBMJimmyk

hopefully, everywhere knows to load NativeMemory pointer with ld, instead of lwz and decompress your showing indicated r30 (i.e. incoming argument r3) contains the bad MemorySegment reference already. the problem was not in this canAllocate method itself. need to look at the caller at least.

zl-wang avatar Aug 27 '25 21:08 zl-wang