openj9 icon indicating copy to clipboard operation
openj9 copied to clipboard

openj9 java11 AsyncGetCallTrace segfault, OOM, assert

Open lijingmu opened this issue 2 years ago • 7 comments

Java -version output

openjdk version "11.0.12" 2021-07-20 IBM Semeru Runtime Open Edition 11.0.12.0 (build 11.0.12+7) Eclipse OpenJ9 VM 11.0.12.0 (build openj9-0.27.0, JRE 11 Linux amd64-64-Bit Compressed References 20210729_228 (JIT enabled, AOT enabled) OpenJ9 - 1851b0074 OMR - 9db1c870d JCL - 21849e2ca0 based on jdk-11.0.12+7)

Summary of problem

In our customer env, they encountered "java/lang/OutOfMemoryError" "native memory exhausted". It happened 13 times in 8 hours. Java process is running in container in AWS EKS cluster.

Diagnostic files

There are jitdump, javacore, heapdum in IBM Box https://ibm.box.com/s/3ej7gzrpevtnyrs5hyj7bxtk5w40irfp NATIVEMEMINFO, JIT used a lot of memory.

2MEMUSER       +--JIT: 278,216,584 bytes / 352 allocations
2MEMUSER       |  |
3MEMUSER       |  +--JIT Code Cache: 268,435,456 bytes / 1 allocation
2MEMUSER       |  |
3MEMUSER       |  +--JIT Data Cache: 2,097,216 bytes / 1 allocation
2MEMUSER       |  |
3MEMUSER       |  +--Other: 7,683,912 bytes / 350 allocations

MEMINFO:

1STSEGTYPE     JIT Code Cache
NULL           segment            start              alloc              end                type       size
1STSEGMENT     0x00007F9CC415CCD8 0x00007F9CA8F9E000 0x00007F9CA90835B0 0x00007F9CB8F9E000 0x00000068 0x0000000010000000
NULL
1STSEGTOTAL    Total memory:                   268435456 (0x0000000010000000)
1STSEGINUSE    Total memory in use:               939440 (0x00000000000E55B0)
1STSEGFREE     Total memory free:              267496016 (0x000000000FF1AA50)
1STSEGLIMIT    Allocation limit:               268435456 (0x0000000010000000)
NULL
1STSEGTYPE     JIT Data Cache
NULL           segment            start              alloc              end                type       size
1STSEGMENT     0x00007F9CC415D168 0x00007F9CA2DFF030 0x00007F9CA2FFF030 0x00007F9CA2FFF030 0x00000048 0x0000000000200000
NULL
1STSEGTOTAL    Total memory:                     2097152 (0x0000000000200000)
1STSEGINUSE    Total memory in use:              2097152 (0x0000000000200000)
1STSEGFREE     Total memory free:                      0 (0x0000000000000000)
1STSEGLIMIT    Allocation limit:               402653184 (0x0000000018000000)

Error message

We also observed some message in console when java crashed

2022-07-25T14:09:24Z [queryScriptsThread-1] INFO  TestListLoader:queryTests - X-TransactionID: 09a08c25-5ffa-4c6b-8ade-239f9931e537, QueryTime From: 1658757984432, UTC Time: 2022-07-25T14:06:24.432
Unhandled exception
Type=Segmentation error vmState=0x0002000f
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007FDDBCFA3DD0 Handler2=00007FDDBCD03AE0 InaccessibleAddress=0000003000000028
RDI=00007FDD9CF0D9E0 RSI=0000000024600002 RAX=0000003000000020 RBX=00007FDD85698910
RCX=00007FDD9CF0DA00 RDX=0000000000000000 R8=0000000000000000 R9=0000000000000000
R10=00007FDDB6D5FE00 R11=00007FDD103B7899 R12=0000000000000000 R13=000000000086FD00
R14=0000000004600002 R15=00007FDDB90E7018
RIP=00007FDDBCFE2C3A GS=0000 FS=0000 RSP=00007FDD85698890
EFlags=0000000000010202 CS=0033 RBP=0000000000000000 ERR=0000000000000004
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=0000003000000028
xmm0 0000000300000003 (f: 3.000000, d: 6.365987e-314)
xmm1 0000000000000003 (f: 3.000000, d: 1.482197e-323)
xmm2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm6 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm7 0000000000000094 (f: 148.000000, d: 7.312172e-322)
xmm8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm11 ca62c1d6ca62c1d6 (f: 3395469824.000000, d: -2.193092e+50)
xmm12 0000000000000001 (f: 1.000000, d: 4.940656e-324)
xmm13 08090a0b0c0d0e0f (f: 202182160.000000, d: 5.924543e-270)
xmm14 597a2efb7188e8a2 (f: 1904797824.000000, d: 1.081798e+123)
xmm15 35a343df8901ff20 (f: 2298609408.000000, d: 2.574558e-50)
Module=/opt/ibm/openjdk/lib/default/libj9vm29.so
Module_base_address=00007FDDBCF65000
Target=2_90_20210729_228 (Linux 5.15.32-flatcar)
CPU=amd64 (32 logical CPUs) (0x1ec389f000 RAM)
----------- Stack Backtrace -----------
(0x00007FDDBCFE2C3A [libj9vm29.so+0x7dc3a])
(0x00007FDDB6D5FEC6 [libj9gc29.so+0x41ec6])
(0x00007FDDB6D57D9D [libj9gc29.so+0x39d9d])
(0x00007FDDB6D56AAF [libj9gc29.so+0x38aaf])
(0x00007FDDB6D59622 [libj9gc29.so+0x3b622])
(0x00007FDDB6EB248B [libj9gc29.so+0x19448b])
(0x00007FDDB6E4FA1E [libj9gc29.so+0x131a1e])
(0x00007FDDB6E8393D [libj9gc29.so+0x16593d])
(0x00007FDDB6EB6D35 [libj9gc29.so+0x198d35])
(0x00007FDDB6EB8A70 [libj9gc29.so+0x19aa70])
(0x00007FDDB6E30566 [libj9gc29.so+0x112566])
(0x00007FDDB6F163FD [libj9gc29.so+0x1f83fd])
(0x00007FDDB6EF44D8 [libj9gc29.so+0x1d64d8])
(0x00007FDDB6E54795 [libj9gc29.so+0x136795])
(0x00007FDDB6E54966 [libj9gc29.so+0x136966])
(0x00007FDDB6E53AB5 [libj9gc29.so+0x135ab5])
(0x00007FDDB6E5956B [libj9gc29.so+0x13b56b])
(0x00007FDDB6D65789 [libj9gc29.so+0x47789])
(0x00007FDDBD00883B [libj9vm29.so+0xa383b])
(0x00007FDDBCFF4EF5 [libj9vm29.so+0x8fef5])
(0x00007FDDBD0A0192 [libj9vm29.so+0x13b192])
---------------------------------------
JVMDUMP039I Processing dump event "gpf", detail "" at 2022/07/25 14:09:24 - please wait.

lijingmu avatar Aug 02 '22 05:08 lijingmu

Do you have any system core files? I don't think we can figure anything out from the diagnostic files already provided. If you have a system core, it needs to be processed with jextract or jpackcore (https://www.eclipse.org/openj9/docs/tool_jextract/)

Is it possible to use the latest JVM? Which may provide more information about the native stack, assuming the problem still occurs when using the latest. https://developer.ibm.com/languages/java/semeru-runtimes/downloads/?version=11

pshipton avatar Aug 03 '22 04:08 pshipton

@pshipton I uploaded a zip file including all of the jitdump, headump and javacore https://ibm.box.com/s/arhondqtf9iodxmvlyfah8ewbvrrep1u. We did not see core dumps. We are using -XX:+HeapDumpOnOutOfMemory -XX:HeapDumpPath=xxx.

Another thing strange is that in Current thread, I see Unable to walk in-flight data on call stack Unable to obtain lock context information as below.

1XMCURTHDINFO  Current thread
3XMTHREADINFO      "testScheduleThread-1" J9VMThread:0x0000000000109400, omrthread_t:0x00007F69750F8D78, java/lang/Thread:0x00000000F85A1710, state:R, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x20, isDaemon:false)
3XMTHREADINFO1            (native thread ID:0x38, native priority:0x5, native policy:UNKNOWN, vmstate:R, vm thread flags:0x00001020)
3XMTHREADINFO2            (native stack address range from:0x00007F691ABE0000, to:0x00007F691AC20000, size:0x40000)
3XMCPUTIME               CPU usage total: 12.526471354 secs, current category="Application"
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=0 (0x0)
1INTERNAL                    Unable to obtain lock context information
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at java/lang/Throwable.fillInStackTrace(Native Method)
4XESTACKTRACE                at java/lang/Throwable.<init>(Throwable.java:90(Compiled Code))
4XESTACKTRACE                at java/lang/Throwable.<init>(Throwable.java:101(Compiled Code))
4XESTACKTRACE                at java/lang/Exception.<init>(Exception.java:66(Compiled Code))
4XESTACKTRACE                at java/lang/RuntimeException.<init>(RuntimeException.java:62(Compiled Code))
1INTERNAL                    Unable to walk in-flight data on call stack

lijingmu avatar Aug 03 '22 04:08 lijingmu

A heapdump isn't useful for "native memory exhausted". The heapdump contains the object heap, but the OOM is for native memory, which is outside of the object heap.

Besides the "java/lang/OutOfMemoryError" "native memory exhausted" and gpf, I also see an assert openj9/runtime/util/mthutil.c:600: ((((uintptr_t)-1) != methodIndex)).

Plus I notice this has something to do with AsyncGetCallTrace. The 0.27 release is too old, there are fixes related to AsyncGetCallTrace in newer versions (0.31 and later), such as https://github.com/eclipse-openj9/openj9/pull/14308 which fixes https://github.com/eclipse-openj9/openj9/issues/13838. Pls update to the latest version. If there are still problems with the newer versions, we'll need to figure out how to get a system core file.

pshipton avatar Aug 03 '22 11:08 pshipton

This is the decoded native stack trace for the GPF:

` CPU=amd64 (32 logical CPUs) (0x1ec389f000 RAM) ----------- Stack Backtrace ----------- (0x00007FDDBCFE2C3A [libj9vm29.so+0x7dc3a]) : 0x7dc3a <walkStackFrames+1386> [../../../../../openj9/runtime/vm/swalk.c:1134]

(0x00007FDDB6D5FEC6 [libj9gc29.so+0x41ec6]) : 0x41ec6 <GC_VMThreadStackSlotIterator [../../../../../openj9/runtime/gc_structs/VMThreadStackSlotIterator.cpp:115]

(0x00007FDDB6D57D9D [libj9gc29.so+0x39d9d]) : 0x39d9d <MM_RootScanner [../../../../../openj9/runtime/gc_base/RootScanner.cpp:521]

(0x00007FDDB6D56AAF [libj9gc29.so+0x38aaf]) : 0x38aaf <MM_RootScanner [../../../../../openj9/runtime/gc_base/RootScanner.cpp:488]

(0x00007FDDB6D59622 [libj9gc29.so+0x3b622]) : 0x3b622 <MM_RootScanner [../../../../../openj9/runtime/gc_base/RootScanner.cpp:921]

(0x00007FDDB6EB248B [libj9gc29.so+0x19448b]) : 0x19448b <MM_Scavenger [../../../../../../openj9/runtime/gc_glue_java/ScavengerRootScanner.hpp:209]

(0x00007FDDB6E4FA1E [libj9gc29.so+0x131a1e]) : 0x131a1e <MM_ParallelDispatcher [../../../../../../omr/gc/base/ParallelDispatcher.cpp:589]

(0x00007FDDB6E8393D [libj9gc29.so+0x16593d]) : 0x16593d <MM_Scavenger [../../../../../../omr/gc/base/standard/Scavenger.cpp:581]

(0x00007FDDB6EB6D35 [libj9gc29.so+0x198d35]) : 0x198d35 <MM_Scavenger [../../../../../../omr/gc/base/standard/Scavenger.cpp:4199]

(0x00007FDDB6EB8A70 [libj9gc29.so+0x19aa70]) : 0x19aa70 <MM_Scavenger [../../../../../../omr/gc/base/standard/Scavenger.cpp:4729]

(0x00007FDDB6E30566 [libj9gc29.so+0x112566]) : 0x112566 <MM_Collector [../../../../../../omr/gc/base/Collector.cpp:502]

(0x00007FDDB6F163FD [libj9gc29.so+0x1f83fd]) : 0x1f83fd <MM_MemorySubSpaceSemiSpace [../../../../../../omr/gc/base/AllocateDescription.hpp:152]

(0x00007FDDB6EF44D8 [libj9gc29.so+0x1d64d8]) : 0x1d64d8 <MM_MemorySubSpaceGeneric [../../../../../../omr/gc/base/MemorySubSpaceGeneric.cpp:377]

(0x00007FDDB6E54795 [libj9gc29.so+0x136795]) : 0x136795 <MM_TLHAllocationSupport [../../../../../../omr/gc/base/TLHAllocationSupport.cpp:239]

(0x00007FDDB6E54966 [libj9gc29.so+0x136966]) : 0x136966 <MM_TLHAllocationSupport [../../../../../../omr/gc/base/TLHAllocationSupport.hpp:115]

(0x00007FDDB6E53AB5 [libj9gc29.so+0x135ab5]) : 0x135ab5 <MM_TLHAllocationInterface [../../../../../../omr/gc/base/TLHAllocationInterface.cpp:193]

(0x00007FDDB6E5956B [libj9gc29.so+0x13b56b]) : 0x13b56b <OMR_GC_AllocateObject(OMR_VMThread*, MM_AllocateInitialization*)+523> [../../../../../../omr/gc/base/AllocateInitialization.hpp:201]

(0x00007FDDB6D65789 [libj9gc29.so+0x47789]) : 0x47789 <J9AllocateObject(J9VMThread*, J9Class*, uintptr_t)+441> [../../../../../openj9/runtime/gc_modron_startup/mgcalloc.cpp:409]

(0x00007FDDBD00883B [libj9vm29.so+0xa383b]) : 0xa383b <VM_BytecodeInterpreterCompressed [../../../../../openj9/runtime/vm/BytecodeInterpreter.hpp:357]

(0x00007FDDBCFF4EF5 [libj9vm29.so+0x8fef5]) : 0x8fef5 <bytecodeLoopCompressed(J9VMThread*)+149> [../../../../../openj9/runtime/vm/BytecodeInterpreter.inc:113]

(0x00007FDDBD0A0192 [libj9vm29.so+0x13b192]) : 0x13b192 <c_cInterpreter+82> [/home/jenkins/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-x64-openj9/workspace/build/src/build/linux-x86_64-normal-server-release/vm/runtime/vm/xcinterp.s:161] ~ ` It means we found a bogus object pointer in Java stack (and likely need our JIT team to investigate further) and we need the system core file to get the details.

manqingl avatar Aug 03 '22 13:08 manqingl

hello @pshipton and @manqingl, thanks for your analysis. Do you think we need to replace JVM arguments -XX:+HeapDumpOnOutOfMemory -XX:HeapDumpPath=xxx with -Xdump:system+java+heap+jit:file=./logs? Or any other arguments do you suggest use to collect enough info if it occurs again? Since it is a customer environment, I hope that we can provide one build with all the required updates, and also the latest JDK if you think it is required.

lijingmu avatar Aug 03 '22 13:08 lijingmu

This looks like a duplicate of https://github.com/eclipse-openj9/openj9/issues/13838, which is fixed in 0.31 or later. Pls update the JVM. There is no point in trying to collect more diagnostics on a JVM which has this known problem.

pshipton avatar Aug 03 '22 13:08 pshipton

OK, upgrade JDK to the latest.

openjdk version "11.0.15" 2022-04-19
IBM Semeru Runtime Open Edition 11.0.15.0 (build 11.0.15+10)
Eclipse OpenJ9 VM 11.0.15.0 (build openj9-0.32.0, JRE 11 Linux amd64-64-Bit Compressed References 20220422_425 (JIT enabled, AOT enabled)

lijingmu avatar Aug 04 '22 01:08 lijingmu