openj9 icon indicating copy to clipboard operation
openj9 copied to clipboard

jtreg java/lang/invoke/VarHandles/VarHandleTestAccess Unexpected exit from test [exit code: 137]

Open pshipton opened this issue 3 years ago • 58 comments

Blocking promotion of jdk17.

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_OpenJDK17/9 jdk_lang_1 JVM_OPTIONS: -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage -XX:-UseCompressedOops java/lang/invoke/VarHandles/VarHandleTestAccessChar.java.VarHandleTestAccessChar java/lang/invoke/VarHandles/VarHandleTestAccessInt.java.VarHandleTestAccessInt java/lang/invoke/VarHandles/VarHandleTestAccessLong.java.VarHandleTestAccessLong java/lang/invoke/VarHandles/VarHandleTestAccessShort.java.VarHandleTestAccessShort

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_OpenJDK17/9/openjdk_test_output.tar.gz

Failed. Unexpected exit from test [exit code: 137]

pshipton avatar Aug 24 '21 11:08 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/2/ jdk_lang_0 java/lang/invoke/VarHandles/VarHandleTestAccessChar.java.VarHandleTestAccessChar java/lang/invoke/VarHandles/VarHandleTestAccessFloat.java.VarHandleTestAccessFloat jdk_lang_1 java/lang/invoke/VarHandles/VarHandleTestAccessChar.java.VarHandleTestAccessChar java/lang/invoke/VarHandles/VarHandleTestAccessDouble.java.VarHandleTestAccessDouble java/lang/invoke/VarHandles/VarHandleTestAccessInt.java.VarHandleTestAccessInt java/lang/invoke/VarHandles/VarHandleTestAccessShort.java.VarHandleTestAccessShort

pshipton avatar Aug 24 '21 12:08 pshipton

This Unexpected exit from test [exit code: 137] wasn't observed at internal tests such as Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Personal/10/. Could be machine/setup related.

JasonFengJ9 avatar Aug 24 '21 13:08 JasonFengJ9

I believe internal testing is running obsolete jdk17 code without OJDK MH enabled.

pshipton avatar Aug 24 '21 14:08 pshipton

Internal builds are using JCL - 363eb1d6e4 based on jdk-17+18), while the (non-promoted) latest is jdk-17+35 with OJDK MH enabled.

pshipton avatar Aug 24 '21 14:08 pshipton

Are internal builds for jdk17 even running? The last one I see is from July 13.

pshipton avatar Aug 24 '21 14:08 pshipton

Please check Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Personal/10/console which is built/test against JDK17 (openj9-staging). Note: this is one of personal builds which have been used to test JDK16/17 recently due to internal nightly/weekly builds not launched since 0.27 release cycle.

13:27:57  openjdk version "17-internal" 2021-09-14
13:27:57  OpenJDK Runtime Environment (build 17-internal+0-adhoc.jenkins.BuildJDK17ppc64lelinuxPersonal)
13:27:57  Eclipse OpenJ9 VM (build master-5d2495e99a7, JRE 17 Linux ppc64le-64-Bit Compressed References 20210823_12 (JIT enabled, AOT enabled)
13:27:57  OpenJ9   - 5d2495e99a7
13:27:57  OMR      - 6eec759cd2d
13:27:57  JCL      - b40e984c3a1 based on jdk-17+35)

JasonFengJ9 avatar Aug 24 '21 14:08 JasonFengJ9

We've promoted the jdk17 content regardless of the failures, so this isn't blocking anything any more, except for the tests being green.

pshipton avatar Aug 24 '21 16:08 pshipton

and any future jdk17 acceptance build but we may already have the final level considering it's after the OpenJDK Final Release Candidate date.

pshipton avatar Aug 24 '21 16:08 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/3/

pshipton avatar Aug 25 '21 11:08 pshipton

I've tried to reproduce this in the runtimes grinder with jdk_lang_0 (https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/17831/consoleFull) and jdk_lang_1 (https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/17818/console), but I haven't seen this error so far. The test either passes, or I get something similar to what happened here (https://github.com/eclipse-openj9/openj9/issues/13367#issuecomment-905425429):

TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details

midronij avatar Sep 01 '21 06:09 midronij

Not sure if this is vm or jit so I've set both.

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/17/

pshipton avatar Sep 08 '21 21:09 pshipton

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/18

pshipton avatar Sep 09 '21 13:09 pshipton

Observed at JDK17 0.28 release zLinux build at job/Test_openjdk17_j9_sanity.openjdk_s390x_linux/49/

openjdk version "17-beta" 2021-09-14
IBM Semeru Runtime Open Edition 17.0.0.0-m1 (build 17-beta+35-202109100251)
Eclipse OpenJ9 VM 17.0.0.0-m1 (build openj9-0.28.0-m1, JRE 17 Linux s390x-64-Bit Compressed References 20210909_25 (JIT enabled, AOT enabled)
OpenJ9   - 5eef84fb1
OMR      - 0bf455f83
JCL      - b6c1a570c77 based on jdk-17+35)

[2021-09-10T04:51:48.484Z] Running test jdk_lang_1 ...
[2021-09-10T04:51:48.484Z] ===============================================
[2021-09-10T04:51:48.484Z] jdk_lang_1 Start Time: Thu Sep  9 21:51:46 2021 Epoch Time (ms): 1631249506072
[2021-09-10T04:51:48.484Z] "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_s390x_linux/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_s390x_linux/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done";
[2021-09-10T04:51:48.484Z] JVMSHRC005I No shared class caches available
[2021-09-10T04:51:48.484Z] JVMSHRC005I No shared class caches available
[2021-09-10T04:51:48.484Z] cache cleanup done
[2021-09-10T04:51:48.484Z] variation: -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage Mode650
[2021-09-10T04:51:48.484Z] JVM_OPTIONS:  -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage -XX:-UseCompressedOops 

[2021-09-10T05:11:28.027Z] --------------------------------------------------
[2021-09-10T05:11:28.027Z] TEST: java/lang/invoke/VarHandles/VarHandleTestByteArrayAsLong.java

[2021-09-10T05:11:28.036Z] ACTION: testng -- Failed. Unexpected exit from test [exit code: 137]
[2021-09-10T05:11:28.036Z] REASON: User specified action: run testng/othervm/timeout=360 -Diters=20000 -XX:-TieredCompilation VarHandleTestByteArrayAsLong 
[2021-09-10T05:11:28.036Z] TIME:   38.59 seconds
[2021-09-10T05:11:28.036Z] messages:
[2021-09-10T05:11:28.037Z] command: testng -Diters=20000 -XX:-TieredCompilation VarHandleTestByteArrayAsLong
[2021-09-10T05:11:28.037Z] reason: User specified action: run testng/othervm/timeout=360 -Diters=20000 -XX:-TieredCompilation VarHandleTestByteArrayAsLong 
[2021-09-10T05:11:28.037Z] Mode: othervm [/othervm specified]
[2021-09-10T05:11:28.037Z] elapsed time (seconds): 38.59
[2021-09-10T05:11:28.037Z] configuration:

[2021-09-10T05:11:28.039Z] TEST RESULT: Failed. Unexpected exit from test [exit code: 137]
[2021-09-10T05:11:28.039Z] --------------------------------------------------
[2021-09-10T05:47:39.245Z] Test results: passed: 801; failed: 1
[2021-09-10T05:47:46.291Z] Report written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_s390x_linux/jvmtest/openjdk/report/html/report.html
[2021-09-10T05:47:46.291Z] Results written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_s390x_linux/aqa-tests/TKG/output_16312461891078/jdk_lang_1/work
[2021-09-10T05:47:46.292Z] Error: Some tests failed or other problems occurred.
[2021-09-10T05:47:46.292Z] 
[2021-09-10T05:47:46.292Z] jdk_lang_1_FAILED

Rerun in Grinder

Also at JDK17 0.28 release aarch64 build job/Test_openjdk17_j9_sanity.openjdk_aarch64_linux/43/

openjdk version "17-beta" 2021-09-14
IBM Semeru Runtime Open Edition 17.0.0.0-m1 (build 17-beta+35-202109100229)
Eclipse OpenJ9 VM 17.0.0.0-m1 (build openj9-0.28.0-m1, JRE 17 Linux aarch64-64-Bit Compressed References 20210910_26 (JIT enabled, AOT enabled)
OpenJ9   - 5eef84fb1
OMR      - 0bf455f83
JCL      - b6c1a570c77 based on jdk-17+35)

[2021-09-10T04:22:34.384Z] Running test jdk_lang_0 ...
[2021-09-10T04:22:34.384Z] ===============================================
[2021-09-10T04:22:34.384Z] jdk_lang_0 Start Time: Fri Sep 10 00:22:30 2021 Epoch Time (ms): 1631247750870
[2021-09-10T04:22:34.384Z] "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_aarch64_linux/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_aarch64_linux/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done";
[2021-09-10T04:22:34.384Z] 
[2021-09-10T04:22:34.384Z] Attempting to destroy all caches in cacheDir /home/jenkins/javasharedresources/
[2021-09-10T04:22:34.384Z] 
[2021-09-10T04:22:34.384Z] JVMSHRC806I Compressed references persistent shared cache "sharedcc_jenkins" has been destroyed. Use option -Xnocompressedrefs if you want to destroy a non-compressed references cache.
[2021-09-10T04:22:34.384Z] JVMSHRC005I No shared class caches available
[2021-09-10T04:22:34.384Z] cache cleanup done
[2021-09-10T04:22:34.384Z] variation: -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage Mode150
[2021-09-10T04:22:34.384Z] JVM_OPTIONS:  -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage -XX:+UseCompressedOops 

[2021-09-10T04:30:14.480Z] --------------------------------------------------
[2021-09-10T04:30:14.480Z] TEST: java/lang/invoke/VarHandles/VarHandleTestByteArrayAsFloat.java

[2021-09-10T04:30:14.484Z] REASON: User specified action: run testng/othervm/timeout=360 -Diters=20000 VarHandleTestByteArrayAsFloat 
[2021-09-10T04:30:14.484Z] TIME:   23.593 seconds
[2021-09-10T04:30:14.484Z] messages:
[2021-09-10T04:30:14.484Z] command: testng -Diters=20000 VarHandleTestByteArrayAsFloat
[2021-09-10T04:30:14.484Z] reason: User specified action: run testng/othervm/timeout=360 -Diters=20000 VarHandleTestByteArrayAsFloat 
[2021-09-10T04:30:14.485Z] Mode: othervm [/othervm specified]
[2021-09-10T04:30:14.485Z] elapsed time (seconds): 23.593
[2021-09-10T04:30:14.485Z] configuration:

[2021-09-10T04:30:14.487Z] TEST RESULT: Failed. Unexpected exit from test [exit code: 137]
[2021-09-10T04:30:14.487Z] --------------------------------------------------
[2021-09-10T04:40:12.867Z] Test results: passed: 803; failed: 1
[2021-09-10T04:40:21.330Z] Report written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_aarch64_linux/jvmtest/openjdk/report/html/report.html
[2021-09-10T04:40:21.330Z] Results written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_aarch64_linux/aqa-tests/TKG/output_163124775083/jdk_lang_0/work
[2021-09-10T04:40:21.330Z] Error: Some tests failed or other problems occurred.
[2021-09-10T04:40:21.330Z] 
[2021-09-10T04:40:21.330Z] jdk_lang_0_FAILED

Rerun in Grinder

JasonFengJ9 avatar Sep 10 '21 11:09 JasonFengJ9

@midronij : Although there seem to be failures occurring on other platforms, I'd appreciate it if you could continue to triage this problem to narrow down the issue please.

0xdaryl avatar Sep 13 '21 14:09 0xdaryl

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/21 https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/22 https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64le_linux_Nightly/23

pshipton avatar Sep 13 '21 15:09 pshipton

@babsingh Please take a look at this one as well

tajila avatar Sep 14 '21 13:09 tajila

Google search for exit code: 137 in a Java app

The search indicates that

[Ref 1] the application container memory has been exceeded (NOT the JVM hitting OOM).

[Ref 2] The exit code 137 then resolves to 128 + 9, whereas Signal 9 is SIGKILL, i.e. the process was forcefully killed. This can among others be a "kill -9 " command. However in your case this could be an out of memory condition on the operating system, which causes a functionality called "OOM Killer" to stop the process which is using up most of the memory in order to keep the OS itself stable even in such a condition.

Ref 1: Java Application gets Out of Memory exit code 137 Ref 2: resolving-java-result-137

The above comments suggests that these failures are due to a machine issue.

Observations

  • The failure only occurs on ub16-ppcle-1. The 9 Sept 21 build #19 passed, and it was run on cent7-ppcle-3.
  • The failure is only seen on ppc64le.
  • Different test targets failed in different builds:
    • #21: VarHandleTestAccessLong, VarHandleTestAccessInt, VarHandleTestAccessShort.
    • #22: VarHandleTestAccessInt, VarHandleTestAccessLong.
    • #23: VarHandleTestAccessByte, VarHandleTestAccessShort.
  • No Java error is seen in the test output. All tests successfully pass as per the test log.
  • The tests involve many iterations: -Diters=20000, which suggests a lot of memory is used.

Fix

To test the above hypothesis, we should disable ub16-ppcle-1 and let the jdk17_j9_sanity.openjdk builds run on a different pLinux machine. @AdamBrousseau Can you please help with disabling ub16-ppcle-1?

babsingh avatar Sep 14 '21 17:09 babsingh

I think ub16-ppcle-1 is the only external more modern / faster machine we have (POWER8 (architected) - 8359424 kB), the only machine from osu in the builds. The others are all older hardware? at unb. I can disable it temporarily but that is not a fix. Plus it also occurs internally on other machines, other platforms, see https://github.com/eclipse-openj9/openj9/issues/13368#issuecomment-916834915. It didn't occur before jdk17 with OJDK MH. If there is a JVM problem using too much memory, customers may also hit the problem.

pshipton avatar Sep 14 '21 21:09 pshipton

Machine has been disabled for the builds tonight anyway.

pshipton avatar Sep 14 '21 21:09 pshipton

Note the very first failure is on cent7-ppcle-3 (POWER8E (raw) - 7834176 kB)

pshipton avatar Sep 14 '21 21:09 pshipton

If there is a JVM problem using too much memory, customers may also hit the problem.

In this case, the JVM should throw an OOM. But, no error/exception is thrown. Instead, the OS kills the Java process with a SIGKILL. Even, no diagnostic core files are generated for this failure.

babsingh avatar Sep 16 '21 17:09 babsingh

I can disable it temporarily but that is not a fix.

I ran the test locally and in grinders (on internal pLinux machines) to reproduce the issue. But, I was unable to reproduce the failure.

I do see differences in the failing and passing test output. Based on the difference, the failure occurs while executing the "VarHandle -> Array" testcase in all variants of the failure. Code for the VarHandle -> Array test case. It executes a lot of VarHandle operations on an array. These VH operations translate to calling the Unsafe methods. The OS may be sending a SIGKILL while calling one of these Unsafe methods.

Next steps: Get access to one of the machines where the failure is reproducible. Study the failure locally on the machine with attempts to generate diagnostic data.

Failing test output:

22:45:20  STDOUT:
22:45:20  config VarHandleTestAccessInt.setup(): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Instance final field", VarHandleBaseTest$VarHandleAccessTestCase@780aa1ad): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Instance final field unsupported", VarHandleBaseTest$VarHandleAccessTestCase@de36362d): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Static final field", VarHandleBaseTest$VarHandleAccessTestCase@1139905c): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Static final field unsupported", VarHandleBaseTest$VarHandleAccessTestCase@eabfa1f0): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Instance field", VarHandle -> Instance field): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Instance field unsupported", VarHandleBaseTest$VarHandleAccessTestCase@d2079fae): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Static field", VarHandle -> Static field): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Static field unsupported", VarHandleBaseTest$VarHandleAccessTestCase@ec40226b): success
>>> Java process unexpectedly exits with code: 137 <<<

Passing test output:

22:45:20  config VarHandleTestAccessInt.setup(): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Instance final field", VarHandleBaseTest$VarHandleAccessTestCase@d651b50a): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Instance final field unsupported", VarHandleBaseTest$VarHandleAccessTestCase@536692e9): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Static final field", VarHandleBaseTest$VarHandleAccessTestCase@63cb9c16): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Static final field unsupported", VarHandleBaseTest$VarHandleAccessTestCase@24c2a6c0): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Instance field", VarHandle -> Instance field): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Instance field unsupported", VarHandleBaseTest$VarHandleAccessTestCase@199c0fda): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Static field", VarHandle -> Static field): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Static field unsupported", VarHandleBaseTest$VarHandleAccessTestCase@72269002): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Array", VarHandle -> Array): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Array unsupported", VarHandle -> Array unsupported): success
22:45:20  test VarHandleTestAccessInt.testAccess("VarHandle -> Array index out of bounds", VarHandleBaseTest$VarHandleAccessTestCase@a383b4a0): success
22:45:20  test VarHandleTestAccessInt.testEquals(): success
22:45:20  test VarHandleTestAccessInt.testIsAccessModeSupported(java.lang.invoke.VarHandleInts$FieldInstanceReadWrite@f53bc7f1): success
22:45:20  test VarHandleTestAccessInt.testIsAccessModeSupported(java.lang.invoke.VarHandleInts$FieldStaticReadWrite@59738b60): success
22:45:20  test VarHandleTestAccessInt.testIsAccessModeSupported(java.lang.invoke.VarHandleInts$Array@264eda3c): success
22:45:20  test VarHandleTestAccessInt.testLookupInstanceToStatic(): success
22:45:20  test VarHandleTestAccessInt.testLookupStaticToInstance(): success
22:45:20  test VarHandleTestAccessInt.testTypes(java.lang.invoke.VarHandleInts$FieldInstanceReadWrite@f53bc7f1, [class VarHandleTestAccessInt]): success
22:45:20  test VarHandleTestAccessInt.testTypes(java.lang.invoke.VarHandleInts$FieldStaticReadWrite@59738b60, []): success
22:45:20  test VarHandleTestAccessInt.testTypes(java.lang.invoke.VarHandleInts$Array@264eda3c, [class [I, int]): success

babsingh avatar Sep 16 '21 17:09 babsingh

Instead, the OS kills the Java process with a SIGKILL. Even, no diagnostic core files are generated for this failure.

This could indicate a native memory leak. You could try running the test with memory tracking enabled, or using OS memory tools to look at the memory used by the process while it's still running.

Get access to one of the machines where the failure is reproducible.

Since ub16-ppcle-1 is disabled in jenkins, it's sitting idle.

pshipton avatar Sep 16 '21 20:09 pshipton

Something like -Xcheck:memory:noscan,quick,callsite=1000 will dump a summary of allocated memory every 1000 allocations. This will only find leaks that use "malloc". If it's virtual memory, perhaps as allocated as workspace for the jit, using an OS tool is better.

pshipton avatar Sep 16 '21 20:09 pshipton

Even if not a leak, it could indicate we're using too much native memory and need to be more efficient. It may be worth running the same test with Hotspot on the same machine.

pshipton avatar Sep 16 '21 20:09 pshipton

Since ub16-ppcle-1 is disabled in jenkins, it's sitting idle.

in the process of acquiring access to ssh and run a grinder on the above machine. fyi @AdamBrousseau.

babsingh avatar Sep 16 '21 20:09 babsingh

Added Babneet's key to the machine. Enabled the machine in Jenkins so he can run grinder on it too. Removed the test label so other jobs don't land on it. This will need to be put back when he's done.

AdamBrousseau avatar Sep 17 '21 15:09 AdamBrousseau

Summary

This issue only happens with the JIT enabled in JDK17. The issue is neither seen with -Xint nor JDK11.

I have personally seen high RES memory usage on pLinux and xLinux. As per https://github.com/eclipse-openj9/openj9/issues/13368#issuecomment-916834915, this failure was also seen on zLinux and aarch64. Hence, this behaviour most probably happens across all platforms.

fyi @0xdaryl

More Details

As per the dmesg output, the oom-killer was called due to excessive resident memory usage.

I monitored memory usage via multiple tools: valgrind -tool=massif, top, -Xcheck:memory:noscan,quick,callsite=1000 and javacores.

valgrind -tool=massif and -Xcheck:memory.* didn't track the excessive memory usage.

top showed that excessive and in some cases all of the RESident memory (RAM) was used by the OpenJ9 process. In the latter case, the oom-killer was invoked.

I created javacores at startup and shutdown. The only memory exhausted was the "Internal Memory" region, which is only used by the JIT.

With -Xint, RES memory consistently stayed at

  • ~50MB with JDK17 on pLinux (ppcle64)
  • ~100MB with JDK17 and JDK11 on xLinux (x86)

With the JIT enabled, RES memory varied throughout while reaching high numbers

  • 100MB-4GB with JDK17 on pLinux (ppcle64)
  • 100MB-2GB with JDK17 on xLinux (x86)
  • ~200MB with JDK11 on xLinux (x86)

dmesg Output

[  +0.000001] Out of memory: Kill process 19636 (java) score 469 or sacrifice child
[  +0.000448] Killed process 19636 (java) total-vm:6767872kB, anon-rss:3904960kB, file-rss:0kB
[Sep20 16:44] JIT Compilation invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[  +0.000006] JIT Compilation cpuset=/ mems_allowed=0
[  +0.000059] CPU: 3 PID: 26956 Comm: JIT Compilation Not tainted 4.4.0-210-generic #242-Ubuntu
[  +0.000002] Call Trace:
[  +0.000057] [c0000000fec37650] [c000000000b3a118] dump_stack+0xc4/0x108 (unreliable)
[  +0.000003] [c0000000fec37690] [c000000000b36d80] dump_header+0x84/0x210
[  +0.000028] [c0000000fec37760] [c00000000023c528] oom_kill_process+0x3d8/0x590
[  +0.000003] [c0000000fec37820] [c00000000023cd94] out_of_memory+0x5f4/0x610
[  +0.000002] [c0000000fec378d0] [c000000000245b20] __alloc_pages_nodemask+0xf50/0xf90
[  +0.000011] [c0000000fec37ab0] [c0000000002ae154] alloc_pages_current+0xb4/0x1a0
[  +0.000003] [c0000000fec37b00] [c0000000002377ec] __page_cache_alloc+0x11c/0x160
[  +0.000002] [c0000000fec37b40] [c00000000023a378] filemap_fault+0x1f8/0x500
[  +0.000002] [c0000000fec37bd0] [c0000000003c4540] ext4_filemap_fault+0x50/0x80
[  +0.000004] [c0000000fec37c10] [c00000000027aae8] __do_fault+0xa8/0x1a0
[  +0.000002] [c0000000fec37cb0] [c00000000028081c] handle_mm_fault+0xd9c/0x1940
[  +0.000002] [c0000000fec37d80] [c000000000b30664] do_page_fault+0x364/0x830
[  +0.000003] [c0000000fec37e30] [c0000000000088e4] handle_page_fault+0x10/0x30
[  +0.000008] Mem-Info:
[  +0.000006] active_anon:125883 inactive_anon:1449 isolated_anon:0
               active_file:24 inactive_file:11 isolated_file:0
               unevictable:74 dirty:0 writeback:0 unstable:0
               slab_reclaimable:800 slab_unreclaimable:1490
               mapped:129 shmem:1577 pagetables:112 bounce:0
               free:170 free_pcp:15 free_cma:0

Memory Profiles + Allocation Trees

Javacores: Startup and Shutdown

babsingh avatar Sep 22 '21 13:09 babsingh

An occurrence of exit code (137) observed at job/Test_openjdk17_j9_special.system_s390x_linux_Personal_testList_0/40/tapResults/ (ub18s390xrt-1-2)

===============================================
Running test DaaLoadTest_daa1_special_5m_20 ...
===============================================
DaaLoadTest_daa1_special_5m_20 Start Time: Thu Sep 23 21:25:27 2021 Epoch Time (ms): 1632457527204
variation: Mode614
JVM_OPTIONS:  -Xcompressedrefs -Xgcpolicy:gencon -Xjit:counts=- - - - - - 1 1 1 1000 250 250 - - - 10000 100000 10000,gcOnResolve,rtResolve,sampleInterval=2,scorchingSampleThreshold=10000,quickProfile -Xmn512k -Xcheck:gc:vmthreads:all:quiet 
STF 21:25:27.280 - =========================   S T F   =========================

openjdk version "17-internal" 2021-09-14
OpenJDK Runtime Environment (build 17-internal+0-adhoc.jenkins.BuildJDK17s390xlinuxPersonal)
Eclipse OpenJ9 VM (build v0.28.0-release-35fad07466a, JRE 17 Linux s390x-64-Bit Compressed References 20210923_57 (JIT enabled, AOT enabled)
OpenJ9   - 35fad07466a
OMR      - e5975af44fc
JCL      - 41171df43e2 based on jdk-17+35)

DLT 21:26:31.681 - Completed 20.0%. Number of tests started=131 (+40)
DLT stderr Killed
STF 21:28:53.138 - **FAILED** Process DLT ended with exit code (137) and not the expected exit code/s (0)
STF 21:28:53.141 - Monitoring Report Summary:
STF 21:28:53.147 -   o Process DLT ended with exit code (137) and not the expected exit code/s (0)
STF 21:28:53.248 - Killing processes: DLT
STF 21:28:53.249 -   o Process DLT pid 18298 is not running
**FAILED** at step 1 (Run daa load test). Expected return value=0 Actual=1 at /home/jenkins/workspace/Test_openjdk17_j9_special.system_s390x_linux_Personal_testList_0/aqa-tests/TKG/../TKG/output_16324470864615/DaaLoadTest_daa1_special_5m_20/20210923-212527-DaaLoadTest/execute.pl line 116.
STF 21:28:54.126 - **FAILED** execute script failed. Expected return value=0 Actual=1
STF 21:28:54.134 - 
STF 21:28:54.140 - ====================   T E A R D O W N   ====================
STF 21:28:54.140 - Running teardown: perl /home/jenkins/workspace/Test_openjdk17_j9_special.system_s390x_linux_Personal_testList_0/aqa-tests/TKG/../TKG/output_16324470864615/DaaLoadTest_daa1_special_5m_20/20210923-212527-DaaLoadTest/tearDown.pl
STF 21:28:54.442 - TEARDOWN stage completed
STF 21:28:54.451 - 
STF 21:28:54.451 - =====================   R E S U L T S   =====================
STF 21:28:54.451 - Stage results:
STF 21:28:54.452 -   setUp:     pass
STF 21:28:54.453 -   execute:  *fail*
STF 21:28:54.453 -   teardown:  pass
STF 21:28:54.453 - 
STF 21:28:54.453 - Overall result: **FAILED**

DaaLoadTest_daa1_special_5m_20_FAILED

JasonFengJ9 avatar Sep 24 '21 11:09 JasonFengJ9

I looked at the verbose log from https://github.com/eclipse-openj9/openj9/issues/13368#issuecomment-924930306 and saw 98 JIT scratch segments of 16MB each. There is just one compilation thread active, which most likely is performing a compilation. A compilation thread is not allowed to use more than 256MB of scratch memory, but for JSR292 methods we allow 6 times that value, 1.5GB. In the test I see slightly more than 1.5GB being used, but I would like to check whether limiting that amount reduces the scratch memory consumption. Could we please try some runs with -Xjit:scratchSpaceFactorWhenJSR292Workload=1 The test may already set -Xjit: options, so maybe the option I suggested could be added with export TR_Options="scratchSpaceFactorWhenJSR292Workload=1" If this solves the issue we need to understand which compilation is so memory hungry. For that we may need a JIT verbose log from -Xjit:verbose={compilePerformance},vlog=vlog.txt

mpirvu avatar Sep 24 '21 14:09 mpirvu