
JDK 18/19 jdk_vector_0_FAILED Unexpected exit from test [exit code: 137]

JasonFengJ9 opened this issue 2 years ago

Failure link

From an internal build job/Test_openjdknext_j9_extended.openjdk_aarch64_linux_Personal/1/consoleFull(ub18-aarch64-7):

11:36:29  openjdk version "19-internal" 2022-09-20
11:36:29  OpenJDK Runtime Environment (build 19-internal-adhoc.jenkins.BuildJDKnextaarch64linuxPersonal)
11:36:29  Eclipse OpenJ9 VM (build exclude19-52f04efbff5, JRE 19 Linux aarch64-64-Bit Compressed References 20220607_35 (JIT enabled, AOT enabled)
11:36:29  OpenJ9   - 52f04efbff5
11:36:29  OMR      - c60867497c6
11:36:29  JCL      - 5ccf02de16a based on jdk-19+25)

Rerun in Grinder - Change TARGET to run only the failed test targets.

Optional info

Failure output (captured from console output)

14:55:40  ===============================================
14:55:40  Running test jdk_vector_0 ...
14:55:40  ===============================================
14:55:40  jdk_vector_0 Start Time: Tue Jun  7 14:55:40 2022 Epoch Time (ms): 1654628140115
14:55:40  "/home/jenkins/workspace/Test_openjdknext_j9_extended.openjdk_aarch64_linux_Personal/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdknext_j9_extended.openjdk_aarch64_linux_Personal/openjdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done";
14:55:40  JVMSHRC005I No shared class caches available
14:55:40  JVMSHRC005I No shared class caches available
14:55:40  cache cleanup done
14:55:40  variation: Mode150
14:55:40  JVM_OPTIONS:  -XX:+UseCompressedOops 

15:00:05  TEST: jdk/incubator/vector/Byte128VectorTests.java

15:00:05  TEST: jdk/incubator/vector/Byte256VectorTests.java

15:11:53  TEST: jdk/incubator/vector/Int128VectorTests.java

15:23:13  TEST: jdk/incubator/vector/Short256VectorTests.java

15:23:13  TEST RESULT: Failed. Unexpected exit from test [exit code: 137]

15:23:13  --------------------------------------------------
15:33:12  Test results: passed: 67; failed: 4

15:33:31  jdk_vector_0_FAILED

JasonFengJ9 avatar Jun 08 '22 14:06 JasonFengJ9

@gita-omr

pshipton avatar Jun 08 '22 22:06 pshipton

Looking.

gita-omr avatar Jun 09 '22 18:06 gita-omr

An internal build (cent7-aarch64-6)

openjdk version "18.0.1.1" 2022-04-22
IBM Semeru Runtime Open Edition 18.0.2.0-m2 (build 18.0.1.1+2)
Eclipse OpenJ9 VM 18.0.2.0-m2 (build master-8ead00aa6, JRE 18 Linux aarch64-64-Bit Compressed References 20220714_19 (JIT enabled, AOT enabled)
OpenJ9   - 8ead00aa6
OMR      - 29756eb80
JCL      - c0b6f04919 based on jdk-18.0.1.1+2)

[2022-07-14T23:45:13.370Z] variation: Mode150
[2022-07-14T23:45:13.370Z] JVM_OPTIONS:  -XX:+UseCompressedOops 

[2022-07-15T00:13:52.837Z] TEST RESULT: Failed. Unexpected exit from test [exit code: 137]
[2022-07-15T00:13:52.837Z] --------------------------------------------------
[2022-07-15T00:22:18.492Z] Test results: passed: 71; failed: 1
[2022-07-15T00:22:30.546Z] Report written to /home/jenkins/workspace/Test_openjdk18_j9_extended.openjdk_aarch64_linux/jvmtest/openjdk/report/html/report.html
[2022-07-15T00:22:30.546Z] Results written to /home/jenkins/workspace/Test_openjdk18_j9_extended.openjdk_aarch64_linux/aqa-tests/TKG/output_16578313613568/jdk_vector_0/work
[2022-07-15T00:22:30.546Z] Error: Some tests failed or other problems occurred.
[2022-07-15T00:22:30.546Z] 
[2022-07-15T00:22:30.546Z] jdk_vector_0_FAILED

JasonFengJ9 avatar Jul 18 '22 22:07 JasonFengJ9

FYI @Akira1Saitoh

knn-k avatar Jul 19 '22 04:07 knn-k

An internal build (ub18-aarch64-8)

[2022-08-04T22:30:39.700Z] variation: Mode150
[2022-08-04T22:30:39.700Z] JVM_OPTIONS:  -XX:+UseCompressedOops 

[2022-08-04T22:50:43.793Z] TEST: jdk/incubator/vector/Long128VectorTests.java
[2022-08-04T22:50:43.806Z] TEST RESULT: Failed. Unexpected exit from test [exit code: 137]
[2022-08-04T22:50:43.806Z] --------------------------------------------------
[2022-08-04T22:52:40.943Z] TEST: jdk/incubator/vector/LongMaxVectorTests.java

[2022-08-04T22:52:40.951Z] TEST RESULT: Failed. Unexpected exit from test [exit code: 137]
[2022-08-04T22:52:40.951Z] --------------------------------------------------
[2022-08-04T22:58:11.608Z] TEST: jdk/incubator/vector/ShortMaxVectorTests.java
[2022-08-04T22:58:11.614Z] TEST RESULT: Failed. Unexpected exit from test [exit code: 137]
[2022-08-04T22:58:11.614Z] --------------------------------------------------
[2022-08-04T23:06:28.986Z] Test results: passed: 69; failed: 3
[2022-08-04T23:06:34.769Z] Report written to /home/jenkins/workspace/Test_openjdk18_j9_extended.openjdk_aarch64_linux/jvmtest/openjdk/report/html/report.html
[2022-08-04T23:06:34.769Z] Results written to /home/jenkins/workspace/Test_openjdk18_j9_extended.openjdk_aarch64_linux/aqa-tests/TKG/output_16596423323148/jdk_vector_0/work
[2022-08-04T23:06:34.769Z] Error: Some tests failed or other problems occurred.
[2022-08-04T23:06:34.769Z] 
[2022-08-04T23:06:34.769Z] jdk_vector_0_FAILED

JasonFengJ9 avatar Aug 05 '22 18:08 JasonFengJ9

@knn-k any update on this?

tajila avatar Oct 18 '22 13:10 tajila

It seems that the Linux kernel OOM killer terminates the java process running the jdk_vector_0 tests. One of the tests consumes more than 4GB of memory, and multiple tests run in parallel, so together they exhaust the system memory (8GB in this case) and trigger the OOM killer. Below is an excerpt from dmesg on an internal machine (ub18-aarch64).

[3340113.392290] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[3340113.392296] [  498]     0   498    68302     2111   540672     1553             0 systemd-journal
[3340113.392298] [  507]     0   507    21532        2    53248       66             0 lvmetad
[3340113.392300] [  517]     0   517     4212      116    57344      205         -1000 systemd-udevd
[3340113.392302] [  667]   100   667     2842       29    61440      127             0 systemd-network
[3340113.392303] [  685]     0   685     1631       22    57344      110             0 rpcbind
[3340113.392305] [  794]     0   794    96581      106   110592      347             0 udisksd
[3340113.392306] [  801]   103   801     6205      160    90112      329          -900 dbus-daemon
[3340113.392308] [  845]     0   845      912        8    49152       45             0 atd
[3340113.392309] [  851]     0   851    20096       37    53248       61             0 irqbalance
[3340113.392311] [  858]     0   858   166867        0   126976      249             0 lxcfs
[3340113.392313] [  864]     0   864    26124        2    98304     2042             0 networkd-dispat
[3340113.392314] [  866]     0   866     2675       27    57344       46             0 cron
[3340113.392316] [  875]     0   875    59565       49    98304      230             0 accounts-daemon
[3340113.392318] [  876]   102   876    59380      133   106496      591             0 rsyslogd
[3340113.392319] [  889]     0   889    57868      100    81920      146             0 polkitd
[3340113.392321] [  899]   101   899     2620       21    61440      161             0 systemd-resolve
[3340113.392323] [ 1121]     0  1121   444036     1599   364544     4125          -999 containerd
[3340113.392324] [ 1138]   111  1138    22000       35    86016      329             0 ntpd
[3340113.392326] [ 1143]     0  1143     2605        2    61440      181         -1000 sshd
[3340113.392328] [ 1310]     0  1310   365408     2703   442368     7434          -500 dockerd
[3340113.392329] [ 1518]     0  1518     2335        0    53248       34             0 agetty
[3340113.392331] [ 1532]     0  1532     2391        0    53248       35             0 agetty
[3340113.392333] [ 1002]     0  1002     3090        0    61440      283             0 systemd
[3340113.392334] [ 1003]     0  1003    43283        1   110592      746             0 (sd-pam)
[3340113.392337] [15684]     0 15684     6974        2    94208      475             0 sshd
[3340113.392339] [15751]  1000 15751     8127       51   106496     1581             0 sshd
[3340113.392340] [15806]  1000 15806  1557255     9367   819200    27520             0 java
[3340113.392342] [21187]  1000 21187   321581        0   368640     2606             0 Xvfb
[3340113.392344] [21230]  1000 21230      480        5    40960       20             0 sh
[3340113.392346] [21231]  1000 21231      480       19    40960       10             0 sh
[3340113.392347] [21232]  1000 21232      480        0    36864       26             0 sh
[3340113.392348] [21238]  1000 21238     2311        2    53248       86             0 make
[3340113.392350] [21284]  1000 21284     2340        2    53248      117             0 make
[3340113.392351] [21388]  1000 21388      480        0    36864       25             0 sh
[3340113.392353] [21389]  1000 21389     2339        1    53248      119             0 make
[3340113.392354] [21391]  1000 21391      480        0    45056       25             0 sh
[3340113.392355] [21392]  1000 21392     2342        2    49152      127             0 make
[3340113.392357] [21451]  1000 21451      480        0    45056       27             0 sh
[3340113.392358] [21452]  1000 21452      480        0    45056       30             0 sh
[3340113.392359] [21453]  1000 21453     2211        0    53248       27             0 tee
[3340113.392361] [21455]  1000 21455  1154331     5839   622592    10454             0 java
[3340113.392362] [21714]  1000 21714   777895        0   696320    30295             0 java
[3340113.392364] [21743]  1000 21743   777508        0   679936    29266             0 java
[3340113.392365] [21773]  1000 21773   777766      983   700416    28785             0 java
[3340113.392367] [23404]  1000 23404  1621004   733817  7254016    65847             0 java
[3340113.392369] [23565]  1000 23565  1891469  1000758  9441280    24641             0 java
[3340113.392370] [23705]  1000 23705   964486   158287  2023424     8570             0 java
[3340113.392372] [23769]  1000 23769     2209       17    53248        0             0 sleep
[3340113.392374] [23777]     0 23777     2532      135    57344        0             0 systemd-logind
[3340113.392375] Out of memory: Kill process 23565 (java) score 448 or sacrifice child
[3340113.438997] Killed process 23565 (java) total-vm:7565876kB, anon-rss:4003052kB, file-rss:0kB, shmem-rss:0kB
[3340113.654549] oom_reaper: reaped process 23565 (java), now anon-rss:4kB, file-rss:0kB, shmem-rss:0kB

Akira1Saitoh avatar Oct 19 '22 13:10 Akira1Saitoh
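A quick sanity check on the excerpt above (editor's sketch): the per-process sizes in the dmesg OOM table are in 4 KiB pages, so the rss of the killed process converts to roughly the 4 GB reported in the kill message:

```shell
# dmesg's OOM table reports total_vm and rss in 4 KiB pages.
rss_pages=1000758                     # rss of pid 23565 from the table above
rss_kb=$(( rss_pages * 4 ))           # 4003032 kB, ~= the reported anon-rss:4003052kB
echo "$(( rss_kb / 1024 )) MiB"       # prints "3909 MiB", i.e. ~3.8 GiB
```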

@gita-omr - Any new news?

vij-singh avatar Oct 25 '22 13:10 vij-singh

@Akira1Saitoh has some interesting findings. The tests themselves are actually very small, but it looks like many of them run in parallel. I am not very familiar with TestNG, but maybe there are settings to run fewer processes/threads? I think this would be the first thing to try. @JasonFengJ9

gita-omr avatar Oct 25 '22 14:10 gita-omr

The tests themselves are actually very small, but it looks like many of them run in parallel. I am not very familiar with TestNG, but maybe there are settings to run fewer processes/threads?

@gita-omr are you referring to running individual tests such as jdk/incubator/vector/Byte128VectorTests.java (there is a Grinder option to do so) instead of the whole jdk_vector_0 target?

JasonFengJ9 avatar Nov 01 '22 22:11 JasonFengJ9

@gita-omr are you referring to running individual tests such as jdk/incubator/vector/Byte128VectorTests.java (there is a Grinder option to do so) instead of the whole jdk_vector_0 target?

Maybe we can try that?

gita-omr avatar Nov 02 '22 04:11 gita-omr

Is it possible to use a jtreg option to set the concurrency to a lower value?

Akira1Saitoh avatar Nov 02 '22 04:11 Akira1Saitoh

I am seeing the failure on an internal machine with 8 CPU cores and 8 GB of memory. The test does not fail on another machine with 8 CPU cores and 16 GB of memory.

Akira1Saitoh avatar Nov 02 '22 04:11 Akira1Saitoh

@gita-omr are you referring to running individual tests such as jdk/incubator/vector/Byte128VectorTests.java (there is a Grinder option to do so) instead of the whole jdk_vector_0 target?

Maybe we can try that?

An example internal build from https://github.com/eclipse-openj9/openj9/issues/16244

Is it possible to use a jtreg option to set the concurrency to a lower value?

@llxia any insights?

JasonFengJ9 avatar Nov 02 '22 11:11 JasonFengJ9

Is it possible to use a jtreg option to set the concurrency to a lower value?

In OpenJDK tests, we set concurrency as follows: min(NPROCS/2, MEM_IN_GB/2). To change the concurrency, change L66 in your branch. https://github.com/adoptium/aqa-tests/blob/d5e241df7fe9a7648520efe4ebe1defc5fa5fd7d/openjdk/openjdk.mk#L46-L66

llxia avatar Nov 02 '22 12:11 llxia
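To make the arithmetic concrete for the machines in question, here is a minimal shell sketch of that heuristic (variable names are mine, not the ones in openjdk.mk):

```shell
# min(NPROCS/2, MEM_IN_GB/2), evaluated for the failing aarch64 box.
nprocs=8                              # 8 CPU cores
mem_mb=7973                           # detected memory in MB
half_procs=$(( nprocs / 2 ))          # 4
half_mem_gb=$(( mem_mb / 1024 / 2 ))  # 7973 MB -> 7 GB -> 3
if [ "$half_procs" -lt "$half_mem_gb" ]; then
    conc=$half_procs
else
    conc=$half_mem_gb
fi
echo "$conc"                          # prints 3
```

With 16 GB of memory the same formula yields min(4, 7) = 4 parallel tests with twice the RAM to share, which is consistent with the better-provisioned machine not hitting the OOM killer.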

Hmm, the concurrency on the aarch64 machine (8GB memory) is 3, while it is 1 on the x86 machine (ub20x64rt3) with 4GB memory. This might be the reason why the test fails only on aarch64.

on aarch64:

$ cat Makefile 

MEMORY_SIZE:=$(shell KMEMMB=`awk '/^MemTotal:/{print int($$2/1024)}' /proc/meminfo`; if [ -r /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then CGMEM=`cat /sys/fs/cgroup/memory/memory.limit_in_bytes`; else CGMEM=`expr $${KMEMMB} \* 1024`; fi; CGMEMMB=`expr $${CGMEM} / 1048576`; if [ "$${KMEMMB}" -lt "$${CGMEMMB}" ]; then echo "$${KMEMMB}"; else echo "$${CGMEMMB}"; fi)

All:
        @echo ${MEMORY_SIZE}
$ make
7973
$ cat /proc/meminfo 
MemTotal:        8165204 kB

on x86:

$ make
3931

$ cat /proc/meminfo 
MemTotal:        4025600 kB

Akira1Saitoh avatar Nov 04 '22 08:11 Akira1Saitoh
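Unpacked into plain shell, the MEMORY_SIZE one-liner above does the following (a sketch; the function name is mine):

```shell
# Report available memory in MB: the smaller of the kernel's MemTotal and
# the cgroup v1 memory limit (when one is readable).
detect_memory_mb() {
    kmem_mb=$(awk '/^MemTotal:/{print int($2/1024)}' /proc/meminfo)
    if [ -r /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
        cg_mb=$(( $(cat /sys/fs/cgroup/memory/memory.limit_in_bytes) / 1048576 ))
    else
        cg_mb=$kmem_mb
    fi
    if [ "$kmem_mb" -lt "$cg_mb" ]; then echo "$kmem_mb"; else echo "$cg_mb"; fi
}
detect_memory_mb   # e.g. 7973 on the aarch64 machine, 3931 on the x86 one
```

Outside a container the cgroup limit is effectively unlimited, so the kernel's MemTotal wins, as it did on both machines here.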

@llxia Thanks for the pointer. Is it possible to change the concurrency only for jdk_vector test?

Akira1Saitoh avatar Nov 04 '22 08:11 Akira1Saitoh

@Akira1Saitoh I think it is possible to do it with new parameters:

In https://github.com/adoptium/aqa-tests/blob/d5e241df7fe9a7648520efe4ebe1defc5fa5fd7d/openjdk/openjdk.mk,

EXTRA_JTREG_OPTIONS += -concurrency:$(JTREG_CONC)
...
# set JTREG_BASIC_OPTIONS value into a new parameter before adding EXTRA_JTREG_OPTIONS
JTREG_BASIC_OPTIONS_WO_EXTRA_OPTS = $(JTREG_BASIC_OPTIONS)
JTREG_BASIC_OPTIONS += $(EXTRA_JTREG_OPTIONS)

# add another new parameter for concurrency
SPECIAL_CONCURRENCY=xxxx

In https://github.com/adoptium/aqa-tests/blob/d5e241df7fe9a7648520efe4ebe1defc5fa5fd7d/openjdk/playlist.xml#L1832, replace $(JTREG_BASIC_OPTIONS) with $(JTREG_BASIC_OPTIONS_WO_EXTRA_OPTS) $(SPECIAL_CONCURRENCY)

llxia avatar Nov 04 '22 16:11 llxia
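Consolidated, the suggestion above might look like this in openjdk.mk (editor's sketch; the concrete concurrency value is illustrative, not the exact merged patch):

```makefile
# Keep a copy of the options before EXTRA_JTREG_OPTIONS (which carries the
# default -concurrency:$(JTREG_CONC)) is appended.
JTREG_BASIC_OPTIONS_WO_EXTRA_OPTS := $(JTREG_BASIC_OPTIONS)
JTREG_BASIC_OPTIONS += $(EXTRA_JTREG_OPTIONS)

# Reduced concurrency, referenced only by the jdk_vector target in playlist.xml:
#   ... $(JTREG_BASIC_OPTIONS_WO_EXTRA_OPTS) $(SPECIAL_CONCURRENCY) ...
SPECIAL_CONCURRENCY := -concurrency:1
```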

Thanks @llxia. I opened https://github.com/adoptium/aqa-tests/pull/4119. Launched the internal Grinder job job/Grinder/29324/ with that change.

Akira1Saitoh avatar Nov 07 '22 08:11 Akira1Saitoh

Launched the internal Grinder job job/Grinder/29324/ with that change.

Grinder/29324 passed 8 times. It failed at the end because a FlowNode was not found, which is unrelated to the change.

llxia avatar Nov 07 '22 18:11 llxia

Re-launched the job. (job/Grinder/29389/)

Akira1Saitoh avatar Nov 07 '22 23:11 Akira1Saitoh

Grinder/29389 passed 13 times. It failed due to the FlowNode was not found error again.

Akira1Saitoh avatar Nov 08 '22 23:11 Akira1Saitoh

Re-launched the Grinder. 20 iterations passed. (/job/Grinder/29428/)

Akira1Saitoh avatar Nov 09 '22 09:11 Akira1Saitoh

@Akira1Saitoh Do you think this will be resolved within 2 weeks?

tajila avatar Nov 10 '22 15:11 tajila

@tajila https://github.com/adoptium/aqa-tests/pull/4119 was already merged. Since then, I have not seen this failure in several Grinder runs. I believe this issue can be closed now. To address the excessive memory consumption when running jdk_vector_0, I will open another issue. The JIT uses up its scratch memory (1.5GB) when compiling some methods in the test.

Akira1Saitoh avatar Nov 11 '22 00:11 Akira1Saitoh

I opened https://github.com/eclipse-openj9/openj9/issues/16309.

Akira1Saitoh avatar Nov 11 '22 09:11 Akira1Saitoh

@Akira1Saitoh once https://github.com/eclipse-openj9/openj9/issues/16309 is resolved, will we revert https://github.com/adoptium/aqa-tests/pull/4119?

llxia avatar Nov 11 '22 13:11 llxia

@llxia yes, but I do not think we can resolve #16309 anytime soon.

Akira1Saitoh avatar Nov 11 '22 13:11 Akira1Saitoh

Can this issue (#15268) be closed now given we have #16309 ?

vij-singh avatar Nov 22 '22 14:11 vij-singh

With the test case change, it sounds like this issue should be closed.

0xdaryl avatar Nov 22 '22 14:11 0xdaryl