
Hotspot test serviceability/sa/ClhsdbCDSCore.java hangs on Ubuntu 16.04/x64

Open zzambers opened this issue 1 year ago • 13 comments

I can see that this test hangs on Adoptium infra and gets killed on timeout (seems to be reliable): serviceability/sa/ClhsdbCDSCore.java

I can see this both in the dev.openjdk run and when run in a Grinder.

Output:

Starting ClhsdbCDSCore test
Command line: [/home/jenkins/workspace/Grinder/jdkbinary/j2sdk-image/bin/java -cp /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17265736059645/hotspot_custom_0/work/classes/0/serviceability/sa/ClhsdbCDSCore.d:/home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/hotspot/jtreg/serviceability/sa:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17265736059645/hotspot_custom_0/work/classes/0/test/lib:/home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/lib:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/javatest.jar:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/jtreg.jar -ea -esa -Xmx512m -XX:+UseCompressedOops -Xshare:dump -Xlog:cds,cds+hashtables -XX:SharedArchiveFile=./ArchiveForClhsdbCDSCore.jsa ]
[2024-09-17T11:46:52.145720Z] Gathering output for process 25719
[ELAPSED: 447 ms]
[logging stdout to serviceability.sa.ClhsdbCDSCore.java-0000-dump.stdout]
[logging stderr to serviceability.sa.ClhsdbCDSCore.java-0000-dump.stderr]
[STDERR]

[2024-09-17T11:46:52.603422Z] Waiting for completion for process 25719
[2024-09-17T11:46:52.603687Z] Waiting for completion finished for process 25719
Command line: [/home/jenkins/workspace/Grinder/jdkbinary/j2sdk-image/bin/java -cp /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17265736059645/hotspot_custom_0/work/classes/0/serviceability/sa/ClhsdbCDSCore.d:/home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/hotspot/jtreg/serviceability/sa:/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17265736059645/hotspot_custom_0/work/classes/0/test/lib:/home/jenkins/workspace/Grinder/aqa-tests/openjdk/openjdk-jdk/test/lib:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/javatest.jar:/home/jenkins/workspace/Grinder/jvmtest/openjdk/jtreg/lib/jtreg.jar -ea -esa -Xmx512m -XX:+UseCompressedOops -Xmx512m -XX:+UnlockDiagnosticVMOptions -XX:SharedArchiveFile=ArchiveForClhsdbCDSCore.jsa -XX:+CreateCoredumpOnCrash -Xshare:auto -XX:+ProfileInterpreter --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED -XX:-AlwaysPreTouch CrashApp ]
[2024-09-17T11:46:52.610596Z] Gathering output for process 25735
[2024-09-17T11:46:52.611510Z] Waiting for completion for process 25735
[2024-09-17T11:46:52.628039Z] Waiting for completion finished for process 25735
Run test with ulimit -c: unlimited
[2024-09-17T11:46:52.630845Z] Gathering output for process 25738
Timeout signalled after 19200 seconds

Notes: I have tried to reproduce this locally and on our infra, both by invoking jtreg manually and through aqa-tests, but failed to reproduce it. Maybe it is an infra/environment issue? The test first intentionally crashes the VM using the Unsafe class to produce a core file. However, this hangs when run on Adoptium infra. Maybe something with core dump settings? I don't know.
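
As a reference for what "intentionally crashes the VM using the Unsafe class" can look like, here is a minimal sketch, assuming the child process simply writes through a null address via jdk.internal.misc.Unsafe; the class name CrashSketch is made up for this example and this is not the actual test code:

```java
// Illustrative sketch only: crash the JVM on purpose so the OS can produce a core file.
// Run with: java --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED \
//                -XX:+CreateCoredumpOnCrash CrashSketch
// Whether a core actually lands on disk depends on "ulimit -c" and the kernel's
// core_pattern settings, which is why core dump configuration is a plausible suspect.
import jdk.internal.misc.Unsafe;

public class CrashSketch {
    public static void main(String[] args) {
        Unsafe unsafe = Unsafe.getUnsafe();
        // Writing to address 0 triggers a SIGSEGV, which HotSpot turns into a fatal
        // error and (with -XX:+CreateCoredumpOnCrash) a core dump request.
        unsafe.putAddress(0L, 0L);
    }
}
```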

zzambers avatar Sep 18 '24 13:09 zzambers

This could be related to JDK-8283410, but on Adoptium infra it seems to affect Linux (not Windows?).

zzambers avatar Sep 18 '24 13:09 zzambers

@zzambers I ran ClhsdbCDSCore.java on a different agent and it passed: https://ci.adoptium.net/view/Test_grinder/job/Grinder/10970/ (the failed one is due to no test being selected). So it might be related to infra, since you can't reproduce it in your environment. Could you please move it to the infra repo? Or I can move it if you agree.

sophia-guo avatar Sep 23 '24 19:09 sophia-guo

@sophia-guo by moving, do you mean filing the same issue there and closing this one?

zzambers avatar Sep 24 '24 20:09 zzambers

There is a transfer issue link on the right side of the issue (see the attached screenshot).

I'm not sure if it's clickable for you, as it may depend on permissions. I will just do it.

sophia-guo avatar Sep 25 '24 13:09 sophia-guo

> @zzambers I ran ClhsdbCDSCore.java on a different agent and it passed: https://ci.adoptium.net/view/Test_grinder/job/Grinder/10970/ (the failed one is due to no test being selected). So it might be related to infra, since you can't reproduce it in your environment. Could you please move it to the infra repo? Or I can move it if you agree.

@sophia-guo Can you get a list of which machines/distributions it passes and fails on? Yours was run on RHEL. Both of zzambers' runs were on an (old, out-of-support) Ubuntu distribution (although neither was in a container). At the moment I'm not sure we have enough information to take action on this in the infrastructure repo, since it's not clear what is needed to resolve it.

sxa avatar Oct 05 '24 09:10 sxa

There are recent dev.hotspot runs which look clean - was this test removed and is it still considered a problem?

sxa avatar Nov 22 '24 15:11 sxa

I tried kicking off some grinders for testing (based on JDK11, since that's what the dev.openjdk link in the description was pointing at) but got 15:21:35 Error: Cannot find file: /home/jenkins/workspace/Grinder/aqa-tests/TKG/../openjdk/openjdk-jdk/test/jdk/serviceability/sa/ClhsdbCDSCore.java, which suggests that this test may no longer be valid:

  • ~~RHEL7: https://ci.adoptium.net/job/Grinder/11742/console~~
  • ~~Ubuntu 24.04: https://ci.adoptium.net/job/Grinder/11745/console~~
  • ~~Ubuntu 22.04 container (Forced job to run without sw.tool.docker): https://ci.adoptium.net/job/Grinder/11750/console~~

sxa avatar Nov 22 '24 15:11 sxa

> There are recent dev.hotspot runs which look clean - was this test removed and is it still considered a problem?

ping @sophia-guo @zzambers - is this still a concern?

sxa avatar Nov 29 '24 13:11 sxa

It's a hotspot test, so rerun with hotspot_custom:

  • RHEL7: https://ci.adoptium.net/job/Grinder/11902/console - hang
  • Ubuntu 24.04: https://ci.adoptium.net/job/Grinder/11903/console - passed
  • Ubuntu 22.04 container (Forced job to run without sw.tool.docker): https://ci.adoptium.net/job/Grinder/11904/console - passed

sophia-guo avatar Dec 02 '24 21:12 sophia-guo

> It's a hotspot test, so rerun with hotspot_custom

I don't think I've ever looked at a test that needed that before. Thanks for the pointer. Is there any way I can tell from the name of the test which ones need to use hotspot_custom instead of jdk_custom?

Both of the Ubuntu ones look like they have a pass although it's overall UNSTABLE ... Does this mean it's just not valid for the _1 variant?

21:31:51  TEST TARGETS SUMMARY
21:31:51  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
21:31:51  PASSED test targets:
21:31:51  	hotspot_custom_0 - Test results: passed: 1 
21:31:51  
21:31:51  FAILED test targets:
21:31:51  	hotspot_custom_1

sxa avatar Dec 02 '24 21:12 sxa

The only way to know whether it's a hotspot or a jdk test is to check the test path. If it's under https://github.com/openjdk/jdk11u-dev/tree/master/test/hotspot (jdk11+) or https://github.com/openjdk/jdk8u-dev/tree/master/hotspot/test (jdk8), then it's hotspot. If it's under https://github.com/openjdk/jdk11u-dev/tree/master/test/jdk (jdk11+) or https://github.com/openjdk/jdk8u-dev/tree/master/jdk/test (jdk8), then it's jdk.
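
To illustrate that rule, a hypothetical helper like the one below (the class and method names are made up for this sketch) maps a test's path in the OpenJDK source tree to the target family it needs:

```java
// Hypothetical sketch: pick the TKG target family from a jtreg test's location
// in the OpenJDK source tree, following the path rule described above.
public class TestKind {

    static String targetFamilyFor(String relPath) {
        // jdk11+ source layout
        if (relPath.startsWith("test/hotspot/")) return "hotspot_custom";
        if (relPath.startsWith("test/jdk/"))     return "jdk_custom";
        // jdk8u source layout
        if (relPath.startsWith("hotspot/test/")) return "hotspot_custom";
        if (relPath.startsWith("jdk/test/"))     return "jdk_custom";
        return "unknown";
    }

    public static void main(String[] args) {
        // ClhsdbCDSCore lives under test/hotspot/jtreg/..., so it needs hotspot_custom.
        System.out.println(targetFamilyFor(
                "test/hotspot/jtreg/serviceability/sa/ClhsdbCDSCore.java"));
    }
}
```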

hotspot_custom_1 can be ignored for this test, as CDS only works when the Compressed Oops feature is enabled on jdk14 and earlier (it works with either Compressed Oops configuration on jdk15+). So the test is skipped.
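
If it helps to confirm which configuration a given run corresponds to, a small sketch along these lines (assuming a HotSpot JVM, since HotSpotDiagnosticMXBean is HotSpot-specific) prints the effective UseCompressedOops setting at runtime:

```java
// Sketch: print the effective value of the UseCompressedOops VM flag.
// On jdk14 and earlier the CDS archive is only usable when this is "true",
// which is why the variant with Compressed Oops disabled skips the test.
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class CheckCompressedOops {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println("UseCompressedOops = "
                + hotspot.getVMOption("UseCompressedOops").getValue());
    }
}
```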

sophia-guo avatar Dec 02 '24 21:12 sophia-guo

> * RHEL7: https://ci.adoptium.net/job/Grinder/11902/console - **_hang_**

That looks like it is running on Ubuntu 16.04, not RHEL7 😕


OK so we can reproduce but only on certain OSs. Some more:

| provider/OS | Grinder | Result |
| --- | --- | --- |
| ibmcloud-rhel6 | 11906 | ✅ |
| ibmcloud-rhel7 | 11905 | ✅ |
| docker-centos7 | 11907 | ✅ |
| aws-rhel8 | 11908 | ✅ |
| docker-ubi9 | 11909 | ✅ |
| docker-ubuntu2004 | 11910 | ✅ |

And a few Ubuntus on ppc64le:

| provider/OS | Grinder | Result |
| --- | --- | --- |
| osuosl-ubuntu1604 | 11914 | ✅ |
| osuosl-ubuntu1804 | 11915 | ✅ |
| osuosl-ubuntu2004 | 11916 | ✅ |

sxa avatar Dec 02 '24 22:12 sxa

Yes, on https://ci.adoptium.net/computer/test%2Dibmcloud%2Dubuntu1604%2Dx64%2D1/ it timed out and failed. I just reran the grinder you mentioned here https://github.com/adoptium/infrastructure/issues/3745#issuecomment-2494019804 and had thought the grinder was pinned to RHEL7. Anyway, the test times out on some OSes.

sophia-guo avatar Dec 03 '24 21:12 sophia-guo