Create new jobs to handle running Solaris tests via a Linux proxy machine
This covers the implementation of what has been discussed in https://github.com/adoptium/infrastructure/issues/3742#issuecomment-2529132798
Current status: Prototype jobs have been created for x64 and SPARC at:
- https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simpletest
- https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-sparcv9-temurin-simpletest
These are currently connecting to the target machine as the vagrant user (even on SPARC where I've created a user with that name for consistency).
Jobs are currently set up to run the full AQA suite of tests and archive the artefacts, rather than running each suite as an individual job, but that can be changed later if desired.
Prototype now working (although hard coded to a specific tag). There is a dotests.x64.sh script and a dotests.sparcv9 script on the proxy host; the appropriate one is copied across to the target machine using scp as dotests.sh and then executed.
It needs to be parameterised to be able to take the tag/URL as a parameter (for use when retrieving the artifact directly from jenkins, since we can't use copyArtifact) but otherwise it works.
It currently has a loop which can loop over each suite that is required and then copy the TAP output back to the proxy machine for archiving. The proxy agent is running as the solaris user on dockerhost-azure-ubuntu2204-x64-1
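The flow described above could be sketched roughly as follows. This is an illustration only, not the contents of the real scripts: the `TARGET_HOST` value, the suite list, the remote TAP file name, passing the suite as an argument to `dotests.sh`, and the `DRY_RUN` switch are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch of the proxy-side flow: copy the per-architecture script to the
# target, run it once per suite, and fetch the TAP output for archiving.
# TARGET_HOST, SUITES, the remote TAP path and DRY_RUN are hypothetical.

ARCH="${ARCH:-x64}"                        # x64 or sparcv9
TARGET_HOST="${TARGET_HOST:-vagrant@solaris-target.example}"
SUITES="${SUITES:-sanity.openjdk sanity.system}"
DRY_RUN="${DRY_RUN:-1}"                    # 1 = only print the commands

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_suites() {
    # The script name pattern is an assumption; adjust to the real file names.
    run scp "dotests.${ARCH}.sh" "${TARGET_HOST}:dotests.sh"
    for suite in $SUITES; do
        # Passing the suite as an argument is an assumption about dotests.sh.
        run ssh "$TARGET_HOST" "sh ./dotests.sh $suite"
        run scp "${TARGET_HOST}:${suite}.tap" "./${suite}.tap"
    done
}
```

Calling `run_suites` with the default `DRY_RUN=1` only prints the commands, which makes it easy to check the sequence before pointing it at a real machine.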
Verification (Solaris/x64)
Failures (based on looking at https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simpletest/50/tapResults):
- sanity.openjdk:
  - jdk/lambda/vm/InterfaceAccessFlagsTest.java
- sanity.system (all due to being unable to locate `mauve/mauve.jar`):
  - MauveSingleThrdLoad_HS_5m_0
  - MauveSingleInvocLoad_HS_5m_0
  - MauveMultiThrdLoad_5m_0
- extended.system (all except OAuthTest_0 are due to being unable to locate `mauve/mauve.jar`):
  - MiniMix_5m_0
  - MiniMix_10m_0
  - MiniMix_aot_5m_0
  - OAuthTest_0
- extended.openjdk:
  - jdk_beans_0 (10 failures, mostly font/color related)
  - jdk_security3_0 (`javax/net/ssl/ciphersuites/DisabledAlgorithms.java`)
  - jdk_management_0 (`sun/management/jmxremote/bootstrap/SSLConfigFilePermissionTest.sh`)
  - jdk_imageio_0 (2 failures: `plugins/jpeg/JPEGsNotAcceleratedTest.java` and `javax/imageio/AppletResourceTest.java`)
- special.openjdk:
  - jdk_math_jre_0 (`Error: JDK not found`)
Solaris/SPARC
The SPARC run had mostly the same failures although OAuthTest_0 in the extended.system suite didn't fail as it was skipped. Two other failures:
- extended.openjdk:
  - hotspot_jdk_0: `serviceability/sa/jmap-hashcode/Test8028623.java`
  - jdk_security3_0: `sun/security/ssl/SSLSocketImpl/ClientSocketCloseHang.java` (NOTE: Different failure in this suite from the x64 run)
These test failures were seen in the old Solaris pipelines:
Sparc:
- Test8028623.java Example.
x64:
These test failures appear to be new:
Sparc:
- ClientSocketCloseHang.java Example.
x64: in progress
- InterfaceAccessFlagsTest.java Example.
- jdk_math_jre_0 Example.
- jdk_beans_0 Example
- AppletResourceTest Example
- All the "could not find mauve jar" issues, though we did have an instance of "could not find stf.pl" in those extended tests. Example.
- OAuthTest_0 timeout (connection refused errors are common in the old pipeline, though). Example.
@adamfarley FYI I've brought the "normal" solaris jenkins agents for the test boxes back online in case you want to try anything via Grinder
@adamfarley Also if you're going to run grinders it would probably be good to compare on the last published EA ones and the "new" builds from my pipelines in https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-sparcv9-temurin-simplepipe/ and https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simplepipe/ just in case there's something wrong with the build itself.
Sure thing.
TLDR: No problems on the new build that weren't on the old build (when run in Grinder, anyway).
Details:
x64:
OpenJDK result: All tests passed except for the jdk_math_jre_0 target, which failed in both cases with the same error. Note: The jdk_custom target appears to have run wrong (test/tkg bug, I think) for both builds, so here are the reruns (which all passed): Old build, new build.
System result:
Can't open perl script "/export/home/jenkins/workspace/Grinder/aqa-tests/TKG/../../jvmtest/system/security/..//STF/stf.core/scripts/stf.pl": No such file or directory
This error occurs with both old and new builds, so the framework is equally broken in both cases. :/
sparc:
OpenJDK result: Targets passed, except for math jre, which failed on both builds. Custom reruns are here: Old build, new build.
System result: Same as above. stf.pl not found.
Test jobs are currently being run from the top level "simplepipe" pipelines with propagate: false, because the job fails with an ERROR state if one suite fails, which causes the pipeline not to continue to the following steps, e.g.
15:30:04 TOTAL: 23 EXECUTED: 9 PASSED: 8 FAILED: 1 DISABLED: 0 SKIPPED: 14
15:30:04 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15:30:05
15:30:06 TESTCASES RESULTS SUMMARY: passed: 4,947; failed: 1; error: 0; skipped: 0
15:30:06 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15:30:06 To rebuild the failed test in a jenkins job, copy the following link and fill out the <Jenkins URL> and <FAILED test target>:
15:30:06 <Jenkins URL>/parambuild/?JDK_VERSION=8&JDK_IMPL=hotspot&JDK_VENDOR=temurin&BUILD_LIST=openjdk&PLATFORM=sparcv9_solaris&TARGET=<FAILED test target>
15:30:06
15:30:06 For example, to rebuild the failed tests in <Jenkins URL>=https://ci.adoptium.net/job/Grinder, use the following links:
15:30:06 https://ci.adoptium.net/job/Grinder/parambuild/?JDK_VERSION=8&JDK_IMPL=hotspot&JDK_VENDOR=temurin&BUILD_LIST=openjdk&PLATFORM=sparcv9_solaris&TARGET=jdk_lang_0
15:30:06 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15:30:06 gmake[1]: *** [settings.mk:450: resultsSummary] Error 2
15:30:06 gmake[1]: Leaving directory '/export/home/vagrant/aqa-tests/TKG'
15:30:06 gmake: *** [makefile:62: _sanity.openjdk] Error 2
I will continue to look at resolving that, but in the meantime propagate: false seems to work, although it means the yellow warning status is not shown in the pipeline block.
> OpenJDK result: All tests passed except for the jdk_math_jre_0 target which failed in both cases with the same error

To be clear, are you saying that when running on Grinder the other failures in the openjdk suite do not occur and my failing tests have all passed? If so we need to see why that is.

> These test failures appear to be new:

Is that "new" in the newer builds since the last 8u332 GA, or "new" as in only showing up in my "simple" pipelines and not in the original test jobs?
> > OpenJDK result: All tests passed except for the jdk_math_jre_0 target which failed in both cases with the same error
>
> To be clear, are you saying that when running on Grinder the other failures in the openjdk suite do not occur and my failing tests have all passed? If so we need to see why that is.

Yes, the tests that failed in the simple pipeline have passed in the grinder job (except for jdk_math_jre_0, which fails with both normal builds and simple pipeline builds).

> > These test failures appear to be new:
>
> Is that "new" in the newer builds since the last 8u332 GA, or "new" as in only showing up in my "simple" pipelines and not in the original test jobs?

The "cannot find mauve" issue appears to have been first seen on the simple pipelines, but I can't prove that it only affects the simple pipelines because the grinder can't find stf.pl, and that issue was only seen once before.
Can't open perl script "/export/home/jenkins/workspace/Grinder/aqa-tests/TKG/../../jvmtest/system/security/..//STF/stf.core/scripts/stf.pl": No such file or directory
This was potentially caused by the /tmp/mauve directory which was present on the machine but owned by a different user (The new pipelines are NOT running as the jenkins user). Removing that directory caused the tests to run through properly:
Passing test | Previous failing test
There is still an issue relating to the cannot find mauve.jar which appears to only be happening from the new jobs. I wonder if it's because I'm running without a WORKSPACE variable set, so it's choosing to put things Continuing investigation.
Noting that the parsing of the df output is not quite correct on Solaris. It's giving this message:
Test machine has only 1893 Mb free on drive containing /export/home/local/vagrant/aqa-tests/TKG/../TKG/output_17369506484404/TestJlmLocal_0.
There must be at least 3Gb (3072Mb) free to be sure of capturing diagnostics
files in the event of a test failure.
despite parsing the df output with df_header of:
Filesystem 1024-blocks Used Available Capacity Mounted on
and df_body of
/dev/dsk/c1t0d0s7 23510982 1939385 21336488 9% /export/home
Based on the detection of 1893 Mb free it seems likely that it is parsing the used instead of available field, so as a temporary workaround I'll fill up the space for a while ;-)
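One way to avoid picking the wrong field is to locate the column by its header name rather than by position. The sketch below is not the actual test framework code; `free_mb` is a hypothetical helper, fed the `df_header` and `df_body` values quoted above, and it assumes the sizes are reported in 1024-byte blocks.

```shell
#!/bin/sh
# free_mb HEADER BODY: print the free space in Mb, finding the column by name.
# Hypothetical helper for illustration; assumes sizes are in 1024-byte blocks.
free_mb() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 {
            # Match header spellings such as "Available" or "avail".
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[Aa]vail/) col = i
            next
        }
        col { printf "%d\n", $col / 1024 }'
}

header='Filesystem 1024-blocks Used Available Capacity Mounted on'
body='/dev/dsk/c1t0d0s7 23510982 1939385 21336488 9% /export/home'
free_mb "$header" "$body"
```

For the values above this prints 20836. Dividing the Used field instead (1939385 / 1024) gives 1893, which matches the figure in the warning and supports the theory that the wrong column is being read.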
I think the problem with stf.pl may have disappeared by either:
- removing `/opt/xpg4/bin` from the PATH (where the "other" `df` command was).
- adding a definition of `$WORKSPACE` pointing to `$HOME/workspace`, although I'm not convinced it is doing much with that.

I've also added smoke test functionality to the test job, which is described by the commands being added to the FAQ in https://github.com/adoptium/infrastructure/pull/3860, so that is now also being run.
https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simpletest/62/console is running with the smoke tests and sanity.system
Notes:
- The smoke test on Solaris only runs `Java_Version_0` and not `Adopt_HS_FeatureTests_0`
- The installed version of `ant` in /usr/local/bin is not used by the system tests. stf seems to ALWAYS download version 1.10.2 (earlier than what we now install elsewhere). It goes into /var/tmp
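The two environment changes mentioned above (dropping the extra directory from PATH and defining WORKSPACE) could be expressed along these lines. This is a sketch, not the contents of the real `dotests.sh`; `strip_path_entry` is a hypothetical helper and the directory name is taken as written in the comment above.

```shell
#!/bin/sh
# strip_path_entry DIR: print $PATH with every occurrence of DIR removed.
# Hypothetical helper; the real script may simply set PATH explicitly.
strip_path_entry() {
    new=""
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # Skip the unwanted entry, keep everything else in order.
        [ "$dir" = "$1" ] && continue
        new="${new:+$new:}$dir"
    done
    IFS=$old_ifs
    printf '%s\n' "$new"
}

# Directory as quoted in the comment above.
PATH=$(strip_path_entry /opt/xpg4/bin)
# Only set WORKSPACE if the environment has not already provided one.
WORKSPACE="${WORKSPACE:-$HOME/workspace}"
export PATH WORKSPACE
```

Filtering PATH rather than overwriting it keeps whatever else the login environment provides, at the cost of a few more lines.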
The simpletest jobs are now back to the original mauve.jar error on the three mauve tests, on both x64 and SPARC:
FAILED test targets:
MauveSingleThrdLoad_HS_5m_0
MauveSingleInvocLoad_HS_5m_0
MauveMultiThrdLoad_5m_0
The mauve references in the log are as follows:
21:11:44 check-if-already-built:
21:11:44 [echo] Checking if /export/home/vagrant/aqa-tests/systemtest_prereqs/mauve/mauve.jar already exists
21:11:44 [echo] openjdk_test_mauve_already_built is ${openjdk_test_mauve_already_built}
[...]
21:11:46 check-if-work_jar_file-exists:
21:11:46 [echo] Checking if /var/tmp//mauve/mauve.jar exists
21:11:46 [echo] openjdk_test_mauve_work_jar_file_exists is ${openjdk_test_mauve_work_jar_file_exists}
[...]
21:12:01 GEN 16:27:31.003 - Using Mode NoOptions. Values = ''
21:12:01 GEN stderr Exception in thread "main" net.adoptopenjdk.stf.StfException: Note: file 'mauve/mauve.jar' could not be found in any of the supplied test roots: '/export/home/vagrant/jvmtest/system/systemtest_prereqs'
The original sanity.system jobs are still failing to locate stf.pl:
12:55:06 Can't open perl script "/export/home/jenkins/workspace/Test_openjdk8_hs_sanity.system_x86-64_solaris/aqa-tests/TKG/../../jvmtest/system/mauveLoadTest/..//STF/stf.core/scripts/stf.pl": No such file or directory
12:55:06 -----------------------------------
12:55:06 MauveSingleThrdLoad_HS_5m_0_FAILED
12:55:06 -----------------------------------
Also from simpletest#63 - this is coming from the code in https://github.com/adoptium/aqa-systemtest/blob/5279e7ee7ddf2a4381f8e5c650b4c13b239f00f7/openjdk.test.mauve/build.xml#L362 and may indicate a CVS retrieval issue:
delete-work-dir:
[delete] Deleting directory /var/tmp/mauve
create-work-dir:
[mkdir] Created dir: /var/tmp/mauve
get-source:
[exec] Could not read password for host: java.io.FileNotFoundException: /export/home/local/vagrant/.cvspass (No such file or directory)
[exec] Cannot connect to host sourceware.org:2401.
[exec] Result: 1
check-if-source-available:
[echo] Checking if /var/tmp//mauve/mauve/gnu/testlet/config.java.in exists
[echo] mauve_source_available is ${mauve_source_available}
By comparison, this is from a passing run on Linux/x64:
21:44:33 delete-work-dir:
21:44:33
21:44:33 create-work-dir:
21:44:33
21:44:33 get-source:
21:44:33
21:44:33 check-if-source-available:
21:44:33 [echo] Checking if /tmp/mauve/mauve/gnu/testlet/config.java.in exists
21:44:33 [echo] mauve_source_available is ${mauve_source_available}
21:44:33
Note also that the inability to resolve some of these variables appears unique to the recent runs of simpletest and did not occur in the last successful "normal" Solaris/x64 run at https://ci.adoptium.net/job/Test_openjdk8_hs_sanity.system_x86-64_solaris/401/consoleFull
There are a small number of test cases (mostly in java_beans) which are failing due to the absence of a DISPLAY variable - I've started an Xvfb on :5 on both machines and adjusted the dotests.sh script to point at that so hopefully the next runs will be better.
Summary from performing triage on the January dry-runs using builds created with the new pipelines:
- As per the last comment, Xvfb wasn't being started as that is typically done via the pipelines. This was causing a small number of tests in `java_beans_0` and `java_imageio_0` to fail. To mitigate that for now I have manually started an `Xvfb` on display `:5` and hard coded the setting of `DISPLAY=:5` in the environment in `dotests.sh`
- Some of the system tests (Mauve in sanity, MiniMix in extended) require `mauve.jar` to be present and that is not put in place by the normal `make compile; make TARGET` process, so I have added in an explicit `curl` of mauve from the `systemtest.getDependency` job:
curl -o `pwd`/aqa-tests/systemtest_prereqs/mauve/mauve.jar \
https://ci.adoptium.net/job/systemtest.getDependency/lastSuccessfulBuild/artifact/systemtest_prereqs/mauve/mauve.jar
This should significantly reduce the number of test failures that are outstanding.
Latest version of the dotests.sh script which is used by the simpletest jobs is: dotests.sh.txt
Note that this currently relies on the Xvfb already being run as DISPLAY :5
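A guarded start-up along the following lines is one way to remove the dependency on a manually started server. It is a sketch only: `ensure_xvfb` is a hypothetical helper, and using `pgrep -f` to detect an existing server on the display is an assumption rather than what the job actually does.

```shell
#!/bin/sh
# ensure_xvfb [DISPLAY]: start Xvfb on the given display if it is installed
# and not already running, then export DISPLAY. Hypothetical helper.
ensure_xvfb() {
    display="${1:-:5}"
    if command -v Xvfb >/dev/null 2>&1; then
        # Only start a server if one is not already running on that display.
        if ! pgrep -f "Xvfb $display" >/dev/null 2>&1; then
            Xvfb "$display" >/dev/null 2>&1 &
        fi
    else
        echo "Xvfb not found on PATH; tests needing a display may fail" >&2
    fi
    DISPLAY="$display"
    export DISPLAY
}
```

Calling `ensure_xvfb :5` before the test loop would then cover both the case where the server is already up and the case where it has not been started since the last reboot.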
Grinders:
- Solaris/x64 8u442-b05 passed `InterfaceAccessFlagsTest` and ClientSocketCloseHang but not `CancelledLockLoops`, `SSLEngineExplorerWithCli`, `DisabledAlgorithms`, `SSLEngineDeadlock` and `ClientSocketCloseHang`
- Solaris/SPARC 8u442-b05 passed 10 iterations of CancelledLockLoops, `SSLEngineExplorerWithCli`, `DisabledAlgorithms`, `InterfaceAccessFlagsTest` and `SSLEngineDeadlock` but only 2/20 of `ClientSocketCloseHang`
- Solaris/x64 passed `Test8028623`
- Solaris/SPARC failed 2/10 of `Test8028623`
- Solaris/x64: Passed `CancelledLockLoops`, `InterfaceAccessFlagsTest`, `ClientSocketCloseHang`, FAILED `DisabledAlgorithms`, Failed 2/10 of `SSLEngineExplorerWithCli` and `SSLEngineDeadlock`
Final version of the dotests script:
Other issues that have been seen are covered in the x64 and SPARC AQA triage issues for the January release.
I have raised issues for DisabledAlgorithms, InterfaceAccessFlagsTest and SSLConfigFilePermissionTest.
New run using the latest 8u452 build:
Previous x64 run which didn't have an X display available
[Solaris/x64 failures](https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simpletest/78/tapTestReport/):
- jdk_lang_0 `jdk/lambda/vm/InterfaceAccessFlagsTest.java`
- jdk_util_0 `java/util/concurrent/locks/ReentrantLock/CancelledLockLoops.java` and `java/util/Currency/ValidateISO4217.java`
- jdk_beans_0 Missing display :5
- jdk_security3_0 `javax/net/ssl/ciphersuites/DisabledAlgorithms.java`, `sun/security/ssl/X509TrustManagerImpl/distrust/Camerfirma.java`, `sun/security/ssl/X509TrustManagerImpl/distrust/Entrust.java`, `sun/security/ssl/X509TrustManagerImpl/distrust/Symantec.java`
- jdk_management_0 `sun/management/jmxremote/bootstrap/SSLConfigFilePermissionTest.sh`
- jdk_imageio_0 Missing display :5

Re-running with an Xvfb startup fix at https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simpletest/88/
Solaris/x64 failures
- jdk_lang_0: jdk/lambda/vm/InterfaceAccessFlagsTest.java
- jdk_util_0: java/util/Currency/ValidateISO4217.java
- jdk_security3_0:
- TEST: javax/net/ssl/ciphersuites/DisabledAlgorithms.java
- TEST: sun/security/ssl/X509TrustManagerImpl/distrust/Camerfirma.java
- TEST: sun/security/ssl/X509TrustManagerImpl/distrust/Entrust.java
- TEST: sun/security/ssl/X509TrustManagerImpl/distrust/Symantec.java
- jdk_management_0: sun/management/jmxremote/bootstrap/SSLConfigFilePermissionTest.sh
- jdk_lang_0 `jdk/lambda/vm/InterfaceAccessFlagsTest.java`
- jdk_util_0 `java/util/Currency/ValidateISO4217.java`
- jdk_security3_0 Failures with `sun/security/ssl/SSLSocketImpl/ClientSocketCloseHang.java`, `sun/security/ssl/X509TrustManagerImpl/distrust/Camerfirma.java`, `sun/security/ssl/X509TrustManagerImpl/distrust/Entrust.java`, `sun/security/ssl/X509TrustManagerImpl/distrust/Symantec.java` including `08:13:58 ACTION: main -- Failed. Execution failed: 'main' threw exception: java.lang.RuntimeException: Unexpected exception: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target`

NOTE: `Camerfirma` and `Entrust` tests have now been removed under https://github.com/adoptium/aqa-tests/pull/6186/files
SPARC Grinder with the three failed targets: https://ci.adoptium.net/job/Grinder/12879/ passed jdk_lang_0 (InterfaceAccessFlagsTest) as expected (based on the next comment!) and passed some but not all of the X509TrustManagerImpl tests. Will run Grinder for 100 iterations with just those tests listed here, along with the InterfaceAccessFlags one and one or two of the other jdk_security3_0 failures we had from the non-Grinder run, to see how reliable the failures are.
12:42:00 PASSED test targets:
12:42:00 jdk_lang_0 - Test results: passed: 474
12:42:00
12:42:00 FAILED test targets:
12:42:00 jdk_security3_0 - Test results: passed: 606; failed: 1
12:42:00 Failed test cases:
12:42:00 TEST: sun/security/ssl/X509TrustManagerImpl/Entrust/Distrust.java
12:42:00
12:42:00 jdk_util_0 - Test results: passed: 675; failed: 2
12:42:00 Failed test cases:
12:42:00 TEST: java/util/Currency/ValidateISO4217.java
12:42:00 TEST: java/util/TimeZone/AssureTzdataVersion.java
100 run Grinder: https://ci.adoptium.net/job/Grinder/12881/console - FAILED due to it not liking sun/security/ssl/X509TrustManagerImpl/distrust/Symantec.java in the CUSTOM_TARGET field (maybe it needs hotspot_custom, or we've hit a rename somewhere?) so running without it at https://ci.adoptium.net/job/Grinder/12882/console for now:
AssureTzdataVersion and Entrust/Distrust both fail on Linux/aarch64 so are not specific to these Solaris runs:
Relevant comments from a slack thread where this was analysed for the January cycle - noting that the java_util failures are not covered by this:
I've managed to beat almost everything into submission but there are two tests putting up a bit of a fight plus one erroneous (?) suite:
- ~~jdk_math_jre_0 got kicked off as part of my runs of the normal test targets but it doesn't seem to work. I suspect that one's not supposed to have been run so likely user error.~~
- jdk/lambda/vm/InterfaceAccessFlagsTest.java fails on both x64 and SPARC although I do now have what appears to be a pass on SPARC here. We'll see if a re-run of that job on x64 runs when this Grinder completes. Noting that this test has been excluded for linux-all
- javax/net/ssl/ciphersuites/DisabledAlgorithms.java from jdk_security3_0 is putting up more of a fight (Passes on SPARC, not x64). The error is: Expected SSL exception not thrown on server side. It looks like it used to pass with the material from 8u422 and earlier but now doesn't, so I'm not sure why it didn't show up in the October release triage. Maybe it just got missed. Maybe it's a security.properties issue, although I'd expect it to be consistent across architectures if that was the case.
Looks like the InterfaceAccessFlagsTest Grinder has some passes in it (Up to about 30 iterations of 100) so I think we can probably cross that one off the list. ~~Only question mark is therefore on DisabledAlgorithms.java which is not a regression from 8u432~~ EDIT: https://github.com/adoptium/aqa-tests/issues/5915 covers an upstream change that caused this to stop working
Some resources for the ValidateISO4217 test failure - there is nothing recent but the comments might indicate a test material sync issue:
- https://github.com/adoptium/aqa-tests/issues/678
- https://github.com/adoptium/aqa-tests/issues/3640#issuecomment-1151717530
CancelledLockLoops was noted as a one-off failure on Linux/arm32 JDK11 at https://github.com/adoptium/aqa-tests/issues/5691#issuecomment-2418402821 in the October 2024 release cycle - hopefully that will be a one-off here too.
SSLConfigFilePermissionTest is the issue with the port being blocked by the dhcpagent on Solaris/x64 so not considered a functional problem.
Note: java/util/Currency/ValidateISO4217.java is an upstream platform-neutral issue. Exclude requested at https://github.com/adoptium/aqa-tests/pull/6179 so that can be ignored for the purposes of this.
The other two failures are also new since the last GA so may be related to the ValidateISO4217 failure. I have also verified at https://ci.adoptium.net/job/Grinder/12890/console that these are NOT specific to Solaris.
Easy way to summarise the consoleText log:
awk '/FAILED test targets:/,/TOTAL:/'
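As a usage illustration, the range pattern can be wrapped in a small function and fed a console log on stdin. `summarise_failures` is a hypothetical name chosen for this example.

```shell
#!/bin/sh
# summarise_failures: print everything from each "FAILED test targets:" line
# through the next "TOTAL:" line, reading the console log on stdin.
summarise_failures() {
    awk '/FAILED test targets:/,/TOTAL:/'
}
```

For example, `curl -s <job URL>/consoleText | summarise_failures`, where `<job URL>` is the URL of the run being triaged. Because awk range patterns restart after each closing match, a log that runs several suites will print one block per suite.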
Additional work to allow downloading of restricted release build jobs is being done under https://github.com/adoptium/temurin-build/pull/4253 by copying the job artefacts to the proxy machine and then 'scp'ing them across to the target. This will close off the final action for the test jobs.
Both simpletest jobs have been adjusted to include a copyartifacts step and take UPSTREAM_JOB_NAME/UPSTREAM_JOB_NUMBER as parameters.
This is complete but apparently I didn't close the issue previously :-)