temurin-build

Create new jobs to handle running Solaris tests via a Linux proxy machine

Open sxa opened this issue 1 year ago • 22 comments

This covers the implementation of what has been discussed in https://github.com/adoptium/infrastructure/issues/3742#issuecomment-2529132798

Current status: Prototype jobs have been created for x64 and SPARC at:

  • https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simpletest
  • https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-sparcv9-temurin-simpletest

These are currently connecting to the target machine as the vagrant user (even on SPARC where I've created a user with that name for consistency).

The jobs are currently set up to run the full AQA suite of tests and archive the artefacts, rather than running each suite as an individual job, but that can be changed later if desired.

sxa avatar Dec 20 '24 12:12 sxa

Prototype now working (although hard coded to a specific tag). There is a dotests.x64.sh script and a dotests.sparcv9 script on the proxy host; the appropriate one is copied across to the target machine using scp as dotests.sh and then executed.

It needs to be parameterised to take the tag/URL as a parameter (for use when retrieving the artifact directly from Jenkins, since we can't use copyArtifact) but otherwise it works. It currently loops over each required suite and then copies the TAP output back to the proxy machine for archiving. The proxy agent is running as the solaris user on dockerhost-azure-ubuntu2204-x64-1
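
The flow described above (copy the per-architecture script to the target as dotests.sh, run it for each suite, then bring the TAP output back) might look roughly like this from the proxy side. This is only a sketch under assumptions, not the real job: the target host name, the suite list, the script naming pattern, the idea that dotests.sh takes a suite name as its first argument, and the location of the TAP files are all hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: TARGET, ARCH, SUITES and TAP_GLOB are hypothetical placeholders.
TARGET="${TARGET:-vagrant@solaris-target.example}"
ARCH="${ARCH:-x64}"
SUITES="${SUITES:-sanity.openjdk sanity.system}"
TAP_GLOB="${TAP_GLOB:-aqa-tests/TKG/output_*/*.tap}"
RUN="${RUN:-}"   # RUN=echo prints each command instead of running it

proxy_run_tests() {
    # Copy the per-architecture script to the target under a fixed name.
    $RUN scp "dotests.${ARCH}.sh" "${TARGET}:dotests.sh" || return 1
    for suite in $SUITES; do
        # Assumption: dotests.sh accepts the suite name as its first argument.
        # A failing suite does not stop the loop, so later suites still run.
        $RUN ssh "$TARGET" "sh ./dotests.sh $suite"
        # Copy the TAP output back to the proxy so the job can archive it.
        $RUN scp "${TARGET}:${TAP_GLOB}" .
    done
}
```

Running it with RUN=echo set first is a cheap way to check the generated scp/ssh commands before pointing it at a real machine.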

sxa avatar Jan 07 '25 18:01 sxa

Verification (Solaris/x64)

Failures (based on looking at https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simpletest/50/tapResults):

  • sanity.openjdk:
    • jdk/lambda/vm/InterfaceAccessFlagsTest.java
  • sanity.system (all due to being unable to locate mauve/mauve.jar):
    • MauveSingleThrdLoad_HS_5m_0
    • MauveSingleInvocLoad_HS_5m_0
    • MauveMultiThrdLoad_5m_0
  • extended.system (all except OAuthTest_0 are due to being unable to locate mauve/mauve.jar):
    • MiniMix_5m_0
    • MiniMix_10m_0
    • MiniMix_aot_5m_0
    • OAuthTest_0
  • extended.openjdk:
    • jdk_beans_0 (10 failures - mostly font/color related)
    • jdk_security3_0 (javax/net/ssl/ciphersuites/DisabledAlgorithms.java)
    • jdk_management_0 (sun/management/jmxremote/bootstrap/SSLConfigFilePermissionTest.sh)
    • jdk_imageio_0 (2 failures - plugins/jpeg/JPEGsNotAcceleratedTest.java and javax/imageio/AppletResourceTest.java)
  • special.openjdk:
    • jdk_math_jre_0 (Error: JDK not found)

Solaris/SPARC

The SPARC run had mostly the same failures although OAuthTest_0 in the extended.system suite didn't fail as it was skipped. Two other failures:

  • extended.openjdk:
    • hotspot_jdk_0: serviceability/sa/jmap-hashcode/Test8028623.java
    • jdk_security3_0: sun/security/ssl/SSLSocketImpl/ClientSocketCloseHang.java (NOTE: Different failure in this suite from the x64 run)

sxa avatar Jan 10 '25 09:01 sxa

These test failures were seen in the old Solaris pipelines:

Sparc:

x64:


These test failures appear to be new:

Sparc:

  • ClientSocketCloseHang.java Example.

x64: in progress

  • InterfaceAccessFlagsTest.java Example.
  • jdk_math_jre_0 Example.
  • jdk_beans_0 Example
  • AppletResourceTest Example
  • All the "could not find mauve jar" issues, though we did have an instance of "could not find stf.pl" in those extended tests. Example.
  • OAuthTest_0 timeout (connection refused errors are common in the old pipeline, though). Example.

adamfarley avatar Jan 10 '25 11:01 adamfarley

@adamfarley FYI I've brought the "normal" solaris jenkins agents for the test boxes back online in case you want to try anything via Grinder

sxa avatar Jan 10 '25 11:01 sxa

@adamfarley Also if you're going to run grinders it would probably be good to compare on the last published EA ones and the "new" builds from my pipelines in https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-sparcv9-temurin-simplepipe/ and https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simplepipe/ just in case there's something wrong with the build itself.

sxa avatar Jan 10 '25 12:01 sxa

Sure thing.

TLDR: No problems on the new build that weren't on the old build (when run in Grinder, anyway).

Details:

x64:

OpenJDK result: All tests passed except for the jdk_math_jre_0 target which failed in both cases with the same error. Note: The jdk_custom target appears to have run incorrectly (test/TKG bug, I think) for both builds, so here are the reruns (which all passed): Old build, new build.

System result:

Can't open perl script "/export/home/jenkins/workspace/Grinder/aqa-tests/TKG/../../jvmtest/system/security/..//STF/stf.core/scripts/stf.pl": No such file or directory

This error occurs with both old and new builds, so the framework is equally broken in both cases. :/

sparc:

OpenJDK result: Targets passed, except for math jre, which failed on both builds. Custom reruns are here: Old build, new build.

System result: Same as above. stf.pl not found.

adamfarley avatar Jan 10 '25 12:01 adamfarley

Test jobs are currently being run from the top level "simplepipe" pipelines with propagate: false, because the job fails with an ERROR state if one suite fails, which stops the pipeline from continuing to the following steps, e.g.

15:30:04 TOTAL: 23   EXECUTED: 9   PASSED: 8   FAILED: 1   DISABLED: 0   SKIPPED: 14
15:30:04 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15:30:05 
15:30:06 TESTCASES RESULTS SUMMARY: passed: 4,947; failed: 1; error: 0; skipped: 0
15:30:06 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15:30:06 To rebuild the failed test in a jenkins job, copy the following link and fill out the <Jenkins URL> and <FAILED test target>:
15:30:06 <Jenkins URL>/parambuild/?JDK_VERSION=8&JDK_IMPL=hotspot&JDK_VENDOR=temurin&BUILD_LIST=openjdk&PLATFORM=sparcv9_solaris&TARGET=<FAILED test target>
15:30:06 
15:30:06 For example, to rebuild the failed tests in <Jenkins URL>=https://ci.adoptium.net/job/Grinder, use the following links:
15:30:06 https://ci.adoptium.net/job/Grinder/parambuild/?JDK_VERSION=8&JDK_IMPL=hotspot&JDK_VENDOR=temurin&BUILD_LIST=openjdk&PLATFORM=sparcv9_solaris&TARGET=jdk_lang_0
15:30:06 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15:30:06 gmake[1]: *** [settings.mk:450: resultsSummary] Error 2
15:30:06 gmake[1]: Leaving directory '/export/home/vagrant/aqa-tests/TKG'
15:30:06 gmake: *** [makefile:62: _sanity.openjdk] Error 2

I will continue to look at resolving that, but in the meantime propagate: false seems to work, although it means the yellow warning status is not shown in the pipeline block.

sxa avatar Jan 13 '25 10:01 sxa

> OpenJDK result: All tests passed except for the jdk_math_jre_0 target which failed in both cases with the same error

To be clear, are you saying that when running on Grinder the other failures in the openjdk suite do not occur and my failing tests have all passed? If so we need to see why that is.

> These test failures appear to be new:

Is that "new" in the newer builds since the last 8u332 GA, or "new" as in only showing up in my "simple" pipelines and not in the original test jobs?

sxa avatar Jan 13 '25 10:01 sxa

> > OpenJDK result: All tests passed except for the jdk_math_jre_0 target which failed in both cases with the same error

> To be clear, are you saying that when running on Grinder the other failures in the openjdk suite do not occur and my failing tests have all passed? If so we need to see why that is.

Yes, the tests that failed in the simple pipeline have passed in the Grinder job (except for jdk_math_jre_0, which fails with both normal builds and simple pipeline builds).

> > These test failures appear to be new:

> Is that "new" in the newer builds since the last 8u332 GA, or "new" as in only showing up in my "simple" pipelines and not in the original test jobs?

The "cannot find mauve" issue appears to have been first seen on the simple pipelines, but I can't prove that it only affects the simple pipelines because the grinder can't find stf.pl, and that issue was only seen once before.

adamfarley avatar Jan 14 '25 14:01 adamfarley

Can't open perl script "/export/home/jenkins/workspace/Grinder/aqa-tests/TKG/../../jvmtest/system/security/..//STF/stf.core/scripts/stf.pl": No such file or directory

This was potentially caused by the /tmp/mauve directory, which was present on the machine but owned by a different user (the new pipelines are NOT running as the jenkins user). Removing that directory caused the tests to run through properly: Passing test | Previous failing test

There is still an issue relating to the "cannot find mauve.jar" error, which appears to be happening only from the new jobs. I wonder if it's because I'm running without a WORKSPACE variable set, so it's choosing to put things

Continuing investigation.

sxa avatar Jan 14 '25 17:01 sxa

Noting that the parsing of the df output is not quite correct on Solaris. It's giving this message:

Test machine has only 1893 Mb free on drive containing /export/home/local/vagrant/aqa-tests/TKG/../TKG/output_17369506484404/TestJlmLocal_0.

There must be at least 3Gb (3072Mb) free to be sure of capturing diagnostics
files in the event of a test failure.

despite parsing the df output with df_header of:

Filesystem           1024-blocks        Used   Available Capacity  Mounted on

and df_body of

/dev/dsk/c1t0d0s7       23510982     1939385    21336488     9%    /export/home

Based on the detection of 1893 Mb free it seems likely that it is parsing the used instead of available field, so as a temporary workaround I'll fill up the space for a while ;-)
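
The arithmetic is consistent with that suspicion: 1939385 blocks of 1024 bytes in the Used column is about 1893 MB, while the 21336488 in the Available column is about 20836 MB. One way to avoid depending on a fixed field position is to find the column by its header name. The following is a minimal sketch of that idea and not the actual STF parsing code; the helper name df_column_mb is made up for illustration.

```shell
# Sketch only: df_column_mb is a hypothetical helper, not part of STF.
# Usage: df_column_mb "<df header line>" "<df body line>" <column name>
# Prints the named column converted from 1024-byte blocks to whole MB.
df_column_mb() {
    printf '%s\n%s\n' "$1" "$2" | awk -v want="$3" '
        # Header line: remember which field carries the wanted column name.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == want) col = i }
        # Body line: convert that field from KB to MB, rounding down.
        NR == 2 && col { print int($col / 1024) }'
}

header='Filesystem           1024-blocks        Used   Available Capacity  Mounted on'
body='/dev/dsk/c1t0d0s7       23510982     1939385    21336488     9%    /export/home'

df_column_mb "$header" "$body" Available   # prints 20836
df_column_mb "$header" "$body" Used        # prints 1893
```

Note that "Mounted on" splits into two header fields, so this by-name lookup is only reliable for the numeric columns that come before it.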

sxa avatar Jan 15 '25 14:01 sxa

I think the problem with stf.pl may have disappeared by either:

  • removing /opt/xpg4/bin from the PATH (where the "other" df command was).
  • adding a definition of $WORKSPACE pointing to $HOME/workspace although I'm not convinced it is doing much with that.

I've also added smoke test functionality to the test job, which is described by the commands being added to the FAQ in https://github.com/adoptium/infrastructure/pull/3860, so that is now also being run.

https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simpletest/62/console is running with the smoke tests and sanity.system

Notes:

  1. The smoke test on Solaris only runs Java_Version_0 and not Adopt_HS_FeatureTests_0
  2. The installed version of ant in /usr/local/bin is not used by the system tests. stf seems to ALWAYS download version 1.10.2 (earlier than what we now install elsewhere). It goes into /var/tmp

sxa avatar Jan 15 '25 18:01 sxa

The simpletest jobs are now back to the original mauve.jar error on both x64 and SPARC, on the three Mauve tests:

FAILED test targets:
	MauveSingleThrdLoad_HS_5m_0
	MauveSingleInvocLoad_HS_5m_0
	MauveMultiThrdLoad_5m_0

The mauve references in the log are as follows:

21:11:44 check-if-already-built:
21:11:44      [echo] Checking if /export/home/vagrant/aqa-tests/systemtest_prereqs/mauve/mauve.jar already exists
21:11:44      [echo] openjdk_test_mauve_already_built is ${openjdk_test_mauve_already_built}
[...]
21:11:46 check-if-work_jar_file-exists:
21:11:46      [echo] Checking if /var/tmp//mauve/mauve.jar exists
21:11:46      [echo] openjdk_test_mauve_work_jar_file_exists is ${openjdk_test_mauve_work_jar_file_exists}
[...]
21:12:01 GEN 16:27:31.003 - Using Mode NoOptions. Values = ''
21:12:01 GEN stderr Exception in thread "main" net.adoptopenjdk.stf.StfException: Note: file 'mauve/mauve.jar' could not be found in any of the supplied test roots: '/export/home/vagrant/jvmtest/system/systemtest_prereqs'

The original sanity.system jobs are still failing to locate stf.pl:

12:55:06  Can't open perl script "/export/home/jenkins/workspace/Test_openjdk8_hs_sanity.system_x86-64_solaris/aqa-tests/TKG/../../jvmtest/system/mauveLoadTest/..//STF/stf.core/scripts/stf.pl": No such file or directory
12:55:06  -----------------------------------
12:55:06  MauveSingleThrdLoad_HS_5m_0_FAILED
12:55:06  -----------------------------------

Also from simpletest#63 - this is coming from the code in https://github.com/adoptium/aqa-systemtest/blob/5279e7ee7ddf2a4381f8e5c650b4c13b239f00f7/openjdk.test.mauve/build.xml#L362 and may indicate a CVS retrieval issue:

delete-work-dir:
   [delete] Deleting directory /var/tmp/mauve

create-work-dir:
    [mkdir] Created dir: /var/tmp/mauve

get-source:
     [exec] Could not read password for host: java.io.FileNotFoundException: /export/home/local/vagrant/.cvspass (No such file or directory)
     [exec] Cannot connect to host sourceware.org:2401.
     [exec] Result: 1

check-if-source-available:
     [echo] Checking if /var/tmp//mauve/mauve/gnu/testlet/config.java.in exists
     [echo] mauve_source_available is ${mauve_source_available}

By comparison, this is from a passing run on Linux/x64:

21:44:33  delete-work-dir:
21:44:33  
21:44:33  create-work-dir:
21:44:33  
21:44:33  get-source:
21:44:33  
21:44:33  check-if-source-available:
21:44:33       [echo] Checking if /tmp/mauve/mauve/gnu/testlet/config.java.in exists
21:44:33       [echo] mauve_source_available is ${mauve_source_available}
21:44:33 

Note also that the inability to resolve some of these variables appears unique to the recent runs of simpletest and did not occur in the last successful "normal" Solaris/x64 run at https://ci.adoptium.net/job/Test_openjdk8_hs_sanity.system_x86-64_solaris/401/consoleFull

sxa avatar Jan 15 '25 23:01 sxa

There are a small number of test cases (mostly in jdk_beans) which are failing due to the absence of a DISPLAY variable. I've started an Xvfb on :5 on both machines and adjusted the dotests.sh script to point at that, so hopefully the next runs will be better.

sxa avatar Jan 16 '25 16:01 sxa

Summary from performing triage on the January dry-runs using builds created with the new pipelines:

  • As per the last comment, Xvfb wasn't being started as that is typically done via the pipelines. This was causing a small number of tests in jdk_beans_0 and jdk_imageio_0 to fail. To mitigate that for now I have manually started an Xvfb on display :5 and hard coded the setting of DISPLAY=:5 in the environment in dotests.sh
  • Some of the system tests (Mauve in sanity, MiniMix in extended) require mauve.jar to be present and that is not put in place by the normal make compile; make TARGET process, so I have added in an explicit curl of mauve from the systemtest.getDependency job:
curl -o `pwd`/aqa-tests/systemtest_prereqs/mauve/mauve.jar \
    https://ci.adoptium.net/job/systemtest.getDependency/lastSuccessfulBuild/artifact/systemtest_prereqs/mauve/mauve.jar

This should significantly reduce the number of test failures that are outstanding.

sxa avatar Jan 17 '25 10:01 sxa

Latest version of the dotests.sh script which is used by the simpletest jobs is: dotests.sh.txt

Note that this currently relies on an Xvfb already running on display :5

sxa avatar Jan 17 '25 18:01 sxa

Grinders:

  • Solaris/x64 8u442-b05 passed InterfaceAccessFlagsTest and ClientSocketCloseHang but not CancelledLockLoops, SSLEngineExplorerWithCli, DisabledAlgorithms, SSLEngineDeadlock and ClientSocketCloseHang
  • Solaris/SPARC 8u442-b05 passed 10 iterations of CancelledLockLoops, SSLEngineExplorerWithCli, DisabledAlgorithms, InterfaceAccessFlagsTest and SSLEngineDeadlock but only 2/20 of ClientSocketCloseHang
  • Solaris/x64 passed Test8028623
  • Solaris/SPARC failed 2/10 of Test8028623
  • Solaris/x64: passed CancelledLockLoops, InterfaceAccessFlagsTest and ClientSocketCloseHang; failed DisabledAlgorithms; failed 2/10 of SSLEngineExplorerWithCli and SSLEngineDeadlock

sxa avatar Jan 26 '25 12:01 sxa

Final version of the dotests script:

dotests.sh.txt

Other issues that have been seen are covered in the x64 and SPARC AQA triage issues for the January release.

I have raised issues for DisabledAlgorithms, InterfaceAccessFlagsTest and SSLConfigFilePermissionTest.

sxa avatar Jan 29 '25 17:01 sxa

New run using the latest 8u452 build:

Previous x64 run, which didn't have an X display available: [Solaris/x64 failures](https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simpletest/78/tapTestReport/):

  • jdk_lang_0: jdk/lambda/vm/InterfaceAccessFlagsTest.java
  • jdk_util_0: java/util/concurrent/locks/ReentrantLock/CancelledLockLoops.java and java/util/Currency/ValidateISO4217.java
  • jdk_beans_0: Missing display :5
  • jdk_security3_0: javax/net/ssl/ciphersuites/DisabledAlgorithms.java, sun/security/ssl/X509TrustManagerImpl/distrust/Camerfirma.java, sun/security/ssl/X509TrustManagerImpl/distrust/Entrust.java, sun/security/ssl/X509TrustManagerImpl/distrust/Symantec.java
  • jdk_management_0: sun/management/jmxremote/bootstrap/SSLConfigFilePermissionTest.sh
  • jdk_imageio_0: Missing display :5

Re-running with an Xvfb startup fix at https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simpletest/88/

Solaris/x64 failures

  • jdk_lang_0: jdk/lambda/vm/InterfaceAccessFlagsTest.java
  • jdk_util_0: java/util/Currency/ValidateISO4217.java
  • jdk_security3_0:
    • TEST: javax/net/ssl/ciphersuites/DisabledAlgorithms.java
    • TEST: sun/security/ssl/X509TrustManagerImpl/distrust/Camerfirma.java
    • TEST: sun/security/ssl/X509TrustManagerImpl/distrust/Entrust.java
    • TEST: sun/security/ssl/X509TrustManagerImpl/distrust/Symantec.java
  • jdk_management_0: sun/management/jmxremote/bootstrap/SSLConfigFilePermissionTest.sh

Solaris/SPARC failures:

  • jdk_lang_0: jdk/lambda/vm/InterfaceAccessFlagsTest.java
  • jdk_util_0: java/util/Currency/ValidateISO4217.java
  • jdk_security3_0:
    • sun/security/ssl/SSLSocketImpl/ClientSocketCloseHang.java
    • sun/security/ssl/X509TrustManagerImpl/distrust/Camerfirma.java
    • sun/security/ssl/X509TrustManagerImpl/distrust/Entrust.java
    • sun/security/ssl/X509TrustManagerImpl/distrust/Symantec.java

The jdk_security3_0 failures include: 08:13:58 ACTION: main -- Failed. Execution failed: 'main' threw exception: java.lang.RuntimeException: Unexpected exception: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

NOTE: Camerfirma and Entrust tests have now been removed under https://github.com/adoptium/aqa-tests/pull/6186/files

SPARC Grinder with the three failed targets: https://ci.adoptium.net/job/Grinder/12879/ passed jdk_lang_0 (InterfaceAccessFlagsTest) as expected (based on the next comment!) and passed some, but not all, of the X509TrustManagerImpl tests. I will run a Grinder with just those tests listed here, along with the InterfaceAccessFlags one and one or two of the other jdk_security3_0 failures we had from the non-Grinder run, for 100 iterations to see how reliable the failures are.

12:42:00  PASSED test targets:
12:42:00  	jdk_lang_0 - Test results: passed: 474 
12:42:00  
12:42:00  FAILED test targets:
12:42:00  	jdk_security3_0 - Test results: passed: 606; failed: 1 
12:42:00  		Failed test cases: 
12:42:00  			TEST: sun/security/ssl/X509TrustManagerImpl/Entrust/Distrust.java
12:42:00          
12:42:00  	jdk_util_0 - Test results: passed: 675; failed: 2 
12:42:00  		Failed test cases: 
12:42:00  			TEST: java/util/Currency/ValidateISO4217.java
12:42:00          TEST: java/util/TimeZone/AssureTzdataVersion.java

100 run Grinder: https://ci.adoptium.net/job/Grinder/12881/console - FAILED because it did not like sun/security/ssl/X509TrustManagerImpl/distrust/Symantec.java in the CUSTOM_TARGET field (maybe it needs hotspot_custom, or we've hit a rename somewhere?), so running without it at https://ci.adoptium.net/job/Grinder/12882/console for now:

Image

AssureTzdataVersion and Entrust/Distrust both fail on Linux/aarch64 so are not specific to these Solaris runs:

Image

sxa avatar Apr 10 '25 09:04 sxa

Relevant comments from a slack thread where this was analysed for the January cycle - noting that the java_util failures are not covered by this:

I've managed to beat almost everything into submission but there are two tests putting up a bit of a fight plus one erroneous (?) suite:

  • ~~jdk_math_jre_0 got kicked off as part of my runs of the normal test targets but it doesn't seem to work. I suspect that one's not supposed to have been run so likely user error.~~
  • jdk/lambda/vm/InterfaceAccessFlagsTest.java fails on both x64 and SPARC although I do now have what appears to be a pass on SPARC here. We'll see if a re-run of that job on x64 runs when this Grinder completes. Noting that this test has been excluded for linux-all
  • javax/net/ssl/ciphersuites/DisabledAlgorithms.java from jdk_security3_0 is putting up more of a fight (Passes on SPARC, not x64). The error is: Expected SSL exception not thrown on server side. It looks like it used to pass with the material from 8u422 and earlier but now doesn't, so I'm not sure why it didn't show up in the October release triage. Maybe it just got missed. Maybe it's a security.properties issue, although I'd expect it to be consistent across architectures if that was the case.

Looks like the InterfaceAccessFlagsTest Grinder has some passes in it (Up to about 30 iterations of 100) so I think we can probably cross that one off the list. ~~Only question mark is therefore on DisabledAlgorithms.java which is not a regression from 8u432~~ EDIT: https://github.com/adoptium/aqa-tests/issues/5915 covers an upstream change that caused this to stop working


Some resources for the ValidateISO4217 test failure - there is nothing recent but the comments might indicate a test material sync issue:

  • https://github.com/adoptium/aqa-tests/issues/678
  • https://github.com/adoptium/aqa-tests/issues/3640#issuecomment-1151717530

CancelledLockLoops was noted as a one-off failure on Linux/arm32 JDK11 at https://github.com/adoptium/aqa-tests/issues/5691#issuecomment-2418402821 in the October 2024 release cycle - hopefully that will be a one-off here too.

SSLConfigFilePermissionTest is the issue with the port being blocked by the dhcpagent on Solaris/x64 so not considered a functional problem.

sxa avatar Apr 10 '25 10:04 sxa

Note: java/util/Currency/ValidateISO4217.java is an upstream platform-neutral issue. Exclude requested at https://github.com/adoptium/aqa-tests/pull/6179 so that can be ignored for the purposes of this.

The other two failures are also new since the last GA so may be related to the ValidateISO4217 failure. I have also verified at https://ci.adoptium.net/job/Grinder/12890/console that these are NOT specific to Solaris.

sxa avatar Apr 11 '25 12:04 sxa

Easy way to summarise the consoleText log: awk '/FAILED test targets:/,/TOTAL:/'
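
As a usage sketch: the awk range pattern prints every line from a `FAILED test targets:` match through the next line containing `TOTAL:`, inclusive. Wrapped in a small function it can read either a saved log or a piped download. The function name summarise_console and the BUILD_URL variable are made up here for illustration.

```shell
# Sketch only: summarise_console is a hypothetical wrapper name.
# Reads a Jenkins console log from the named files, or from stdin if none given.
summarise_console() {
    awk '/FAILED test targets:/,/TOTAL:/' "$@"
}

# Example usage against a saved log:
#   summarise_console consoleText
# or piped from a build (BUILD_URL is a placeholder for the build's URL):
#   curl -s "$BUILD_URL/consoleText" | summarise_console
```

If a run has no failed targets the start pattern never matches, so the function prints nothing.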

sxa avatar Apr 11 '25 13:04 sxa

Additional work to allow downloading of restricted release build jobs is being done under https://github.com/adoptium/temurin-build/pull/4253 by copying the job artefacts to the proxy machine and then 'scp'ing them across to the target. This will close off the final action for the test jobs.

Both simpletest jobs have been adjusted to include a copyartifacts step and take UPSTREAM_JOB_NAME/UPSTREAM_JOB_NUMBER as parameters.

sxa avatar Sep 11 '25 11:09 sxa

This is complete but apparently I didn't close the issue previously :-)

sxa avatar Nov 26 '25 15:11 sxa