temurin-build
fix: smoke test not working for jdk19/20 on alpine x64
- suspect it could be Ant, running on an underlying JDK 19/20, not working well with the "copy" target
- this PR changes:
- add a check that testng.xml exists before copying (this is a known issue in Ant: if the source file is missing, copy could hang)
- use the cp command instead of the copy target (we run on Linux, on Windows with Cygwin, or on macOS, so cp should work in all three cases)
- only testng.xml is needed for the smoke tests, not build.xml, playlist.xml or any makefile. But to stay aligned with the old code, cp all matched *.xml files
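The copy logic described in the list above can be sketched in Python for illustration only. The PR itself does this with a shell `cp` call, not Python. The function name `copy_smoke_xml` is hypothetical. Raising an error when testng.xml is missing is an assumption; the real change may simply skip the copy instead.

```python
import glob
import os
import shutil
from typing import List


def copy_smoke_xml(src_dir: str, dest_dir: str) -> List[str]:
    """Copy every top-level *.xml file from src_dir into dest_dir.

    Hypothetical helper: checks that testng.xml exists first (the PR adds
    such a check because a missing source could make Ant's copy hang), then
    copies all matched *.xml files to stay aligned with the old behaviour.
    Returns the destination paths written.
    """
    if not os.path.isfile(os.path.join(src_dir, "testng.xml")):
        # Assumption: fail loudly; the real change may just skip the copy.
        raise FileNotFoundError(f"testng.xml not found in {src_dir}")
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for src in sorted(glob.glob(os.path.join(src_dir, "*.xml"))):
        dest = os.path.join(dest_dir, os.path.basename(src))
        shutil.copy2(src, dest)  # copies contents plus timestamps/permissions
        copied.append(dest)
    return copied
```

Non-XML files such as a Makefile are left behind, which matches the "only *.xml" intent.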
P.S.: this PR only tries to fix the hanging smoke test job in Jenkins. There is no solid evidence that Ant is not working with jdk19/20, or of why the problem is only seen on certain jobs on certain platform+OS combinations.
Test runs:
- https://ci.adoptopenjdk.net/job/Grinder/5548/console is jdk19 on alpine x64
- https://ci.adoptopenjdk.net/job/Grinder/5552/console is jdk19 on windows x64
- https://ci.adoptopenjdk.net/job/Grinder/5553/console is jdk20 on mac aarch64
Fix: https://github.com/adoptium/temurin-build/issues/3031
If ant copy does not work for jdk19/20 on alpine x64, we should report this issue to ant.
If ant copy is not working on x64 alpine-linux, how are the other test jobs running successfully? For example: https://ci.adoptopenjdk.net/job/Test_openjdk20_hs_sanity.functional_x86-64_alpine-linux/11/consoleFull
Based on the console outputs, some ant targets that also copy files run successfully in the smoke job...
22:33:03 dist_functional:
22:33:03 [copy] Copying 2 files to /home/jenkins/workspace/build-scripts/jobs/jdk19/jdk19-alpine-linux-x64-temurin_SmokeTests/jvmtest/functional
22:33:03
22:33:03 dist:
22:33:04 [jar] Building jar: /home/jenkins/workspace/build-scripts/jobs/jdk19/jdk19-alpine-linux-x64-temurin_SmokeTests/jvmtest/functional/buildAndPackage/BuildAndPackagingTests.jar
22:33:04 [copy] Copying 3 files to /home/jenkins/workspace/build-scripts/jobs/jdk19/jdk19-alpine-linux-x64-temurin_SmokeTests/jvmtest/functional/buildAndPackage
Cancelling nested steps due to timeout
08:30:24 Sending interrupt signal to process
08:30:28 143
The dist_functional target successfully copies 2 files to the workdir. Then the dist target builds the jar successfully, and hangs when trying to copy the 2 files plus the jar file. Is there something unusual about that jar file?
I see that in other test runs on x64 alpine-linux, ant is able to build and copy xml and jar files, for example in https://ci.adoptopenjdk.net/job/Test_openjdk20_hs_sanity.functional_x86-64_alpine-linux/11/consoleFull:
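A saved console log can show exactly where a run stopped by recording the last Ant target and task line. The helper below (`last_ant_activity`) is hypothetical. It assumes the Jenkins console format quoted above:
- an optional `HH:MM:SS` prefix;
- target headers such as `dist:`;
- task lines such as `[copy] Copying 3 files to ...`.

The regexes are heuristics, not Ant's own output grammar.

```python
import re
from typing import Optional, Tuple

# Optional Jenkins "HH:MM:SS " prefix, then "[task] message".
TASK_LINE = re.compile(r"^(?:\d{2}:\d{2}:\d{2}\s+)?\[(\w+)\]\s*(.*)$")
# Optional prefix, then a bare target header such as "dist:".
TARGET_LINE = re.compile(r"^(?:\d{2}:\d{2}:\d{2}\s+)?([\w.-]+):\s*$")


def last_ant_activity(log: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (target, task, message) for the last Ant task line in log."""
    current_target = None
    last = (None, None, None)
    for raw in log.splitlines():
        line = raw.strip()
        target_match = TARGET_LINE.match(line)
        if target_match:
            current_target = target_match.group(1)
            continue
        task_match = TASK_LINE.match(line)
        if task_match:
            # Remember which target this task ran under.
            last = (current_target, task_match.group(1), task_match.group(2))
    return last
```

On the smoke-test excerpt above, this reports target `dist` and task `copy`, with a message starting `Copying 3 files to`. That is the copy that never finished.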
07:27:23 dist:
07:27:23 [jar] Building jar: /home/jenkins/workspace/Test_openjdk20_hs_sanity.functional_x86-64_alpine-linux/jvmtest/functional/Java12andUp/GeneralTest.jar
07:27:23 [copy] Copying 3 files to /home/jenkins/workspace/Test_openjdk20_hs_sanity.functional_x86-64_alpine-linux/jvmtest/functional/Java12andUp
07:27:23
Perhaps try a run with ant -verbose or ant -debug to see more about what is really happening?
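A local run with Ant's `-verbose` (or `-debug`) flag may hang the same way. Bounding it with a timeout avoids waiting for the Jenkins job timeout. `run_with_timeout` is a hypothetical helper built on Python's `subprocess.run(..., timeout=...)`. On timeout, `subprocess.run` kills only the direct child, so any processes that child spawned may survive.

```python
import subprocess
from typing import Optional, Sequence


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> Optional[int]:
    """Run cmd (output goes to this process's stdout/stderr).

    Returns the exit code, or None if the command was still running after
    timeout_s seconds and had to be killed.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

For example, `run_with_timeout(["ant", "-verbose", "-f", "build.xml", "dist"], timeout_s=600)`. The build file path and the 600 s limit are placeholders.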
This is the part I am confused about: why does it only happen in the smoke tests and not in the other tests on jdk19/20?
The GH action uses a different runner, with adoptopenjdk/alpine3_build_image on alpine 3.16, which is not the one we use to set up the Jenkins agent.
I have a run with -d -verbose:
https://ci.adoptopenjdk.net/job/Grinder/5556/console
16:23:47 [jar] Location: /home/jenkins/workspace/Grinder/aqa-tests/functional/buildAndPackage/build.xml:53:
16:23:47 [copy] Copying 3 files to /home/jenkins/workspace/Grinder/jvmtest/functional/buildAndPackage
16:23:47 [copy] Copying /home/jenkins/workspace/Grinder/aqa-tests/functional/buildAndPackage/build.xml to /home/jenkins/workspace/Grinder/jvmtest/functional/buildAndPackage/build.xml
Aborted by [Wen Zhou](https://ci.adoptopenjdk.net/user/zdtsw)
16:54:55 Sending interrupt signal to process
Not sure how the GitHub runner info is relevant, as your Grinder is running on test-docker-alpine314-x64-1, and so are the smoke tests; they are not run on GitHub runners. Is test-docker-alpine314-x64-1 set up using the alpine3_build_image?
We have no problem running these smoke tests in GH Actions on all JDK versions; all of them run from alpine3_build_image. These Jenkins agents (e.g. test-docker-alpine314-x64-1) are set up by an Ansible playbook based on different Dockerfiles (alpine 3.11, 3.12 and 3.14).
Thanks @zdtsw!
I see now that it is mentioned in the issue this PR is intended to fix. May I ask that you use the Closes or Fixes keyword in your PRs, so that one can easily locate the issue it relates to? I missed seeing the ref to https://github.com/adoptium/temurin-build/issues/3031
It seems that would be quite a relevant difference.
The XML fix looks better. Do we know historically why we shipped the .mk files?
Not really sure. Maybe when smoke tests were added to temurin-build, we did the same as aqa-tests was doing: https://github.com/adoptium/aqa-tests/blob/master/functional/security/build.xml#L57
git blame tells me @smlambert authored that back in 2020 ;-) - Hey Shelley, any chance you recall this from memory lane?
Likely to handle future cases where we may choose to handle nested test dirs (as we do for other types of testing and knowing we intend to continue adding smoke tests).
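If nested test directories are the reason, the difference is between matching only top-level files and walking the whole tree. The sketch below assumes the goal is to copy `*.xml` and `*.mk` files from every subdirectory while keeping their relative paths. `copy_test_files` is hypothetical; both aqa-tests and temurin-build drive this from Ant build.xml files, not Python.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def copy_test_files(src: Path, dest: Path,
                    patterns: Iterable[str] = ("*.xml", "*.mk")) -> List[Path]:
    """Copy files matching patterns from src and all its subdirectories
    into dest, preserving each file's path relative to src."""
    copied = []
    for pattern in patterns:
        # rglob walks nested directories, unlike a top-level-only glob.
        for path in sorted(src.rglob(pattern)):
            if not path.is_file():
                continue
            target = dest / path.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(target)
    return copied
```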
As discussed in a call this morning, I do not want this PR merged as a workaround. I want to keep digging to uncover the root cause first. This PR is a big hack around a real problem, and I'd like us to spend a bit longer trying to understand and resolve it before working around the unknown.
Diffing a failing jdk20 run against a passing jdk18u smoke test run, up to the point in the ant dist target where jdk20 hangs, to see if anything looks off or can tell us more:
Failing jdk20 run | Passing jdk18u run |
---|---|
Running on test-docker-alpine314-x64-2 in /home/jenkins/workspace/build-scripts/jobs/jdk/jdk-alpine-linux-x64-temurin_SmokeTests | Running on test-docker-alpine311-x64-1 in /home/jenkins/workspace/build-scripts/jobs/jdk18u/jdk18u-alpine-linux-x64-temurin_SmokeTests |
NODE_LABELS=ci.role.test hw.arch.x86 sw.os.alpine-linux test-docker-alpine314-x64-2 | NODE_LABELS=AMD ci.role.test hw.arch.x86 sw.os.alpine-linux test-docker-alpine311-x64-1 |
00:27:13 =JAVA VERSION OUTPUT BEGIN=openjdk version "20-beta" 2023-03-21 OpenJDK Runtime Environment Temurin-20+12-202208310337 (build 20-beta+12-202208310337) OpenJDK 64-Bit Server VM Temurin-20+12-202208310337 (build 20-beta+12-202208310337, mixed mode, sharing) =JAVA VERSION OUTPUT END= =RELEASE INFO BEGIN= IMPLEMENTOR="Eclipse Adoptium" IMPLEMENTOR_VERSION="Temurin-20+12-202208310337" JAVA_VERSION="20" JAVA_VERSION_DATE="2023-03-21" | 20:21:33 =JAVA VERSION OUTPUT BEGIN=openjdk version "18.0.2.1-beta" 2022-08-18 OpenJDK Runtime Environment Temurin-18.0.2.1+1-202208312342 (build 18.0.2.1-beta+1-202208312342) OpenJDK 64-Bit Server VM Temurin-18.0.2.1+1-202208312342 (build 18.0.2.1-beta+1-202208312342, mixed mode, sharing) =JAVA VERSION OUTPUT END= =RELEASE INFO BEGIN= IMPLEMENTOR="Eclipse Adoptium" IMPLEMENTOR_VERSION="Temurin-18.0.2.1+1-202208312342" JAVA_VERSION="18.0.2.1" JAVA_VERSION_DATE="2022-08-18" |
Could not add alternate for '/home/jenkins/openjdk_cache': reference repository '/home/jenkins/openjdk_cache' is not a local repository. Updating files: 52% (5270/10087) Updating files: 53% (5347/10087) Updating files: 99% (9987/10087) Updating files: 100% (10087/10087) Updating files: 100% (10087/10087), done. check OpenJ9 Repo sha /home/jenkins/workspace/build-scripts/jobs/jdk/jdk-alpine-linux-x64-temurin_SmokeTests/aqa-tests/TKG/scripts/getSHA.sh --repo_dir /home/jenkins/workspace/build-scripts/jobs/jdk/jdk-alpine-linux-x64-temurin_SmokeTests/aqa-tests/openj9 --output_file /home/jenkins/workspace/build-scripts/jobs/jdk/jdk-alpine-linux-x64-temurin_SmokeTests/aqa-tests/TKG/SHA.txt Check sha in /home/jenkins/workspace/build-scripts/jobs/jdk/jdk-alpine-linux-x64-temurin_SmokeTests/aqa-tests/openj9 and store the info in /home/jenkins/workspace/build-scripts/jobs/jdk/jdk-alpine-linux-x64-temurin_SmokeTests/aqa-tests/TKG/SHA.txt | Could not add alternate for '/home/jenkins/openjdk_cache': reference repository '/home/jenkins/openjdk_cache' is not a local repository. check OpenJ9 Repo sha /home/jenkins/workspace/build-scripts/jobs/jdk18u/jdk18u-alpine-linux-x64-temurin_SmokeTests/aqa-tests/TKG/scripts/getSHA.sh --repo_dir /home/jenkins/workspace/build-scripts/jobs/jdk18u/jdk18u-alpine-linux-x64-temurin_SmokeTests/aqa-tests/openj9 --output_file /home/jenkins/workspace/build-scripts/jobs/jdk18u/jdk18u-alpine-linux-x64-temurin_SmokeTests/aqa-tests/TKG/SHA.txt Check sha in /home/jenkins/workspace/build-scripts/jobs/jdk18u/jdk18u-alpine-linux-x64-temurin_SmokeTests/aqa-tests/openj9 and store the info in /home/jenkins/workspace/build-scripts/jobs/jdk18u/jdk18u-alpine-linux-x64-temurin_SmokeTests/aqa-tests/TKG/SHA.txt |
This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-thread-multi | This is perl 5, version 30, subversion 3 (v5.30.3) built for x86_64-linux-thread-multi |
cpuCores : 56 | cpuCores : 48 |
GNU Make 4.3 | GNU Make 4.2.1 |
dist: [jar] Building jar: /home/jenkins/workspace/build-scripts/jobs/jdk/jdk-alpine-linux-x64-temurin_SmokeTests/jvmtest/functional/buildAndPackage/BuildAndPackagingTests.jar [copy] Copying 3 files to /home/jenkins/workspace/build-scripts/jobs/jdk/jdk-alpine-linux-x64-temurin_SmokeTests/jvmtest/functional/buildAndPackage Aborted by Wen Zhou | dist: [jar] Building jar: /home/jenkins/workspace/build-scripts/jobs/jdk18u/jdk18u-alpine-linux-x64-temurin_SmokeTests/jvmtest/functional/buildAndPackage/BuildAndPackagingTests.jar [copy] Copying 3 files to /home/jenkins/workspace/build-scripts/jobs/jdk18u/jdk18u-alpine-linux-x64-temurin_SmokeTests/jvmtest/functional/buildAndPackage clean: ... continues to success |
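The comparison in the table above can be reproduced mechanically with Python's `difflib`, assuming both console logs are saved as text. The approach has three steps:
1. Strip the Jenkins `HH:MM:SS` prefix from each line.
2. Cut each log at the first `dist:` line.
3. Produce a unified diff of the two results.

The helper names `normalise` and `diff_until` are hypothetical.

```python
import difflib
import re
from typing import List

# Jenkins prefixes console lines with "HH:MM:SS "; strip it so timestamps
# do not show up as a difference on every line.
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\s*")


def normalise(log: str, stop_at: str) -> List[str]:
    """Drop timestamps and keep lines up to and including the first line
    equal to stop_at (e.g. "dist:")."""
    lines = []
    for raw in log.splitlines():
        line = TIMESTAMP.sub("", raw).rstrip()
        lines.append(line)
        if line == stop_at:
            break
    return lines


def diff_until(failing: str, passing: str, stop_at: str = "dist:") -> List[str]:
    """Unified diff (passing -> failing) of two logs, cut at stop_at."""
    return list(difflib.unified_diff(
        normalise(passing, stop_at), normalise(failing, stop_at),
        fromfile="passing", tofile="failing", lineterm=""))
```

Workspace paths differ between the jdk and jdk18u jobs, so almost every path line will appear in the diff. Rewriting those job names to a common placeholder first would make the real differences stand out.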
Notables / questions:
- we should try running the jdk20 smoke tests explicitly on test-docker-alpine311-x64-1 to see if it behaves differently (Grinder/5577: looks like it also hangs)
- why does the jdk18u job get sent to NODE_LABELS=AMD && test labels? Is that explicitly set somewhere?
- for jdk20, is it considered tip? It uses jdk rather than jdkXu in its job and dir names. Does this matter to smoke tests? It does not matter to other tests. (I do NOT expect this to be an issue, but noting it just in case.)
- perl and make versions differ, if that matters
- what files are getting updated in the jdk20 run?
- also rerun with -d -verbose as in https://ci.adoptopenjdk.net/job/Grinder/5556/, but on a known-to-pass jdk18u run (Grinder/5578), to see the diff
- git versions also differ across alpine machines (git version 2.32.0 versus git version 2.26.3 on test-docker-alpine312-x64-2)
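The perl, make and git differences listed above can be pulled out of two logs automatically. The helpers `tool_versions` and `version_differences` are hypothetical. Their regexes only cover the version banners quoted in this thread (`This is perl ... (v5.32.1)`, `GNU Make 4.3`, `git version 2.32.0`).

```python
import re
from typing import Dict, Optional, Tuple

# Patterns for the version banners seen in these console logs (assumed format).
VERSION_PATTERNS = {
    "perl": re.compile(r"This is perl .*?\(v(\d+\.\d+\.\d+)\)"),
    "make": re.compile(r"GNU Make (\d+(?:\.\d+)+)"),
    "git": re.compile(r"git version (\d+(?:\.\d+)+)"),
}


def tool_versions(log: str) -> Dict[str, str]:
    """Map tool name -> first version string found in the log."""
    found = {}
    for tool, pattern in VERSION_PATTERNS.items():
        match = pattern.search(log)
        if match:
            found[tool] = match.group(1)
    return found


def version_differences(a: str, b: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Tools whose versions differ (or are missing) between logs a and b."""
    va, vb = tool_versions(a), tool_versions(b)
    return {tool: (va.get(tool), vb.get(tool))
            for tool in sorted(set(va) | set(vb))
            if va.get(tool) != vb.get(tool)}
```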
Closing this PR: it looks like both the jdk19 and jdk20 smoke tests have worked on alpine x64 since 24th Nov.