temurin-build
jdk(20) alpine x64 linux smoke test hanging
https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-alpine-linux-x64-temurin_SmokeTests/
It seems it is not only jdk20 that has this problem: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk19/job/jdk19-alpine-linux-x64-temurin_SmokeTests/17/console
```
00:39:11.008 [copy] Copying 3 files to /home/jenkins/workspace/build-scripts/jobs/jdk19/jdk19-alpine-linux-x64-temurin_SmokeTests/jvmtest/functional/buildAndPackage
Cancelling nested steps due to timeout
10:36:28.088 Sending interrupt signal to process
```
For the ones marked as green, e.g. https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk18u/job/jdk18u-alpine-linux-x64-temurin_SmokeTests/47/console and https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-alpine-linux-x64-temurin_SmokeTests/32/console, the logs are chopped and can only be seen from the raw text.
I still do not understand why the tests work in GitHub Actions on both jdk19 and jdk20, e.g. https://github.com/adoptium/temurin-build/runs/8045139038?check_suite_focus=true
It seems the GHA workflow is using the adoptopenjdk/alpine3_build_image image, while the jenkins docker alpine agent uses a different dockerfile from infrastructure/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/DockerStatic/Dockerfiles.
So I did a test: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk19/job/jdk19-alpine-linux-x64-temurin_SmokeTests/23/console . The change is to comment out the block
```xml
<!-- <copy todir="${DEST}">
    <fileset dir="${src}/../" includes="*.xml"/>
    <fileset dir="${src}/../" includes="*.mk"/>
</copy> -->
```
in build.xml, and it seems the hanging part is copying those 3 files.
~~Since I cannot abort the ongoing pipeline, I made a new test on Grinder: https://ci.adoptopenjdk.net/job/Grinder/5519/console . Basically it moves the "copy" into a dedicated target and checks whether the files exist, so the job fails instead of hanging. I think it is a known issue with ant that, if the source files do not exist, a copy target just hangs. It shows testng.xml (or any *.xml) is missing.~~
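For reference, a minimal sketch of that fail-fast idea, assuming hypothetical target names and reusing the `${src}`/`${DEST}` properties from the block above (this is not the exact change from that Grinder run):

```xml
<!-- Hypothetical sketch: fail the build if no *.xml sources exist,
     instead of letting <copy> hang. Target names are assumptions. -->
<target name="check.copy.sources">
    <fail message="No *.xml files found under ${src}/../ - failing instead of hanging">
        <condition>
            <resourcecount when="equal" count="0">
                <fileset dir="${src}/../" includes="*.xml"/>
            </resourcecount>
        </condition>
    </fail>
</target>

<target name="copy.config.files" depends="check.copy.sources">
    <copy todir="${DEST}">
        <fileset dir="${src}/../" includes="*.xml"/>
        <fileset dir="${src}/../" includes="*.mk"/>
    </copy>
</target>
```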
With more tests, it seems it is not about the files being missing:
https://ci.adoptopenjdk.net/job/Grinder/5541/console is for jdk18 with my test branch issue/3031, which explicitly copies two xml files; that works, but it does not work on jdk19: https://ci.adoptopenjdk.net/job/Grinder/5534/console
https://ci.adoptopenjdk.net/job/Grinder/5548/console is on jdk19: when I change from the ant "copy" task to an executable call of cp, it works.
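Roughly, that workaround looks like the following sketch; the target name and the use of sh -c for the glob are my assumptions, not the exact change from that Grinder run:

```xml
<!-- Hypothetical sketch: call the system cp via <exec> instead of the ant <copy> task. -->
<target name="copy.config.files">
    <exec executable="sh" failonerror="true">
        <arg value="-c"/>
        <arg value="cp ${src}/../*.xml ${src}/../*.mk ${DEST}"/>
    </exec>
</target>
```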
https://ci.adoptopenjdk.net/job/Grinder/5552/console is the same code but run on windows jdk19, and https://ci.adoptopenjdk.net/job/Grinder/5553/console on mac jdk20.
Could the problem be that jdk19/20 do not work well with ant 1.10.9 for the copy task?
```
14:58:47 Run D:\jenkins\workspace\Grinder/openjdkbinary/j2sdk-image/bin/java -version
14:58:47 =JAVA VERSION OUTPUT BEGIN=
14:58:47 openjdk version "19-beta" 2022-09-20
```
Based on all of the information we have gathered so far, here is what we know about this smoke test:
- Passes on other platforms, only hanging on alpine-linux
- Passes on jdk18u and earlier, only hanging on jdk19 & jdk20
- The ant dist target runs fine and copies things well in other test jobs for alpine-linux jdk19 and jdk20
- Hangs on all machines labelled ci.role.test&&hw.arch.x86&&sw.os.alpine-linux
- Passes when run in a github workflow environment
Because this test runs fine in a github workflow environment and the other test jobs do not hang, it reminded me of another problem we have been seeing related to alpine-linux that needs to be addressed (which I think is related to, or the actual cause of, this problem): the smoke test job does not follow the other test jobs' naming convention, as evidenced by how it gets displayed in TRSS:
I believe that if we correct that naming issue, we will no longer see this hang. I suspect, but have not confirmed, that some dependent ant targets defined in TKG/scripts/build_test.xml must create dirs based on the known platform name x86-64_alpine-linux versus x64_alpine-linux.
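Purely as an illustration of what I mean (the property names and directory layout here are assumptions, not what TKG actually does), such a target might look like:

```xml
<!-- Hypothetical sketch: a dir derived from the platform string would differ
     between x86-64_alpine-linux and x64_alpine-linux. -->
<target name="init">
    <property name="platform.dir" value="${TEST_ROOT}/${PLATFORM}"/>
    <mkdir dir="${platform.dir}"/>
</target>
```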
Maybe @renfeiw can comment from a TKG perspective.
For TRSS, it sets the platform based on the job name. In this case, we are using 2 different naming conventions for the alpine linux platform, which causes a mismatch in the TRSS Grid view as shown in the screenshot above.
- the smoke test job name: jdk-alpine-linux-x64-temurin_SmokeTests
- the regular test job name: Test_openjdk19_hs_sanity.openjdk_x86-64_alpine-linux
This is a known issue https://github.com/adoptium/aqa-test-tools/issues/695
Thanks @llxia! But does that mean it is just how TRSS presents the results with the different naming conventions, and not really something related to running the tests?
What I am wondering is what this block of code does for the alpine-linux case for smoke tests: https://github.com/adoptium/ci-jenkins-pipelines/blob/master/pipelines/build/common/openjdk_build_pipeline.groovy#L193-L198
Some findings:
- https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk19u/job/jdk19u-alpine-linux-x64-temurin_SmokeTests/7/
- https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk19u/job/jdk19u-alpine-linux-x64-temurin_SmokeTests/9/
- https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-alpine-linux-x64-temurin_SmokeTests/72/
- https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-alpine-linux-x64-temurin_SmokeTests/71/
These are all the "green" ones. The common part of these builds is that they run on test-docker-alpine314-x64-1-NEW and test-docker-alpine314-x64-2-NEW. Could it be that alpine314 works but alpine312 does not, or is it the -NEW nodes? So I did a test bound to an old alpine314 node: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-alpine-linux-x64-temurin_SmokeTests/75/console . It hangs in the same place.
=> Only the two new nodes test-docker-alpine314-x64-1-NEW and test-docker-alpine314-x64-2-NEW work. => The underlying VM running the container moved from ubuntu2004 to ubuntu2204.
Closing this issue: both jdk19 and jdk20 smoke tests have worked on alpine x64 since 24th Nov. The problem was related to the jenkins agents we used; once they were replaced with the new ones, everything went well.