
jdk(20) alpine x64 linux smoke test hanging

Open andrew-m-leonard opened this issue 3 years ago • 12 comments

https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-alpine-linux-x64-temurin_SmokeTests/

andrew-m-leonard avatar Jul 08 '22 08:07 andrew-m-leonard

It seems jdk20 is not the only version with this problem: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk19/job/jdk19-alpine-linux-x64-temurin_SmokeTests/17/console

00:39:11.008       [copy] Copying 3 files to /home/jenkins/workspace/build-scripts/jobs/jdk19/jdk19-alpine-linux-x64-temurin_SmokeTests/jvmtest/functional/buildAndPackage
Cancelling nested steps due to timeout
10:36:28.088  Sending interrupt signal to process

zdtsw avatar Aug 23 '22 07:08 zdtsw

For the runs marked green, e.g. https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk18u/job/jdk18u-alpine-linux-x64-temurin_SmokeTests/47/console and https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-alpine-linux-x64-temurin_SmokeTests/32/console, the console logs are truncated; the full output is only visible in the raw text view.

zdtsw avatar Aug 23 '22 07:08 zdtsw

I still do not understand why the test works in GitHub Actions on both jdk19 and jdk20, e.g. https://github.com/adoptium/temurin-build/runs/8045139038?check_suite_focus=true

It seems GHA uses the adoptopenjdk/alpine3_build_image image, while the Jenkins Docker Alpine agent uses a different Dockerfile from infrastructure/ansible/playbooks/AdoptOpenJDK_Unix_Playbook/roles/DockerStatic/Dockerfiles

zdtsw avatar Aug 30 '22 10:08 zdtsw

So I ran a test: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk19/job/jdk19-alpine-linux-x64-temurin_SmokeTests/23/console. The change is to comment out this block

<!-- <copy todir="${DEST}">
           <fileset dir="${src}/../" includes="*.xml"/>
           <fileset dir="${src}/../" includes="*.mk"/>
       </copy> -->

in build.xml. It seems the part that hangs is the copy of the 3 files.

zdtsw avatar Aug 31 '22 07:08 zdtsw

~~since i cannot abort on-going pipeline, i made a new test on grinder https://ci.adoptopenjdk.net/job/Grinder/5519/console basically it moved the "copy" into a dedicated target and with check if files are there to fail the job instead of hanging there. i think it is a known issue with ant to do copy if the source files are not existing, then the target just hang there. shows testng.xml (or any *.xml is missing)~~

With more testing, it seems the problem is not that a file is missing but that the copy does not work at all. It does not matter whether a file pattern or an explicit file name is used.

zdtsw avatar Aug 31 '22 09:08 zdtsw

https://ci.adoptopenjdk.net/job/Grinder/5541/console is for jdk18 with my test branch issue/3031, which explicitly copies two xml files. It works there, but the same change does not work on jdk19: https://ci.adoptopenjdk.net/job/Grinder/5534/console

zdtsw avatar Aug 31 '22 12:08 zdtsw

https://ci.adoptopenjdk.net/job/Grinder/5548/console is on jdk19: when I change from the Ant "copy" task to running the "cp" executable, it works.
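
A minimal sketch of what swapping the copy for `cp` could look like in an Ant build file. This is not the code from the test branch: the target name and the 60-second timeout are hypothetical, `${src}` and `${DEST}` are taken from the commented-out block above, and `cp` assumes a Unix-like agent such as the Alpine containers.

```xml
<!-- Sketch only: target name and timeout value are hypothetical. -->
<target name="copy-build-files">
    <mkdir dir="${DEST}"/>
    <!-- Run the system cp once with all matching files (parallel="true")
         instead of Ant's built-in copy task. failonerror makes a non-zero
         exit fail the build; timeout (milliseconds) kills the command
         if it does not finish. -->
    <apply executable="cp" parallel="true" failonerror="true" timeout="60000">
        <srcfile/>
        <arg file="${DEST}"/>
        <fileset dir="${src}/.." includes="*.xml,*.mk"/>
    </apply>
</target>
```

The `<srcfile/>` marker places the source file names before the destination argument, so the resulting command has the usual `cp file... dest` shape.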

zdtsw avatar Aug 31 '22 12:08 zdtsw

https://ci.adoptopenjdk.net/job/Grinder/5552/console is the same code run on Windows with jdk19, and https://ci.adoptopenjdk.net/job/Grinder/5553/console on Mac with jdk20.

Could the problem be that jdk19/20 does not work well with Ant 1.10.9 for the copy task?

14:58:47  Run D:\jenkins\workspace\Grinder/openjdkbinary/j2sdk-image/bin/java -version
14:58:47  =JAVA VERSION OUTPUT BEGIN=
14:58:47  openjdk version "19-beta" 2022-09-20
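
To check the Ant and Java combination in question, one option is a small diagnostic target that prints both versions to the console. This is a sketch with a hypothetical target name; it uses Ant's built-in properties and standard JVM system properties, and it reports the JVM that is running Ant, which may not be the JDK under test.

```xml
<!-- Sketch only: the target name is hypothetical. -->
<target name="print-versions">
    <!-- ant.version and ant.java.version are Ant built-in properties;
         java.version, os.name and os.arch are JVM system properties. -->
    <echo message="Ant: ${ant.version}"/>
    <echo message="Java running Ant: ${java.version} (detected as ${ant.java.version})"/>
    <echo message="Platform: ${os.name} ${os.arch}"/>
</target>
```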

zdtsw avatar Aug 31 '22 13:08 zdtsw

Based on all of the information we have gathered so far, here is what we know about this smoke test:

  • Passes on other platforms, only hanging on alpine-linux
  • Passes on jdk18u and earlier, only hanging on jdk19 & jdk20
  • The ant dist target runs fine and copies things well in other test jobs for alpine-linux jdk19 and jdk20
  • Hangs on all machines labelled ci.role.test&&hw.arch.x86&&sw.os.alpine-linux
  • Passes when run in a github workflow environment

Because this test runs fine in a github workflow environment and other test jobs do not hang, it reminded me of another alpine-linux problem that needs to be addressed, which I think is related to, or is the actual cause of, this hang: the smoke test job does not follow the naming convention of the other test jobs, as evidenced by how it is displayed in TRSS:

[Image: Screen Shot 2022-09-02 at 7 36 48 AM]

I believe that if we correct the naming issue, we will no longer see this hang. I suspect, but have not confirmed, that some dependent ant targets defined in TKG/scripts/build_test.xml must create dirs based on the known platform name x86-64_alpine-linux rather than x64_alpine-linux.

smlambert avatar Sep 02 '22 12:09 smlambert

Maybe @renfeiw can comment from TKG perspective.

TRSS sets the platform based on the job name. In this case, we are using 2 different naming conventions for the alpine linux platform, which causes a mismatch in the TRSS Grid view, as shown in the screenshot above.

  • the smoke test job name - jdk-alpine-linux-x64-temurin_SmokeTests
  • the regular test job name - Test_openjdk19_hs_sanity.openjdk_x86-64_alpine-linux

This is a known issue https://github.com/adoptium/aqa-test-tools/issues/695

llxia avatar Sep 02 '22 13:09 llxia

Thanks @llxia! But does that mean it is just how TRSS presents results under the different naming conventions, and not something related to running the test?

zdtsw avatar Sep 02 '22 14:09 zdtsw

What I am wondering is what is this block of code doing for the alpine-linux case for smoke tests: https://github.com/adoptium/ci-jenkins-pipelines/blob/master/pipelines/build/common/openjdk_build_pipeline.groovy#L193-L198

smlambert avatar Sep 02 '22 14:09 smlambert

Some findings:

  • https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk19u/job/jdk19u-alpine-linux-x64-temurin_SmokeTests/7/
  • https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk19u/job/jdk19u-alpine-linux-x64-temurin_SmokeTests/9/
  • https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-alpine-linux-x64-temurin_SmokeTests/72/
  • https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-alpine-linux-x64-temurin_SmokeTests/71/

These are all the "green" ones. What these builds have in common is that they ran on test-docker-alpine314-x64-1-NEW or test-docker-alpine314-x64-2-NEW. Could it be that alpine314 works but alpine312 does not, or is it the -NEW nodes? So I ran a test bound to an old alpine314 node: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-alpine-linux-x64-temurin_SmokeTests/75/console. It hangs in the same place.

=> Only the two new nodes, test-docker-alpine314-x64-1-NEW and test-docker-alpine314-x64-2-NEW, work. => The underlying VM running the containers changed from ubuntu2004 to ubuntu2204.

zdtsw avatar Nov 17 '22 10:11 zdtsw

Closing this issue: both the jdk19 and jdk20 smoke tests have worked on alpine x64 since 24th Nov. The problem was related to the Jenkins agents we use; once they were replaced with the new ones, everything went well.

zdtsw avatar Dec 07 '22 07:12 zdtsw