
Pipeline status not accurate when a test job hits a timeout and enters `ABORTED` state

Open sxa opened this issue 1 year ago • 1 comments

Related: Earlier fix applied to set the pipeline status more accurately - https://github.com/adoptium/ci-jenkins-pipelines/issues/1068

Problem identified after a user in Slack reported that 23+36-ea was missing for Alpine/x64 in the release.

  • https://ci.adoptium.net/job/build-scripts/job/openjdk23-pipeline/68/ (showing as ABORTED)

  • The Alpine/x64 subjob hit a failure state because it ran on a broken machine (dockerhost-skytap): 20:01:26 Build [build-scripts » jobs » jdk23 » jdk23-alpine-linux-x64-temurin #15](https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk23/job/jdk23-alpine-linux-x64-temurin/15/) completed: FAILURE

  • It looks like the riscv64 pipeline ended in an ABORTED state, which is presumably what set the overall pipeline status to ABORTED: `00:05:06 Propagating downstream job result: build-scripts/jobs/jdk23/jdk23-linux-riscv64-temurin, Result: ABORTED CopyArtifactsSuccess: true`
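For illustration, here is a minimal sketch of why one ABORTED subjob could hide the Alpine/x64 FAILURE in the top-level status. The ordinals mirror Jenkins' `hudson.model.Result` ordering, where ABORTED ranks as worse than FAILURE. The `combine` and `combine_for_reporting` helpers are hypothetical and are not taken from the pipeline code; the assumption is simply that the parent keeps the worst downstream result.

```python
from enum import IntEnum


class Result(IntEnum):
    # Ordinals follow Jenkins' hudson.model.Result: higher means worse.
    SUCCESS = 0
    UNSTABLE = 1
    FAILURE = 2
    NOT_BUILT = 3
    ABORTED = 4


def combine(results):
    """Worst-wins aggregation (assumed behaviour of the parent pipeline)."""
    return max(results, default=Result.SUCCESS)


def combine_for_reporting(results):
    """Hypothetical alternative: treat a downstream ABORTED as FAILURE."""
    remapped = [Result.FAILURE if r is Result.ABORTED else r for r in results]
    return combine(remapped)


# Downstream results as reported in the console log of pipeline run 68.
downstream = {
    "jdk23-alpine-linux-x64-temurin": Result.FAILURE,
    "jdk23-linux-riscv64-temurin": Result.ABORTED,
}

print(combine(downstream.values()).name)                # ABORTED
print(combine_for_reporting(downstream.values()).name)  # FAILURE
```

One caveat with the remapping: it would also turn a deliberate, user-initiated abort into FAILURE, so any real change would probably need to distinguish timeouts from manual aborts.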

The BlueOcean view of the pipeline did not pick up on either the failures on Alpine/x64 or riscv64:

image

Two things:

  1. The overall pipeline state was ABORTED rather than FAILED, which may not give the best impression of the status when it is reported in the Slack channel and elsewhere.
  2. The riscv64 extended.openjdk jobs appear to be hitting a 25-hour timeout, so that will need to be addressed. jdk22 is taking 21-23 hours. jdk23+24 are hitting the timeout.

image

sxa, Aug 12 '24 09:08

Interestingly, it appears that the test job cannot retrieve estimated test duration data to calculate how long targets take, and in the most recent runs the test target is split into only 1 list:

17:14:04  TEST DURATION
17:14:04  ====================================================================================
17:14:04  Total number of tests searched: 83
17:14:04  Number of test durations found: 0
17:14:04  No test duration data found.
17:14:04  (Default duration assigned, executed tests: 40s; not executed tests: 0s.)
17:14:04  ====================================================================================
17:14:04  
17:14:04  Test target is split into 1 lists.
17:14:04  Reducing estimated test running time from 26m40s to 26m40s.

Previous runs, for example Test_openjdk23_hs_extended.openjdk_riscv64_linux/13, split into 3 lists even though no test duration data could be found:

17:01:26  TEST DURATION
17:01:26  ====================================================================================
17:01:26  Total number of tests searched: 93
17:01:26  Number of test durations found: 0
17:01:26  No test duration data found.
17:01:26  (Default duration assigned, executed tests: 40s; not executed tests: 0s.)
17:01:26  ====================================================================================
17:01:26  
17:01:26  Test target is split into 3 lists.
17:01:26  Reducing estimated test running time from 30m40s to 10m40s.
17:01:26  
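As a quick sanity check on the numbers in both logs, here is a small sketch of the arithmetic. It assumes each executed test gets the 40s default and that tests are spread as evenly as possible across the lists. The executed-test counts (40 and 46) are inferred by dividing the logged totals by 40s; they are not printed in the logs, and the real splitting logic may differ.

```python
import math

DEFAULT_SECONDS = 40  # default duration per executed test, as stated in the log


def fmt(seconds):
    """Format seconds in the log's XmYs style."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs}s"


def estimate(executed_tests, lists):
    """Return (total time, longest list time), assuming an even split."""
    total = executed_tests * DEFAULT_SECONDS
    longest = math.ceil(executed_tests / lists) * DEFAULT_SECONDS
    return fmt(total), fmt(longest)


# Recent run: 40 executed tests (inferred), 1 list.
print(estimate(40, 1))  # ('26m40s', '26m40s')

# Run 13: 46 executed tests (inferred), 3 lists.
print(estimate(46, 3))  # ('30m40s', '10m40s')
```

Both results match the "Reducing estimated test running time" lines, so the only difference between the runs seems to be the number of lists chosen, not the duration estimate itself.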

Will check the test code to see whether the number of lists is based on which nodes are idle, versus which ones are online.

smlambert, Aug 12 '24 13:08