ci-jenkins-pipelines
Refactor jck wait logic into parallel jobs
To help report jck status asynchronously and integrate better with TRSS, update the wait logic to use parallel wait jobs for each jck group. Care must be taken in the implementation to ensure that "long-running" execution threads are not used, especially on jenkins-worker.
- [x] Step 1: Disable code that groups different jck targets together.
- [x] Step 2: Run different jck targets in separate, concurrent pipeline stages (see the sketch after this list).
- This may be achieved through step 1. Make sure.
- ~~Step 3: Just before a JCK job is remotely triggered, produce a rerun link.~~
- ~~Note to self: Sophia may have already solved this. Check for PRs and ping her if needed.~~
- Sophia did solve this. Go Sophia! 😄
- [x] Step 4: Change the "wait" stage that monitors the TCK status so that:
- [x] 4.1: We monitor each target in a separate stage.
- [x] 4.2: We release the jenkins executor while we sleep between status checks (20 mins per loop).
- [x] 4.3: We "resume" the "paused" stage when we check the status of a target.
- Or maybe create an untracked stage? Explore options.
- [x] 4.4: Report status using the same format as other aqa-test jobs for easy parsing.
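As a concrete illustration of Step 2, here is a minimal scripted-pipeline sketch, assuming a hypothetical `monitorRemoteJob` helper that triggers a single jck target remotely and tracks it to completion:

```groovy
// Sketch only: one concurrent branch (and stage) per jck target, instead of
// grouping the targets into a single remote job.
def targets = ["sanity.jck", "extended.jck", "special.jck"]
def branches = [:]
targets.each { target ->
    branches[target] = {
        stage("${target} remote job tracker") {
            // Trigger the remote job for this target and track it to completion.
            monitorRemoteJob(target) // hypothetical helper
        }
    }
}
parallel branches
```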
> Step 3: Just before a JCK job is remotely triggered, produce a rerun link.
This is already done (thanks Sophia). Here are the details:
A public link is generated that retriggers the private TC AQA_Test_Pipeline via the public AQA_Test_Pipeline.
See an example:
- Build job that triggers the remote TC AQA_Test_Pipeline: https://ci.adoptium.net/job/build-scripts/job/jobs/job/release/job/jobs/job/jdk21u/job/jdk21u-release-alpine-linux-aarch64-temurin/20/
- Generates a RERUN JCK TESTS link: `sanity.jck,extended.jck,special.jck jdk21 : jdk-21.0.9+10_adopt : aarch64_alpine-linux : sanity.jck,extended.jck,special.jck`, which is the public AQA_Test_Pipeline remotely triggering the private TC AQA_Test_Pipeline with the same parameters as the original remote trigger (skipping the rebuild that would otherwise be needed to relaunch the tests).
So as long as we split them apart (Steps 1 / 2), we should have better granularity for relaunching.
This issue covers splitting the tracking of jck remote jobs into separate jenkins stages for more detailed status reporting.
After investigation, it seems that tracking 4 different remote jobs via 4 different stages requires 4 different executors.
Executors are used for many different parts of the release process, and we currently have issues with blocking when we run out of executors.
So hogging 4 executors per build will only make this bottleneck worse.
Here are a few concept-level designs for restructuring the code that monitors the remote TCK jobs:
Option 1
We leave an agentless "gap" between the "wait" and "check remote job" stages, complete with time-checking logic for efficient use of the delay in getting a new executor.
Pro: This allows other tasks to claim the executor and use it during the wasted time we would usually spend "sleeping". Con: We may end up waiting more than 20 minutes between status checks if there's a dearth of executors.
Pseudo-code example:
node("jenkins-master") {
agent any
stage("extended.jck remote job tracker") {
loop start
stage("agent-free inner stage") {
agent none
stage("check remote job") {
agent any
** Fetch remote job object/data. **
** Check extended.jck job status. **
** Store remote job object/data somewhere. **
** Store current time. **
}
// Agent-free gap where the check stage releases its executor and waits for a new one.
stage("wait") {
agent any
** Fetch time stored in check stage **
** If it was >20 minutes ago, continue. Else wait.**
}
}
**Fetch job status**
**if status = completed then end loop**
}
}
Option 2
We use the primary executor to track the status of all jck jobs. When one completes, we trigger a stage whose only purpose is to complete with the relevant status.
Pros: Less convoluted than Option 1. Cons: Doesn't show "in-progress" targets.
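A rough sketch of what Option 2 might look like, assuming a hypothetical `checkRemoteJobStatus` helper that returns `null` while a remote job is still running:

```groovy
// Hypothetical sketch of Option 2: one long-lived executor polls every jck
// job; each completed job gets a trivial stage that just reports its status.
def pending = ["sanity.jck", "extended.jck", "special.jck"] as Set
node {
    while (pending) {
        pending.toList().each { target ->
            def status = checkRemoteJobStatus(target) // hypothetical helper; null while running
            if (status != null) {
                pending.remove(target)
                // Stage whose only purpose is to surface the final status.
                stage("${target} result") {
                    if (status != 'SUCCESS') {
                        unstable("${target} remote job finished with status ${status}")
                    }
                }
            }
        }
        if (pending) {
            sleep(time: 20, unit: 'MINUTES')
        }
    }
}
```

Since a result stage only appears once its job finishes, this also illustrates the "doesn't show in-progress targets" con above.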
Any thoughts?
@sophia-guo especially. :)
> Care must be taken in the implementation to ensure that "long-running" execution threads are not used, especially on jenkins-worker.
>
> After investigation, it seems that tracking 4 different remote jobs via 4 different stages requires 4 different executors.
I think there is no reason to use jenkins-worker if it causes any issues. Initially it was used for simplicity, but since it has now become a bottleneck, the job can be moved to any other workable agent, such as 'ci.jenkins.test'.
I don't think showing in-progress targets is necessary. Right now the process is:
remoteTrigger --> tck tests and all aqa-tests running --> (when all aqa-tests finish) query tck test status. That means even now the full in-progress state is not shown in the status.
The tck job "status" stage was mainly added as an easy way to record and show the remote job results, given their asynchronous nature. So I think we can poll sequentially, and once a job completes it can be marked as done. After all jobs are finished, use a stage to report the status of each one; this doesn't need to be parallel at all. Similar to your Option 2, @adamfarley?
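For illustration, a minimal sketch of that sequential approach, reusing the hypothetical `checkRemoteJobStatus` helper from the Option 2 sketch:

```groovy
// Hypothetical sketch: poll the remote jobs sequentially, record each result
// as it completes, and only then emit one reporting stage per target.
def results = [:]
def targets = ["sanity.jck", "extended.jck", "special.jck"]
node {
    while (results.size() < targets.size()) {
        targets.findAll { !results.containsKey(it) }.each { target ->
            def status = checkRemoteJobStatus(target) // hypothetical helper; null while running
            if (status != null) {
                results[target] = status // mark this job as done
            }
        }
        if (results.size() < targets.size()) {
            sleep(time: 20, unit: 'MINUTES')
        }
    }
    // All jobs finished: report each status via its own (non-parallel) stage.
    targets.each { target ->
        stage("${target} status") {
            if (results[target] != 'SUCCESS') {
                unstable("${target} finished with status ${results[target]}")
            }
        }
    }
}
```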
> ... So I think we can poll sequentially, and once a job completes it can be marked as done. After all jobs are finished, use a stage to report the status of each one; this doesn't need to be parallel at all.
Good idea. Will give that a shot. Thank you.