google-compute-engine-plugin
Jobs on preempted VMs hang indefinitely until manually cancelled
Jenkins and plugins versions report
Environment
Paste the output here
What Operating System are you using (both controller, and any agents involved in the problem)?
Ubuntu
Reproduction steps
- Create a pipeline whose steps run on preemptible instances.
- Run the pipeline until an instance is preempted in the middle of a step.
Expected Results
When a node is preempted, the jobs and steps running on it should be cancelled automatically and restarted on another node.
Actual Results
- The node is marked as "offline".
- The console output shows the preemption handler receiving the preemption event.
- The step is still treated as "in progress": it appears in the sidebar progress display and the build clock keeps ticking, even though the step can never complete.
Anything else?
No response
I have the same issue with the following setup:
- Jenkins controller
  - OS: Debian 10
  - Java 11
  - Jenkins 2.375.4 (official LTS Docker image + some plugins)
  - google-compute-engine:4.3.16
  - google-metadata-plugin:0.4
  - google-oauth-plugin:1.0.11
- Jenkins image template agent on GCE
  - OS: Ubuntu 22.04 LTS
  - Java 11
  - Remoting: launched over SSH (version 3077.vd69cf116da_6f)
Jenkinsfile used:
pipeline {
    agent {
        label 'preemptible'
    }
    options {
        timestamps()
    }
    stages {
        stage('Test 1') {
            steps {
                echo 'Hello World'
                sh 'hostname'
                sh '''#!/bin/bash
COUNT=0
while [ $COUNT -le 100 ]
do
    echo "Test # $COUNT"
    sleep 1
    COUNT=$(( $COUNT + 1 ))
done
'''
            }
        }
        stage('Test 2') {
            steps {
                echo 'Hello Buddy'
                sh 'uname -a'
                sleep time: 5, unit: 'MINUTES'
            }
        }
    }
}
I simulated a preemption by running the following command while the bash count loop was in progress:
gcloud compute instances simulate-maintenance-event --zone europe-west1-b --project my_project the_agent_gce_instance_name
Agent log and build log attached.
Steps to reproduce:
- Create a pipeline job with the declarative pipeline code above.
- While the job is in the middle of a step (such as the bash script), simulate a maintenance event with the above command.
- The agent receives the preemption event (as seen in the agent log), but the build is never failed or aborted and can keep running for days unless it is aborted manually.
- After the manual abort, the job is relaunched on another agent.
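For anyone trying to reproduce this, the simulation step above could be wrapped in a small script along these lines. This is only a sketch: the zone, project, and instance name are the placeholder values from the command above, and `DRY_RUN`, `DELAY_SECONDS`, and `build_simulate_cmd` are hypothetical names introduced here for illustration. By default it only prints the command, so nothing is sent to GCE until `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch of the "simulate a maintenance event mid-step" reproduction.
# All values below are placeholders taken from the report; override via env.
set -euo pipefail

ZONE="${ZONE:-europe-west1-b}"
PROJECT="${PROJECT:-my_project}"
INSTANCE="${INSTANCE:-the_agent_gce_instance_name}"
# Hypothetical knob: wait this long after starting the build so the event
# lands inside the ~100 second bash count loop of stage 'Test 1'.
DELAY_SECONDS="${DELAY_SECONDS:-30}"
# Hypothetical knob: 1 = print the command only, 0 = actually run gcloud.
DRY_RUN="${DRY_RUN:-1}"

# Print the gcloud invocation used in this report.
build_simulate_cmd() {
  echo "gcloud compute instances simulate-maintenance-event --zone ${ZONE} --project ${PROJECT} ${INSTANCE}"
}

if [ "$DRY_RUN" = "1" ]; then
  build_simulate_cmd
else
  sleep "$DELAY_SECONDS"
  gcloud compute instances simulate-maintenance-event \
    --zone "$ZONE" --project "$PROJECT" "$INSTANCE"
fi
```

Start the pipeline job first, then run the script with `DRY_RUN=0` and the real zone, project, and agent instance name. The delay may need adjusting so that the event arrives while the `sh` step is still running.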
Hello, are there any updates on this? It is quite annoying because jobs hang and are not restarted elsewhere. Do you need additional logs or troubleshooting information?