google-compute-engine-plugin icon indicating copy to clipboard operation
google-compute-engine-plugin copied to clipboard

Jobs on preempted VMs hang indefinitely until manually cancelled

Open rluckom-guardian opened this issue 2 years ago • 2 comments

Jenkins and plugins versions report

Environment
Paste the output here

What Operating System are you using (both controller, and any agents involved in the problem)?

Ubuntu

Reproduction steps

  1. Create a pipeline where steps are run on preemptible instances
  2. run the pipeline until an instance gets preempted while in the middle of a step

Expected Results

When a node is preempted, jobs / steps on the node should auto-cancel and be restarted elsewhere.

Actual Results

  1. the node is marked as "offline"
  2. the console output shows the preemption handler getting the preemption event;
  3. The step is still treated as "in progress"--it shows with the progress display in the sidebar, the build time clock keeps ticking, etc, despite there being no hope of ever completing

Anything else?

No response

rluckom-guardian avatar Jul 05 '23 21:07 rluckom-guardian

I also have the same issue with the following implementation :

  1. Jenkins controller
  • OS : Debian 10
  • Java 11
  • Jenkins 2.375.4 (official LTS Docker image + some plugins)
  • google-compute-engine:4.3.16
  • google-metadata-plugin:0.4
  • google-oauth-plugin:1.0.11
  1. Jenkins image template agent on GCE
  • OS : Ubuntu 22.04 LTS
  • Java 11
  • Remoting : managed with SSH (but it is 3077.vd69cf116da_6f)

Jenkinsfile used :

pipeline {
    agent {
        label 'preemptible'
    }
    options {
        timestamps()
    }
    stages {
        stage('Test 1') {
            steps {
                echo 'Hello World'
                sh 'hostname'
                sh '''#!/bin/bash
                    COUNT=0
                    while [ $COUNT -le 100 ] 
                    do
                        echo "Test # $COUNT"
                        sleep 1
                        COUNT=$(( $COUNT + 1 ))
                    done
                '''
            }
        }
        stage('Test 2') {
            steps {
                echo 'Hello Buddy'
                sh 'uname -a'
                sleep time: 5, unit: 'MINUTES'
            }
        }        
    }
}

I simulated a preemption with the following command in the middle of the bash count loop :

gcloud compute instances simulate-maintenance-event --zone europe-west1-b --project my_project the_agent_gce_instance_name

Agent log and build log attached.

Steps to reproduce :

  • Create a pipeline job with the pipeline declarative code above
  • When the job is in the middle of a step (like the bash script), simulate a maintenance event with the above command.
  • The agent is getting the preemption event (as seen in the agent log) but the build is never failed or aborted and can last for days without a manual abort.
  • After the manual abort, the job is relaunched on another agent.

build.log jenkins_agent.log

Fabiosilvero avatar Oct 23 '23 13:10 Fabiosilvero

Hello, any updates on this ? This is quite annoying because jobs hangs and are not restarted elsewhere. Do you need additional logs or troubleshooting ?

Fabiosilvero avatar Dec 26 '23 09:12 Fabiosilvero