azure-vm-agents-plugin

Spot VM evictions are not reported to Jenkins, so builds hang and their status is not reported

Open tchrischan opened this issue 4 years ago • 9 comments

Version report

Jenkins and plugins versions report:

Result
Jenkins: 2.303.2
OS: Linux - 4.19.0-16-cloud-amd64
---
ace-editor:1.1
ant:1.11
antisamy-markup-formatter:2.1
apache-httpcomponents-client-4-api:4.5.13-1.0
authentication-tokens:1.4
authorize-project:1.4.0
azure-acs:1.0.4
azure-ad:180.v8b1e80e6f242
azure-artifact-manager:86.va2aa4b1038c7
azure-commons:1.1.3
azure-container-registry-tasks:0.6.5
azure-credentials:182.v3ccd4a755864
azure-iot-edge:2.0.0
azure-sdk:23.v5682688d0eef
azure-vm-agents:783.v58077630847d
basic-branch-build-strategies:1.3.2
blueocean:1.24.3
blueocean-autofavorite:1.2.4
blueocean-bitbucket-pipeline:1.24.3
blueocean-commons:1.24.6
blueocean-config:1.24.3
blueocean-core-js:1.24.3
blueocean-dashboard:1.24.3
blueocean-display-url:2.4.0
blueocean-events:1.24.3
blueocean-git-pipeline:1.24.3
blueocean-github-pipeline:1.24.3
blueocean-i18n:1.24.3
blueocean-jira:1.24.3
blueocean-jwt:1.24.3
blueocean-personalization:1.24.3
blueocean-pipeline-api-impl:1.24.3
blueocean-pipeline-editor:1.24.3
blueocean-pipeline-scm-api:1.24.3
blueocean-rest:1.24.6
blueocean-rest-impl:1.24.3
blueocean-web:1.24.3
bootstrap4-api:4.5.3-1
bootstrap5-api:5.1.0-3
bouncycastle-api:2.20
branch-api:2.6.5
build-timeout:1.20
caffeine-api:2.9.2-29.v717aac953ff3
checks-api:1.7.2
cloud-stats:0.27
cloudbees-bitbucket-branch-source:2.9.6
cloudbees-folder:6.15
cmakebuilder:2.6.3
cobertura:1.16
code-coverage-api:1.4.1
command-launcher:1.5
configuration-as-code:1.54
copyartifact:1.46
credentials:2.6.1
credentials-binding:1.27
data-tables-api:1.10.25-3
discard-old-build:1.05
display-url-api:2.3.5
docker-build-step:2.6
docker-commons:1.17
docker-java-api:3.1.5.2
docker-plugin:1.2.1
docker-workflow:1.25
durable-task:1.35
echarts-api:5.1.2-11
email-ext:2.81
extended-read-permission:3.2
favorite:2.3.2
font-awesome-api:5.15.4-1
forensics-api:1.3.0
git:4.9.0
git-client:3.10.0
git-server:1.9
github:1.32.0
github-api:1.123
github-branch-source:2.10.2
github-checks:1.0.8
github-oauth:0.33
github-pr-coverage-status:2.1.1
github-pullrequest:0.2.8
global-slack-notifier:1.5
google-oauth-plugin:1.0.2
gradle:1.36
handlebars:1.1.1
handy-uri-templates-2-api:2.1.8-1.0
htmlpublisher:1.25
icon-shim:2.0.3
influxdb:3.0.2
jackson2-api:2.12.3
jaxb:2.3.0.1
jdk-tool:1.4
jenkins-design-language:1.24.3
jira:3.1.3
jjwt-api:0.11.2-9.c8b45b8bb173
jobConfigHistory:2.27
jquery-detached:1.2.1
jquery3-api:3.6.0-2
jsch:0.1.55.2
junit:1.48
kubernetes:1.28.5
kubernetes-cd:2.3.1
kubernetes-client-api:4.11.1
kubernetes-credentials:0.7.0
ldap:2.3
llvm-cov:1.0.0
lockable-resources:2.10
mailer:1.34
mapdb-api:1.0.9.0
matrix-auth:2.6.6
matrix-project:1.19
mercurial:2.12
metrics:4.0.2.7
momentjs:1.1.1
multibranch-build-strategy-extension:1.0.10
oauth-credentials:0.4
okhttp-api:3.14.9
pam-auth:1.6
pipeline-build-step:2.13
pipeline-github-lib:1.0
pipeline-graph-analysis:1.10
pipeline-input-step:2.12
pipeline-milestone-step:1.3.1
pipeline-model-api:1.7.2
pipeline-model-declarative-agent:1.1.1
pipeline-model-definition:1.7.2
pipeline-model-extensions:1.7.2
pipeline-rest-api:2.19
pipeline-stage-step:2.5
pipeline-stage-tags-metadata:1.7.2
pipeline-stage-view:2.19
plain-credentials:1.7
plugin-usage-plugin:1.1
plugin-util-api:2.4.0
popper-api:1.16.0-7
popper2-api:2.9.3-1
pubsub-light:1.13
resource-disposer:0.14
scm-api:2.6.5
script-security:1.78
slack:2.48
snakeyaml-api:1.29.1
sse-gateway:1.24
ssh-agent:1.23
ssh-credentials:1.19
ssh-slaves:1.32.0
sshd:3.0.3
structs:1.23
subversion:2.13.2
timestamper:1.11.8
token-macro:266.v44a80cf277fd
trilead-api:1.0.13
variant:1.4
windows-azure-storage:355.v4da08e72a251
windows-slaves:1.7
workflow-aggregator:2.6
workflow-api:2.46
workflow-basic-steps:2.23
workflow-cps:2.93
workflow-cps-global-lib:2.17
workflow-durable-task-step:2.37
workflow-job:2.41
workflow-multibranch:2.26
workflow-scm-step:2.13
workflow-step-api:2.24
workflow-support:3.8
ws-cleanup:0.38

Reproduction steps

  • Configure cloud VMs with the Spot instance box checked
  • Run some builds
  • Eventually, a build will run far past the expected completion time because the VM was evicted.

Results

Expected result:

The build should be reported as FAILURE. At least if it is marked failed, we can have the pipeline re-run it. Ideally, eviction would deallocate the VM and Jenkins could allocate a new spot VM with the same disk and restart the failed stage.

Actual result:

The build hangs indefinitely until aborted. The logs report the following: Connection was broken

java.io.EOFException
	at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2872)
	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3367)
	at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:936)
	at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:379)
	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
	at hudson.remoting.Command.readFrom(Command.java:142)
	at hudson.remoting.Command.readFrom(Command.java:128)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
Caused: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)

tchrischan avatar Nov 04 '21 19:11 tchrischan

Bumping this issue.

I believe this is due to this check during cleanup:

// If the machine is not idle, don't do anything.
// Could have been taken offline by the plugin while still running
// builds.
if (!azureComputer.isIdle()) {
    continue;
}

As far as I can tell, even if a spot node has been evicted, this check prevents an agent that still has jobs running on it from being deleted, so the job hangs indefinitely (until someone manually deletes the agent from Jenkins).

We potentially need to move the following check from further down to a point above this idle check:

// Check if the virtual machine exists.  If not, it could have been
// deleted in the background.  Remove from Jenkins if that is the case.
if (!AzureVMManagementServiceDelegate.virtualMachineExists(agentNode)) {

Unless there is a reason to keep a spot node around in Jenkins even if it has been deleted in Azure?
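
To make the proposed reordering concrete, here is a minimal, self-contained sketch. It is not the plugin's code: `Agent`, `agentsToRemove`, and the `existsInAzure` predicate are hypothetical stand-ins for `azureComputer.isIdle()` and `AzureVMManagementServiceDelegate.virtualMachineExists(agentNode)`. It only shows the order of the two checks, assuming that removing the node from Jenkins is the desired outcome once the VM no longer exists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of the cleanup loop; these are not the plugin's actual classes.
class CleanupOrderSketch {

    /** Stand-in for the plugin's computer type: just a name and an idle flag. */
    static final class Agent {
        final String name;
        final boolean idle;

        Agent(String name, boolean idle) {
            this.name = name;
            this.idle = idle;
        }
    }

    /**
     * Returns the names of agents that should be removed from Jenkins.
     * existsInAzure is an assumed stand-in for the plugin's VM existence check.
     */
    static List<String> agentsToRemove(List<Agent> agents, Predicate<Agent> existsInAzure) {
        List<String> toRemove = new ArrayList<>();
        for (Agent agent : agents) {
            // Existence check first: a VM that is gone (for example, evicted)
            // is removed even if Jenkins still thinks a build is running on it.
            if (!existsInAzure.test(agent)) {
                toRemove.add(agent.name);
                continue;
            }
            // Only afterwards skip agents that are busy but still exist.
            if (!agent.idle) {
                continue;
            }
            // The remaining cleanup rules for idle agents are not modeled here.
        }
        return toRemove;
    }

    public static void main(String[] args) {
        List<Agent> agents = List.of(
                new Agent("spot-evicted", false),
                new Agent("spot-busy", false),
                new Agent("spot-idle", true));
        // Pretend only "spot-evicted" no longer exists in Azure.
        System.out.println(agentsToRemove(agents, a -> !a.name.equals("spot-evicted")));
    }
}
```

One trade-off of this order is that the existence check would also run for every busy agent on each cleanup pass, which presumably means more calls to Azure than with the current order.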

Edit: Formatting, grammar

domazaris avatar Oct 03 '22 01:10 domazaris

@jglick I think you were doing some work in this area to make it easier to handle spot evictions in cloud providers

Any tips?

timja avatar Oct 03 '22 07:10 timja

Well, you can use the new

retry(count: 2, conditions: [agent()]) {
  node(…) {
    // …
  }
}

idiom, which will retry the node block if it gets killed for a recognized reason—cases where the behavior otherwise is that the build fails/aborts with an agent-related error. If the cloud plugin fails to properly terminate the node to begin with then this will not work. Normally the channel pinger ought to abort on the controller side at some point even if the cloud plugin does nothing special, though.

jglick avatar Oct 04 '22 13:10 jglick

Yes, I was very happy with the new agent/retry functionality. This normally works really well, but when a long-running sh command is running and a spot node gets reclaimed/removed by Azure, the job will hang (seemingly) indefinitely. I have tried leaving them for multiple hours before manually deleting the agent via Jenkins. Once I have manually deleted the agent in Jenkins, the job will retry and resume correctly, saving a rebuild of the whole job.

domazaris avatar Oct 04 '22 21:10 domazaris

With an SSH launcher this appears to work fine:

08:33:16  + sleep 5m
08:34:32  Cannot contact no-availability1646d0: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
08:35:43  Agent no-availability1646d0 was deleted; cancelling node body
08:35:43  Could not connect to no-availability1646d0 to send interrupt signal to process
08:35:43  [Pipeline] }
08:35:43  [Pipeline] // node
08:35:43  [Pipeline] }
08:35:44  [Pipeline] // stage
08:35:44  [Pipeline] }
08:35:44  Agent was removed
08:35:44  org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: f9d1cf35-4b9a-48b9-b1c5-8cf0a2739b23
08:35:44  Retrying

Not long after the agent is evicted this is logged:

Cannot contact no-availability1646d0

About one minute later a timeout is hit, and then the retry kicks in fine:

retry(count: 2, conditions: [agent()]) {

It could be quicker but it seems to work fine.

I'll try later with an inbound agent.

timja avatar Feb 23 '25 08:02 timja

With an inbound agent it was slower to cancel the task, but it got there after 8 minutes and retried correctly:

az vm simulate-eviction -g tim-azure-vm-agents --name inbound1328e0
12:33:41  + sleep 5m
12:37:52  Cannot contact inbound1328e0: java.lang.InterruptedException
12:45:05  Agent inbound1328e0 was deleted; cancelling node body
12:45:05  Could not connect to inbound1328e0 to send interrupt signal to process

@jglick do you know of any plugin that handles spot instances in a better way?

I searched the ec2 and kubernetes plugins and didn't see anything.

I found on Stack Overflow that you can have a process on the VM poll an endpoint every second to look for an eviction notice and handle it in the application, or you could configure an alert that calls a webhook when a spot agent is evicted.
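
For reference, a rough sketch of what the polling approach could look like, written from my recollection of the Azure Scheduled Events documentation rather than tested on a VM. The assumptions here are mine: the metadata URL and `api-version`, the `Metadata: true` header, and `Preempt` as the event type used for spot evictions. The class and method names are made up for illustration, and what to do once a notice is seen is left to the caller.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative and not part of any plugin.
class EvictionNoticeSketch {

    // Assumed Scheduled Events endpoint of the Azure Instance Metadata Service
    // (link-local address, only reachable from inside the VM).
    static final URI SCHEDULED_EVENTS =
            URI.create("http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01");

    // Assumption: spot evictions appear as events with "EventType": "Preempt".
    // A regex keeps the sketch dependency-free; a real JSON parser would be more robust.
    private static final Pattern PREEMPT =
            Pattern.compile("\"EventType\"\\s*:\\s*\"Preempt\"");

    /** Returns true if the response body mentions a Preempt event. */
    static boolean hasPreemptEvent(String json) {
        return json != null && PREEMPT.matcher(json).find();
    }

    /** Builds the GET request; the Metadata header is assumed to be required. */
    static HttpRequest buildRequest() {
        return HttpRequest.newBuilder(SCHEDULED_EVENTS)
                .header("Metadata", "true")
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
    }

    /** Performs one poll and reports whether an eviction notice was seen. */
    static boolean pollOnce(HttpClient client) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(), HttpResponse.BodyHandlers.ofString());
        return response.statusCode() == 200 && hasPreemptEvent(response.body());
    }
}
```

A caller would invoke `pollOnce(HttpClient.newHttpClient())` on a schedule (for example once per second, as suggested above) and react to the first `true`.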

I guess the question is whether it is worth trying to handle the eviction more quickly, or whether it is OK as it is.

timja avatar Feb 23 '25 13:02 timja

is there any plugin that handles spot instances in a better way do you know?

Yes. You need to extend the right superclass as per https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/372. For example https://github.com/jenkinsci/ec2-plugin/pull/1015. There is no need to use any vendor-specific API to watch for eviction notices (the node should be removed and the block retried the moment the VM actually exits), though you could if you wished (to avoid wasting a couple minutes when you know the agent is going to die soon anyway).

jglick avatar Mar 05 '25 19:03 jglick