azure-vm-agents-plugin

Windows VMs silently deallocate with pool retention and idle retention time set to 0

Open tchrischan opened this issue 3 years ago • 5 comments

Version report

Jenkins and plugins versions report:

Jenkins: 2.319.1
OS: Linux - 4.19.0-16-cloud-amd64
---
ace-editor:1.1
ant:1.13
antisamy-markup-formatter:2.1
apache-httpcomponents-client-4-api:4.5.13-1.0
authentication-tokens:1.4
authorize-project:1.4.0
azure-acs:1.0.4
azure-ad:185.v3b416408dcb1
azure-commons:1.1.3
azure-container-registry-tasks:0.6.5
azure-credentials:198.vf9c2fdfde55c
azure-sdk:70.v63f6a95999a7
azure-vm-agents:799.va4c741108611
basic-branch-build-strategies:1.3.2
blueocean:1.25.2
blueocean-autofavorite:1.2.4
blueocean-bitbucket-pipeline:1.25.2
blueocean-commons:1.25.2
blueocean-config:1.25.2
blueocean-core-js:1.25.2
blueocean-dashboard:1.25.2
blueocean-display-url:2.4.1
blueocean-events:1.25.2
blueocean-git-pipeline:1.25.2
blueocean-github-pipeline:1.25.2
blueocean-i18n:1.25.2
blueocean-jira:1.25.2
blueocean-jwt:1.25.2
blueocean-personalization:1.25.2
blueocean-pipeline-api-impl:1.25.2
blueocean-pipeline-editor:1.25.2
blueocean-pipeline-scm-api:1.25.2
blueocean-rest:1.25.2
blueocean-rest-impl:1.25.2
blueocean-web:1.25.2
bootstrap4-api:4.6.0-3
bootstrap5-api:5.1.3-3
bouncycastle-api:2.25
branch-api:2.7.0
build-timeout:1.20
caffeine-api:2.9.2-29.v717aac953ff3
checks-api:1.7.2
cloud-stats:0.27
cloudbees-bitbucket-branch-source:2.9.11
cloudbees-folder:6.16
cmakebuilder:4.1.1
cobertura:1.17
code-coverage-api:2.0.4
command-launcher:1.6
configuration-as-code:1.55
credentials:2.6.1
credentials-binding:1.27
data-tables-api:1.11.3-4
discard-old-build:1.05
display-url-api:2.3.5
docker-build-step:2.8
docker-commons:1.17
docker-java-api:3.1.5.2
docker-plugin:1.2.3
docker-workflow:1.26
durable-task:493.v195aefbb0ff2
echarts-api:5.2.2-1
email-ext:2.85
extended-read-permission:3.2
favorite:2.3.2
font-awesome-api:5.15.4-3
forensics-api:1.7.0
git:4.10.0
git-client:3.10.0
git-server:1.10
github:1.34.1
github-api:1.301-378.v9807bd746da5
github-branch-source:2.11.3
github-checks:1.0.13
github-oauth:0.35
github-pr-coverage-status:2.1.1
global-slack-notifier:1.5
google-oauth-plugin:1.0.6
gradle:1.36
handlebars:1.1.1
handy-uri-templates-2-api:2.1.8-1.0
htmlpublisher:1.25
influxdb:3.0.2
jackson2-api:2.13.0-230.v59243c64b0a5
jaxb:2.3.0.1
jdk-tool:1.4
jenkins-design-language:1.25.2
jira:3.1.3
jjwt-api:0.11.2-9.c8b45b8bb173
jobConfigHistory:2.28.1
jquery-detached:1.2.1
jquery3-api:3.6.0-2
jsch:0.1.55.2
junit:1.53
kubernetes:1.30.11
kubernetes-cd:2.3.1
kubernetes-client-api:5.4.1
kubernetes-credentials:0.9.0
ldap:2.3
llvm-cov:1.0.0
lockable-resources:2.10
mailer:1.34
mapdb-api:1.0.9.0
matrix-auth:2.6.9
matrix-project:1.19
mercurial:2.12
metrics:4.0.2.8
momentjs:1.1.1
multibranch-build-strategy-extension:1.0.10
oauth-credentials:0.5
okhttp-api:3.14.9
pam-auth:1.6
pipeline-build-step:2.15
pipeline-github-lib:1.0
pipeline-graph-analysis:1.12
pipeline-input-step:2.12
pipeline-milestone-step:1.3.2
pipeline-model-api:1.9.3
pipeline-model-definition:1.9.3
pipeline-model-extensions:1.9.3
pipeline-rest-api:2.19
pipeline-stage-step:2.5
pipeline-stage-tags-metadata:1.9.3
pipeline-stage-view:2.19
plain-credentials:1.7
plugin-usage-plugin:2.1
plugin-util-api:2.5.1
popper-api:1.16.1-2
popper2-api:2.10.2-1
pubsub-light:1.16
resource-disposer:0.16
scm-api:2.6.5
script-security:1.78
slack:2.49
snakeyaml-api:1.29.1
sse-gateway:1.24
ssh-agent:1.23
ssh-credentials:1.19
ssh-slaves:1.33.0
sshd:3.1.0
structs:308.v852b473a2b8c
subversion:2.15.1
timestamper:1.15
token-macro:266.v44a80cf277fd
trilead-api:1.0.13
variant:1.4
windows-slaves:1.7
workflow-aggregator:2.6
workflow-api:2.47
workflow-basic-steps:2.24
workflow-cps:2640.v00e79c8113de
workflow-cps-global-lib:548.v9085a486966a
workflow-durable-task-step:1101.vf832bc1ac745
workflow-job:2.42
workflow-multibranch:2.26
workflow-scm-step:2.13
workflow-step-api:2.24
workflow-support:3.8
ws-cleanup:0.39
  • What Operating System are you using (both controller, and any agents involved in the problem)?
Controller: Debian 10
Agents: Windows Server 2019 Datacenter (Azure VMs)

Reproduction steps

  • We need to run some builds on Windows agents, and they are very expensive on the first run: we use a local vcpkg cache for compiling dependencies, so the first run takes 90 minutes versus 5-10 minutes on subsequent runs. The original strategy for these agents was Idle Retention (60 minutes), but the VMs would not reliably come out of the suspended state (https://github.com/jenkinsci/azure-vm-agents-plugin/issues/238) and would eventually starve other jobs, which could not allocate Linux VMs before hitting the global cloud limit.
  • The revised strategy was to use Pool Retention with pool size = 1 and retention time = 0, so that there would always be one VM running to accept new jobs (the template limit is 3, so it could scale up if busy). However, those VMs stay up only about 3 hours before deallocating.
  • Today a VM deallocated within 2 hours of creation (after 90 minutes was spent building):

    <===[JENKINS REMOTING CAPACITY]===>Remoting version: 4.11.2
    This is a Windows agent
    Agent successfully connected and online
    ERROR: Connection terminated
    java.io.EOFException
        at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2872)
        at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3367)
        at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:936)
        at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:379)
        at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
        at hudson.remoting.Command.readFrom(Command.java:142)
        at hudson.remoting.Command.readFrom(Command.java:128)
        at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
        at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
    Caused: java.io.IOException: Unexpected termination of the channel
        at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)

Results

Expected result:

At least 1 Windows VM should be available for scheduling at all times (the template limit is 3)

Actual result:

We have to manually "un-suspend" the Windows VM several times a day. This is a problem because there is no notification that the agent is suspended, and most of the team is several time zones away.

tchrischan avatar Dec 09 '21 18:12 tchrischan

Could you provide your configuration (redact whatever you need to)? Ideally as a configuration-as-code plugin export.

timja avatar Dec 09 '21 22:12 timja

No, I don't know why the builtInImage for a Windows template is Ubuntu 20.04 LTS (or why it's not correct for any of my templates). You can ignore the init script: we tried to extend the OS disk, but I don't think that worked. The same VM behavior happened before that script was put in earlier this week.

[...]
jenkins:
[...]
  clouds:
  - azureVM:
      azureCredentialsId: ***redacted***
      cloudName: ***redacted***
      configurationStatus: "pass"
      deploymentTimeout: 1200
      existingResourceGroupName: ***redacted***
      maxVirtualMachinesLimit: 20
      resourceGroupReferenceType: "existing"
      vmTemplates:
[...]
      - agentLaunchMethod: "SSH"
        agentWorkspace: "c:\\jenkins"
        builtInImage: "Ubuntu 20.04 LTS"
        credentialsId: ***redacted***
        diskType: "managed"
        doNotUseMachineIfInitFails: false
        executeInitScriptAsRoot: false
        existingStorageAccountName: ***redacted***
        imageReference:
          galleryImageDefinition: ***redacted***
          galleryImageVersion: ***redacted***
          galleryName: ***redacted***
          galleryResourceGroup: ***redacted***
          gallerySubscriptionId: ***redacted***
        imageTopLevelType: "advanced"
        initScript: "Resize-Partition -DriveLetter C -Size ((Get-PartitionSupportedSize\
          \ -DriveLetter C).SizeMax)"
        javaPath: "java"
        labels: "windows"
        location: "East US"
        maximumDeploymentSize: 3
        newStorageAccountName: ***redacted***
        noOfParallelJobs: 1
        osDiskSize: 300
        osDiskStorageAccountType: "StandardSSD_LRS"
        osType: "Windows"
        retentionStrategy:
          azureVMCloudPool:
            poolSize: 1
            retentionInHours: 0
        shutdownOnIdle: true
        storageAccountNameReferenceType: "existing"
        storageAccountType: "Standard_LRS"
        subnetName: ***redacted***
        templateDesc: "Windows 2019 Datacenter pre-loaded for ***redacted*** builds"
        templateName: ***redacted***
        usageMode: EXCLUSIVE
        usePrivateIP: true
        virtualMachineSize: "Standard_D4ds_v5"
        virtualNetworkName: ***redacted***
        virtualNetworkResourceGroupName: ***redacted***
[...]

tchrischan avatar Dec 10 '21 04:12 tchrischan

If you’re managing it with JCasC, you can remove builtInImage since https://github.com/jenkinsci/azure-vm-agents-plugin/releases/tag/795.vd5903dae1139

It doesn’t cause any harm, though.

timja avatar Dec 10 '21 07:12 timja

Why don’t you disable shutdownOnIdle?

timja avatar Dec 10 '21 07:12 timja

Because deleting the VM on idle instead means the local vcpkg cache is rebuilt on every run, so each run will take a very long time. That's what we're trying to avoid.

tchrischan avatar Dec 16 '21 14:12 tchrischan
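
For reference, the change asked about above (disabling shutdownOnIdle while keeping Pool Retention) might look roughly like the fragment below. This is only a sketch that reuses keys from the configuration export posted earlier in the thread. All other template keys are assumed unchanged, and nothing in the thread confirms that this setting stops the VMs from deallocating or how it affects the vcpkg cache.

```yaml
# Sketch only: every key is copied from the posted export; omitted keys stay as they were.
jenkins:
  clouds:
  - azureVM:
      vmTemplates:
      - labels: "windows"
        osType: "Windows"
        retentionStrategy:
          azureVMCloudPool:
            poolSize: 1          # keep one VM in the pool
            retentionInHours: 0  # value from the original report
        # Changed from true, per the question in the thread.
        # Assumption: not verified here as a fix for the deallocation.
        shutdownOnIdle: false
```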