google-compute-engine-plugin
google-compute-engine-plugin copied to clipboard
Long wait times to provision a VM in GCP
Jenkins and plugins versions report
Environment
Paste the output here
What Operating System are you using (both controller, and any agents involved in the problem)?
Jenkins: 2.440.3 OS: Linux - 6.5.0-1018-gcp Java: 17.0.10 - Private Build (OpenJDK 64-Bit Server VM)
Office-365-Connector:4.21.1 Parameterized-Remote-Trigger:3.2.0 ace-editor:1.1 ansicolor:1.0.4 ant:497.v94e7d9fffa_b_9 antisamy-markup-formatter:162.v0e6ec0fcfcf6 apache-httpcomponents-client-4-api:4.5.14-208.v438351942757 artifactory:4.0.6 asm-api:9.7-33.v4d23ef79fcc8 authentication-tokens:1.53.v1c90fd9191a_b_ aws-credentials:231.v08a_59f17d742 aws-java-sdk-ec2:1.12.696-451.v0651a_da_9ca_ec aws-java-sdk-minimal:1.12.696-451.v0651a_da_9ca_ec blueocean:1.27.12 blueocean-bitbucket-pipeline:1.27.12 blueocean-commons:1.27.12 blueocean-config:1.27.12 blueocean-core-js:1.27.12 blueocean-dashboard:1.27.12 blueocean-display-url:2.4.2 blueocean-events:1.27.12 blueocean-git-pipeline:1.27.12 blueocean-github-pipeline:1.27.12 blueocean-i18n:1.27.12 blueocean-jwt:1.27.12 blueocean-personalization:1.27.12 blueocean-pipeline-api-impl:1.27.12 blueocean-pipeline-editor:1.27.12 blueocean-pipeline-scm-api:1.27.12 blueocean-rest:1.27.12 blueocean-rest-impl:1.27.12 blueocean-web:1.27.12 bootstrap5-api:5.3.3-1 bouncycastle-api:2.30.1.77-225.v26ea_c9455fd9 branch-api:2.1169.va_f810c56e895 build-blocker-plugin:165.v5ecb_fb_f61520 build-name-setter:2.4.2 build-timeout:1.32 build-timestamp:1.0.3 build-user-vars-plugin:166.v52976843b_435 caffeine-api:3.1.8-133.v17b_1ff2e0599 checks-api:2.2.0 cloudbees-bitbucket-branch-source:886.v44cf5e4ecec5 cloudbees-folder:6.901.vb_4c7a_da_75da_3 command-launcher:107.v773860566e2e commons-httpclient3-api:3.1-3 commons-lang3-api:3.13.0-62.v7d18e55f51e2 commons-text-api:1.11.0-109.vfe16c66636eb_ conditional-buildstep:1.4.3 config-file-provider:973.vb_a_80ecb_9a_4d0 credentials:1337.v60b_d7b_c7b_c9f credentials-binding:677.vdc9d38cb_254d custom-folder-icon:2.12 datadog:7.0.1 date-parameter:0.0.4 description-setter:239.vd0a_6b_785f92d display-upstream-changes:0.3.2 display-url-api:2.204.vf6fddd8a_8b_e9 docker-commons:439.va_3cb_0a_6a_fb_29 docker-workflow:572.v950f58993843 durable-task:555.v6802fe0f0b_82 echarts-api:5.5.0-1 email-ext:1806.v856a_01a_fa_39a_ envinject:2.908.v66a_774b_31d93 envinject-api:1.199.v3ce31253ed13 environment-script:100.v3a_f1a_6a_b_7549 favorite:2.208.v91d65b_7792a_c file-parameters:316.va_83a_1221db_a_7 flexible-publish:0.16.1 font-awesome-api:6.5.2-1 generic-webhook-trigger:2.2.1 gerrit-code-review:0.4.9 gerrit-trigger:2.41.1 git:5.2.2 git-client:4.7.0 git-server:117.veb_68868fa_027 github:1.38.0 github-api:1.318-461.v7a_c09c9fa_d63 github-branch-source:1789.v5b_0c0cea_18c3 github-oauth:597.ve0c3480fcb_d0 github-pullrequest:0.7.0 google-compute-engine:4.573.v7dcd6a_37a_ee2 google-oauth-plugin:1.330.vf5e86021cb_ec gradle:2.11 gson-api:2.10.1-15.v0d99f670e0a_7 handy-uri-templates-2-api:2.1.8-30.v7e777411b_148 htmlpublisher:1.33 hudson-wsclean-plugin:1.0.8 instance-identity:185.v303dc7c645f9 ionicons-api:74.v93d5eb_813d5f jackson2-api:2.17.0-379.v02de8ec9f64c jakarta-activation-api:2.1.3-1 jakarta-mail-api:2.1.3-1 javadoc:243.vb_b_503b_b_45537 javax-activation-api:1.2.0-6 javax-mail-api:1.6.2-9 jaxb:2.3.9-1 jdk-tool:73.vddf737284550 jenkins-design-language:1.27.12 jersey2-api:2.42-147.va_28a_44603b_d5 jfrog:1.5.0 jira:3.13 jjwt-api:0.11.5-112.ve82dfb_224b_a_d jnr-posix-api:3.1.19-2 job-import-plugin:3.6 jobConfigHistory:1229.v3039470161a_d joda-time-api:2.12.7-29.v5a_b_e3a_82269a_ jquery3-api:3.7.1-2 jsch:0.2.16-86.v42e010d9484b_ json-api:20240303-41.v94e11e6de726 json-path-api:2.9.0-58.v62e3e85b_a_655 junit:1265.v65b_14fa_f12f0 ldap:725.v3cb_b_711b_1a_ef logstash:2.5.0218.v0a_ff8fefc12b_ mailer:472.vf7c289a_4b_420 matrix-auth:3.2.2 matrix-project:822.824.v14451b_c0fd42 maven-plugin:3.23 mina-sshd-api-common:2.12.1-101.v85b_e08b_780dd mina-sshd-api-core:2.12.1-101.v85b_e08b_780dd momentjs:1.1.1 multiple-scms:0.8 naginator:1.449.ve19751d70eb_0 nodelabelparameter:1.12.0 oauth-credentials:0.646.v02b_66dc03d2e okhttp-api:4.11.0-172.vda_da_1feeb_c6e pam-auth:1.10 parameterized-scheduler:262.v00f3d90585cc parameterized-trigger:806.vf6fff3e28c3e periodicbackup:2.0 pipeline-build-step:540.vb_e8849e1a_b_d8 pipeline-github:2.8-159.09e4403bc62f pipeline-github-lib:42.v0739460cda_c4 pipeline-graph-analysis:216.vfd8b_ece330ca_ pipeline-groovy-lib:704.vc58b_8890a_384 pipeline-input-step:495.ve9c153f6067b_ pipeline-milestone-step:119.vdfdc43fc3b_9a_ pipeline-model-api:2.2198.v41dd8ef6dd56 pipeline-model-definition:2.2198.v41dd8ef6dd56 pipeline-model-extensions:2.2198.v41dd8ef6dd56 pipeline-rest-api:2.34 pipeline-stage-step:312.v8cd10304c27a_ pipeline-stage-tags-metadata:2.2198.v41dd8ef6dd56 pipeline-stage-view:2.34 plain-credentials:179.vc5cb_98f6db_38 plugin-util-api:4.1.0 popper2-api:2.11.6-4 preSCMbuildstep:71.v1f2990a_37e27 pubsub-light:1.18 python:1.3 rebuild:332.va_1ee476d8f6d repo:1.16.0 resource-disposer:0.23 run-condition:1.7 scm-api:690.vfc8b_54395023 script-security:1336.vf33a_a_9863911 snakeyaml-api:2.2-111.vc6598e30cc65 sonar:2.17.2 sonar-gerrit:385.v890738eeff15 sse-gateway:1.26 ssh-credentials:337.v395d2403ccd4 ssh-slaves:2.948.vb_8050d697fec sshd:3.322.v159e91f6a_550 structs:337.v1b_04ea_4df7c8 timestamper:1.27 token-macro:400.v35420b_922dcb_ trilead-api:2.142.v748523a_76693 uno-choice:2.8.3 variant:60.v7290fc0eb_b_cd workflow-aggregator:596.v8c21c963d92d workflow-api:1291.v51fd2a_625da_7 workflow-basic-steps:1058.vcb_fc1e3a_21a_9 workflow-cps:3894.3896.vca_2c931e7935 workflow-cps-global-lib:612.v55f2f80781ef workflow-durable-task-step:1336.v768003e07199 workflow-job:1400.v7fd111b_ec82f workflow-multibranch:773.vc4fe1378f1d5 workflow-scm-step:427.v4ca_6512e7df1 workflow-step-api:657.v03b_e8115821b_ workflow-support:907.v6713a_ed8a_573 ws-cleanup:0.45
Reproduction steps
Configure cloud -> GCP configured Defined a label for provisioning the one-shot VM In a pipeline job use the label in agent directive
Expected Results
The job gets a VM provisioned in GCP with the node label provisioned
Actual Results
Getting the VM, delay is varying between 2 minutes to 1 hour 30 minutes. Do not see any logs in Jenkins job console, jenkins log or GCP logs. In the console log, before provisioning the VM, it prints all the available labels and waits there until the VM is provisioned
Anything else?
No response
Are you interested in contributing a fix?
No response
We see this frequently in our Jenkins log
May 27, 2024 12:36:21 PM WARNING com.google.cloud.graphite.platforms.plugin.client.ComputeClient lambda$waitForOperationCompletion$11 Error retrieving operation. com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found GET https://compute.googleapis.com/compute/v1/projects/my-proj/zones/us-central1-a/operations/operation-1716316733190-618fb21342fd8-e697bbc4-1db85ce9 { "code" : 404, "errors" : [ { "domain" : "global", "message" : "The resource 'projects/my-proj/zones/us-central1-a/operations/operation-1716316733190-618fb21342fd8-e697bbc4-1db85ce9' was not found", "reason" : "notFound" } ], "message" : "The resource 'projects/my-proj/zones/us-central1-a/operations/operation-1716316733190-618fb21342fd8-e697bbc4-1db85ce9' was not found" } at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146) at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118) at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37) at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:439) at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111) at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:525) at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:466) at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:576) at com.google.cloud.graphite.platforms.plugin.client.ComputeWrapper.getZoneOperation(ComputeWrapper.java:164) at com.google.cloud.graphite.platforms.plugin.client.ComputeClient.getZoneOperation(ComputeClient.java:584) at com.google.cloud.graphite.platforms.plugin.client.ComputeClient.lambda$waitForOperationCompletion$11(ComputeClient.java:673) at org.awaitility.core.CallableCondition$ConditionEvaluationWrapper.eval(CallableCondition.java:99) at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:248) at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:235) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840)
This may or may not be relevant but could be helpful, This particular operation is referred first after a 'ZONE_RESOURCE_POOL_EXHAUSTED' error occured in GCP. Even after the resource is available in GCP, Jenkins is referrrring this occurance.
We do see the exact same stack trace. When there's a failure at the GCP endpoint, and the operation is unsuccessful, Jenkins is unable to forget about the operation that failed and our logs are filled with the same above message. The message indicates the operation has been deleted from GCP (it defaults to 30 days to be deleted from the API)... This means the issue probably occurred 30 days prior to see the message in the jenkins logs...
After we start seeing these errors, we observe Jenkins to lag on creating new instances on demand when a pipeline requests them (although creating them through the UI is immediate).
We haven't been able to find the root cause, but are suffering from this same scenario...
We saw that from some of the earlier events, the nodes were created/tried to be created. In the Jenkins nodes list they were persisting. After cleaning up all the stale nodes, we are not seeing this issue. We are observing for a week now, we did not see any reoccurrence of it. We are suspecting it was due to the stale nodes that were not deleted.
The error message looks a bit similar this problem https://github.com/jenkinsci/google-compute-engine-plugin/issues/469 - does your problem also go away after a couple of hours or does it persist indefinitely?
Removing old nodes in /var/lib/jenkins/nodes/ and restarting jenkins fixed the issue for me