GCP VM instance does not terminate after idle timeout
Related to issue #834
Facing this issue with cml v0.19.0
I'm starting GCP instance using the following command inside a docker container which uses iterativeai/cml:latest image in a buildkite pipeline. I also share GOOGLE_APPLICATION_CREDENTIALS_DATA as environment variable, it starts and register self-hosted runner without any problem.
I configured gcloud with:
echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json
gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json
gcloud -q config set project PROJECT
gcloud -q auth configure-docker
run cml runner launch with
cml runner launch \
--cloud=gcp \
--cloud-region=us-central1-a \
--cloud-type=m+t4 \
--name="$CML_RUNNER_NAME" \
--labels=gcp-snapshot-creation \
--token="$BUILDKITE_GITHUB_TOKEN" \
--repo=https://github.com/ORG/REPO.git \
--cloud-hdd-size=60 \
--idle-timeout=120
However, after idle-timeout passes GCP VM doesn't terminate. The output of journalctl --unit cml --no-pager is as follows:
-- Logs begin at Fri 2023-06-09 10:33:16 UTC, end at Fri 2023-06-09 11:28:01 UTC. --
Jun 09 10:38:43 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: Started cml.service.
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 287ms"}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 234ms"}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Preparing workdir /tmp/tmp.SpVQaKjpnL/.cml/cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Launching github runner"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Terraform 1.4.6"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Outputs: 0"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Connected to acpid service."}
Jun 09 10:38:58 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 307ms"}
Jun 09 10:39:01 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:01.672Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"ready"}
Jun 09 10:39:20 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:20.270Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
Jun 09 10:43:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:44.840Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
Jun 09 10:43:47 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:47.170Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
Jun 09 11:02:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T11:02:44.588Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Unregistering runner cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 403 in 159ms"}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Retrying because of rate limit in 990 seconds"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 260ms"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"DELETE /repos/ORG/REPO/actions/runners/314 - 204 in 371ms"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"\tSuccess"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Waiting 10 seconds to destroy"}
Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"runner status","reason":"timeout:120","status":"terminated"}
Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: cml.service: Succeeded
I am not facing any issue with aws ec2 instances and those terminate after idle-timeout.
@hopeai Can you share more of the workflow that you are running even if it's mostly redacted? I going to try and reproduce the behavior you are seeing again.
Sure @dacbd,
I'm using the following Buildkite pipeline to manually launch a gpu server when I need.
agents:
queue: "my-buildkite-agent"
steps:
- block: ":warning: Provision ec2 instance"
- command: "buildkite-agent pipeline upload .buildkite/cml_launch_runner.yml"
label: ":pipeline: Pipeline upload"
if: "build.branch == 'MYBRANCH'"
- wait
- command: "buildkite-agent pipeline upload .buildkite/trigger_my_buildkite_pipeline.yaml"
label: ":pipeline: Pipeline upload"
if: "build.branch == 'MYBRANCH'"
The two pipelines which previous pipeline uploads are as follows:
- .buildkite/cml_launch_runner.yml
agents:
queue: "my-buildkite-agent"
steps:
- label: ":gcloud: setup gcp creds"
command: "./scripts/configure_gcloud.sh"
- wait
- label: ":gcloud: launch GCP Snapshot Creation Instance"
command:
- "./scripts/cml_launch_runner.sh"
plugins:
- docker#v5.3.0:
image: "iterativeai/cml:latest"
environment:
- "BUILDKITE_GITHUB_USERNAME"
- "BUILDKITE_GITHUB_TOKEN"
- "BUILDKITE_BRANCH"
- "BUILDKITE_BUILD_NUMBER"
- "BUILDKITE_COMMIT"
- "GOOGLE_APPLICATION_CREDENTIALS_DATA"
- .buildkite/trigger_my_buildkite_pipeline.yaml
agents:
queue: "my-buildkite-agent"
steps:
- label: ":github: Trigger Github Actions"
command:
- "./scripts/trigger_github_workflow.sh"
The bash scripts which are used in the pipelines are as follows:
- ./scripts/configure_gcloud.sh
#!/bin/bash
set -eu -o pipefail
echo ":gcloud: Configure gcloud..."
export GOOGLE_APPLICATION_CREDENTIALS_DATA=$BUILDKITE_GS_APPLICATION_CREDENTIALS_JSON
echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json
gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json
gcloud -q config set project MY-PROJECT
gcloud -q auth configure-docker
gcloud -v
echo "Done!"
- ./scripts/cml_launch_runner.sh
#!/bin/bash
set -eu -o pipefail
# Build a unique instance name
CML_RUNNER_NAME="some-name"
# Provision an GCP EC2 g4dn.xlage instance
cml runner launch \
--cloud=gcp \
--cloud-region=us-central1-a \
--cloud-type=m+t4 \
--name="$CML_RUNNER_NAME" \
--labels=gpu-cml \
--token="$BUILDKITE_GITHUB_TOKEN" \
--repo=https://github.com/ORG/myrepo.git \
--cloud-hdd-size=60 \
--idle-timeout=120
- ./scripts/trigger_github_workflow.sh
#!/bin/bash
curl -L \
-X POST \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer $BUILDKITE_GITHUB_TOKEN" \
-H "X-GitHub-Api-Version: 2022-11-28" \
--url https://api.github.com/repos/intenseye/myrepo/actions/workflows/my_workflow.yml/dispatches \
-d '{"ref":"'"$BUILDKITE_BRANCH"'"}'
By running these pipelines everything works as expected and cml starts a gpu server on GCP and this self-hosted runner accepts my GitHub workflow jobs and jobs finish successfully. However, after idle-timeout GCP VM is not terminated. I checked the logs in self-hosted runner and shared here. I think this could help you to reproduce this behavior.
Related to issue #834
Facing this issue with cml v0.19.0
I'm starting GCP instance using the following command inside a docker container which uses
iterativeai/cml:latestimage in a buildkite pipeline. I also shareGOOGLE_APPLICATION_CREDENTIALS_DATAas environment variable, it starts and register self-hosted runner without any problem.I configured gcloud with:
echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json gcloud -q config set project PROJECT gcloud -q auth configure-dockerrun
cml runner launchwithcml runner launch \ --cloud=gcp \ --cloud-region=us-central1-a \ --cloud-type=m+t4 \ --name="$CML_RUNNER_NAME" \ --labels=gcp-snapshot-creation \ --token="$BUILDKITE_GITHUB_TOKEN" \ --repo=https://github.com/ORG/REPO.git \ --cloud-hdd-size=60 \ --idle-timeout=120However, after idle-timeout passes GCP VM doesn't terminate. The output of
journalctl --unit cml --no-pageris as follows:-- Logs begin at Fri 2023-06-09 10:33:16 UTC, end at Fri 2023-06-09 11:28:01 UTC. -- Jun 09 10:38:43 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: Started cml.service. Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 287ms"} Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 234ms"} Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."} Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Preparing workdir /tmp/tmp.SpVQaKjpnL/.cml/cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."} Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Launching github runner"} Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Terraform 1.4.6"} Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."} Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."} Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Outputs: 0"} Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Connected to acpid service."} Jun 09 10:38:58 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 307ms"} Jun 09 10:39:01 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:01.672Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"ready"} Jun 09 10:39:20 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:20.270Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"} Jun 09 10:43:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:44.840Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true} Jun 09 10:43:47 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:47.170Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"} Jun 09 11:02:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T11:02:44.588Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true} Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Unregistering runner cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."} Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 403 in 159ms"} Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Retrying because of rate limit in 990 seconds"} Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 260ms"} Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"DELETE /repos/ORG/REPO/actions/runners/314 - 204 in 371ms"} Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"\tSuccess"} Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Waiting 10 seconds to destroy"} Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"runner status","reason":"timeout:120","status":"terminated"} Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: cml.service: SucceededI am not facing any issue with aws ec2 instances and those terminate after idle-timeout.
The total runtime Is close to one hour which makes me think that your gcloud -q auth activate-service-accouth method of authorization is generating an Oauth token which has expired by the time CML tries to shutdown.
I believe if you run the following query in GCP's "log explorer" you will NOT see a delete request (because it will be unauthorised from the expired credentials)
protoPayload.authenticationInfo.principalEmail="your_service_account_name@your_project_name.iam.gserviceaccount.com"
resource.type="gce_instance"
protoPayload.methodName="v1.compute.instances.delete"
Can you try authenticating your github actions via the following:
- name: 'Authenticate to Google Cloud'
uses: 'google-github-actions/auth@v0'
with:
credentials_json: ${{ secrets.GCP_CML_RUNNER_KEY }}
where secrets.GCP_CML_RUNNER_KEY is the contents of the json key. ( I would remove the white space after and before the jsons opening/closing {}'s so that github actions doesn't try to obsfcate single characters like { or })
you can also configure docker in a similar fashion:
- name: Login to GAR
uses: docker/login-action@v2
with:
registry: <location>-docker.pkg.dev
username: _json_key
password: ${{ secrets.GAR_JSON_KEY }}
Thanks a lot @dacbd for pointing this out. Let me try this