`cml runner` failure edge case
Documenting an edge case that happened during a daily pulse check, the cml runner command failed but created an instance that was left on, Let's call it "unattended" or "orphaned"?
ran with
cml-runner \
--single \
--log=debug \
--idle-timeout=1800 \
--token=*** \
--cloud=gcp \
--cloud-region=us-west \
--cloud-type=e2-standard-2 \
--cloud-hdd-size=10
de-potatoed log
{"level":"warn","message":"ignoring RUNNER_NAME environment variable, use CML_RUNNER_NAME or --name instead"}
{"level":"info","message":"Preparing workdir /home/runner/.cml/cml-oqbyyfr3qf..."}
{"level":"info","message":"Deploying cloud runner plan..."}
{"level":"info","message":"Terraform apply..."}
{"level":"error","message":"terraform -chdir='/home/runner/.cml/cml-oqbyyfr3qf' apply -auto-approve
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# iterative_cml_runner.runner will be created
+ resource "iterative_cml_runner" "runner" {
+ cloud = "gcp"
+ docker_volumes = []
+ driver = "github"
+ id = (known after apply)
+ idle_timeout = 1800
+ instance_hdd_size = 10
+ instance_ip = (known after apply)
+ instance_launch_time = (known after apply)
+ instance_type = "e2-standard-2"
+ labels = "cml"
+ name = "cml-oqbyyfr3qf"
+ region = "us-west"
+ repo = "xxxx"
+ single = true
+ spot = false
+ spot_price = -1
+ ssh_public = (known after apply)
+ token = (sensitive value)
***
Plan: 1 to add, 0 to change, 0 to destroy.
iterative_cml_runner.runner: Creating...
iterative_cml_runner.runner: Still creating... [10s elapsed]
iterative_cml_runner.runner: Still creating... [20s elapsed]
iterative_cml_runner.runner: Still creating... [30s elapsed]
iterative_cml_runner.runner: Still creating... [40s elapsed]
iterative_cml_runner.runner: Still creating... [50s elapsed]
iterative_cml_runner.runner: Still creating... [1m0s elapsed]
iterative_cml_runner.runner: Still creating... [1m10s elapsed]
iterative_cml_runner.runner: Still creating... [1m20s elapsed]
iterative_cml_runner.runner: Still creating... [1m30s elapsed]
iterative_cml_runner.runner: Still creating... [1m40s elapsed]
iterative_cml_runner.runner: Still creating... [1m50s elapsed]
iterative_cml_runner.runner: Still creating... [2m0s elapsed]
iterative_cml_runner.runner: Still creating... [2m10s elapsed]
iterative_cml_runner.runner: Still creating... [2m20s elapsed]
iterative_cml_runner.runner: Still creating... [2m30s elapsed]
iterative_cml_runner.runner: Still creating... [2m40s elapsed]
iterative_cml_runner.runner: Still creating... [2m50s elapsed]
iterative_cml_runner.runner: Still creating... [3m0s elapsed]
iterative_cml_runner.runner: Still creating... [3m10s elapsed]
iterative_cml_runner.runner: Still creating... [3m20s elapsed]
iterative_cml_runner.runner: Still creating... [3m30s elapsed]
╷
│ Error: Error checking the runner status
│
│ with iterative_cml_runner.runner,
│ on main.tf line 14, in resource "iterative_cml_runner" "runner":
│ 14: resource "iterative_cml_runner" "runner" {
│
│ -- Logs begin at Mon 2022-03-07 16:07:02 UTC, end at Mon 2022-03-07
│ 16:10:01 UTC. --
│ Mar 07 16:09:58 cml-oqbyyfr3qf systemd[1]: Started cml.service.
│ Mar 07 16:10:00 cml-oqbyyfr3qf cml.sh[16091]:
│ {"level":"error","message":"terraform
│ -chdir='/tmp/tmp.sJzxlHtykd/.cml/cml-oqbyyfr3qf' init\n\t\nInitializing the
│ backend...\n\nInitializing provider plugins...\n- Finding latest version of
│ iterative/iterative...\n\n\t╷\n│ Error: Failed to install provider\n│ \n│
│ Error while installing iterative/iterative v0.9.14: could not query\n│
│ provider registry for registry.terraform.io/iterative/iterative: failed
│ to\n│ retrieve authentication checksums for provider: 403
│ Forbidden\n╵\n\n","stack":"Error: terraform
│ -chdir='/tmp/tmp.sJzxlHtykd/.cml/cml-oqbyyfr3qf' init\n\t\nInitializing the
│ backend...\n\nInitializing provider plugins...\n- Finding latest version of
│ iterative/iterative...\n\n\t╷\n│ Error: Failed to install provider\n│ \n│
│ Error while installing iterative/iterative v0.9.14: could not query\n│
│ provider registry for registry.terraform.io/iterative/iterative: failed
│ to\n│ retrieve authentication checksums for provider: 403
│ Forbidden\n╵\n\n\n at
│ /usr/lib/node_modules/@dvcorg/cml/src/utils.js:20:27\n at
│ ChildProcess.exithandler (child_process.js:3[15](https://github.com/xxx/runs/5451314005?check_suite_focus=true#step:4:15):5)\n at ChildProcess.emit
│ (events.js:314:20)\n at maybeClose (internal/child_process.js:1022:[16](xxx/runs/5451314005?check_suite_focus=true#step:4:16))\n
│ at Process.ChildProcess._handle.onexit
│ (internal/child_process.js:287:5)","status":"terminated"}
│ Mar 07 16:10:00 cml-oqbyyfr3qf cml.sh[16091]:
│ {"level":"info","message":"waiting 20 seconds before exiting..."}
│
╵
","stack":"Error: terraform -chdir='/home/runner/.cml/cml-oqbyyfr3qf' apply -auto-approve
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# iterative_cml_runner.runner will be created
+ resource "iterative_cml_runner" "runner" {
+ cloud = "gcp"
+ docker_volumes = []
+ driver = "github"
+ id = (known after apply)
+ idle_timeout = 1800
+ instance_hdd_size = 10
+ instance_ip = (known after apply)
+ instance_launch_time = (known after apply)
+ instance_type = "e2-standard-2"
+ labels = "cml"
+ name = "cml-oqbyyfr3qf"
+ region = "us-west"
+ repo = "xxx"
+ single = true
+ spot = false
+ spot_price = -1
+ ssh_public = (known after apply)
+ token = (sensitive value)
}
Plan: 1 to add, 0 to change, 0 to destroy.
iterative_cml_runner.runner: Creating...
iterative_cml_runner.runner: Still creating... [10s elapsed]
iterative_cml_runner.runner: Still creating... [20s elapsed]
iterative_cml_runner.runner: Still creating... [30s elapsed]
iterative_cml_runner.runner: Still creating... [40s elapsed]
iterative_cml_runner.runner: Still creating... [50s elapsed]
iterative_cml_runner.runner: Still creating... [1m0s elapsed]
iterative_cml_runner.runner: Still creating... [1m10s elapsed]
iterative_cml_runner.runner: Still creating... [1m20s elapsed]
iterative_cml_runner.runner: Still creating... [1m30s elapsed]
iterative_cml_runner.runner: Still creating... [1m40s elapsed]
iterative_cml_runner.runner: Still creating... [1m50s elapsed]
iterative_cml_runner.runner: Still creating... [2m0s elapsed]
iterative_cml_runner.runner: Still creating... [2m10s elapsed]
iterative_cml_runner.runner: Still creating... [2m20s elapsed]
iterative_cml_runner.runner: Still creating... [2m30s elapsed]
iterative_cml_runner.runner: Still creating... [2m40s elapsed]
iterative_cml_runner.runner: Still creating... [2m50s elapsed]
iterative_cml_runner.runner: Still creating... [3m0s elapsed]
iterative_cml_runner.runner: Still creating... [3m10s elapsed]
iterative_cml_runner.runner: Still creating... [3m20s elapsed]
iterative_cml_runner.runner: Still creating... [3m30s elapsed]
╷
│ Error: Error checking the runner status
│
│ with iterative_cml_runner.runner,
│ on main.tf line 14, in resource "iterative_cml_runner" "runner":
│ 14: resource "iterative_cml_runner" "runner" {
│
│ -- Logs begin at Mon 2022-03-07 16:07:02 UTC, end at Mon 2022-03-07
│ 16:10:01 UTC. --
│ Mar 07 16:09:58 cml-oqbyyfr3qf systemd[1]: Started cml.service.
│ Mar 07 16:10:00 cml-oqbyyfr3qf cml.sh[16091]:
│ ***"level":"error","message":"terraform
│ -chdir='/tmp/tmp.sJzxlHtykd/.cml/cml-oqbyyfr3qf' init\n\t\nInitializing the
│ backend...\n\nInitializing provider plugins...\n- Finding latest version of
│ iterative/iterative...\n\n\t╷\n│ Error: Failed to install provider\n│ \n│
│ Error while installing iterative/iterative v0.9.14: could not query\n│
│ provider registry for registry.terraform.io/iterative/iterative: failed
│ to\n│ retrieve authentication checksums for provider: 403
│ Forbidden\n╵\n\n","stack":"Error: terraform
│ -chdir='/tmp/tmp.sJzxlHtykd/.cml/cml-oqbyyfr3qf' init\n\t\nInitializing the
│ backend...\n\nInitializing provider plugins...\n- Finding latest version of
│ iterative/iterative...\n\n\t╷\n│ Error: Failed to install provider\n│ \n│
│ Error while installing iterative/iterative v0.9.14: could not query\n│
│ provider registry for registry.terraform.io/iterative/iterative: failed
│ to\n│ retrieve authentication checksums for provider: 403
│ Forbidden\n╵\n\n\n at
│ /usr/lib/node_modules/@dvcorg/cml/src/utils.js:20:27\n at
│ ChildProcess.exithandler (child_process.js:315:5)\n at ChildProcess.emit
│ (events.js:314:20)\n at maybeClose (internal/child_process.js:1022:16)\n
│ at Process.ChildProcess._handle.onexit
│ (internal/child_process.js:287:5)","status":"terminated"***
│ Mar 07 16:10:00 cml-oqbyyfr3qf cml.sh[16091]:
│ ***"level":"info","message":"waiting 20 seconds before exiting..."***
│
╵
at /usr/local/lib/node_modules/@dvcorg/cml/src/utils.js:20:27
at ChildProcess.exithandler (node:child_process:406:5)
at ChildProcess.emit (node:events:520:28)
at maybeClose (node:internal/child_process:1092:16)
at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)","status":"terminated" }
The instance couldn't install terraform-provider-iterative to destroy itself because of a 403 error retrieving the hashes... 🤔
~@DavidGOrtega released a new version today. Maybe it was a glitch when the Terraform Registry swapped the latest version?~ Update: doesn't seem plausible, because this happened at 16:10 UTC and the new version was released at 17:10 UTC.
@dacbd I have tried a cloud runner and I can confirm that it works. It might be a transient registry error
Sorry, I should have included that I re-ran the job and it was indeed a transient error hence my "edge case". Just something to think about in a backlog 😅
/save for myself edge case:
cml runner \
--single \
--labels=target_vpc \
--token=*** \
--cloud=aws \
--cloud-region=us-west-2 \
--cloud-type=t3.small \
--cloud-aws-security-group=cml-target-rule \
{"level":"error","message":"terraform -chdir='/home/runner/.cml/cml-5gk9yn6lk9' apply -auto-approve
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# iterative_cml_runner.runner will be created
+ resource "iterative_cml_runner" "runner" {
+ aws_security_group = "cml-target-rule"
+ cloud = "aws"
+ cml_version = "0.13.0"
+ docker_volumes = []
+ driver = "github"
+ id = (known after apply)
+ idle_timeout = 300
+ instance_hdd_size = 35
+ instance_ip = (known after apply)
+ instance_launch_time = (known after apply)
+ instance_type = "t3.small"
+ labels = "target_vpc"
+ name = "cml-5gk9yn6lk9"
+ region = "us-west-2"
+ repo = "https://github.com/dacbd/cml-pulse"
+ single = true
+ spot = false
+ spot_price = -1
+ ssh_public = (known after apply)
+ startup_script = (sensitive value)
+ token = (sensitive value)
}
Plan: 1 to add, 0 to change, 0 to destroy.
iterative_cml_runner.runner: Creating...
iterative_cml_runner.runner: Still creating... [10s elapsed]
iterative_cml_runner.runner: Still creating... [20s elapsed]
iterative_cml_runner.runner: Still creating... [30s elapsed]
iterative_cml_runner.runner: Still creating... [40s elapsed]
iterative_cml_runner.runner: Still creating... [50s elapsed]
iterative_cml_runner.runner: Still creating... [1m0s elapsed]
iterative_cml_runner.runner: Still creating... [1m10s elapsed]
iterative_cml_runner.runner: Still creating... [1m20s elapsed]
iterative_cml_runner.runner: Still creating... [1m30s elapsed]
iterative_cml_runner.runner: Still creating... [1m40s elapsed]
iterative_cml_runner.runner: Still creating... [1m50s elapsed]
iterative_cml_runner.runner: Still creating... [2m0s elapsed]
iterative_cml_runner.runner: Still creating... [2m10s elapsed]
iterative_cml_runner.runner: Still creating... [2m20s elapsed]
iterative_cml_runner.runner: Still creating... [2m30s elapsed]
iterative_cml_runner.runner: Still creating... [2m40s elapsed]
iterative_cml_runner.runner: Still creating... [2m50s elapsed]
iterative_cml_runner.runner: Still creating... [3m0s elapsed]
iterative_cml_runner.runner: Still creating... [3m10s elapsed]
iterative_cml_runner.runner: Still creating... [3m20s elapsed]
iterative_cml_runner.runner: Still creating... [3m30s elapsed]
iterative_cml_runner.runner: Still creating... [3m40s elapsed]
iterative_cml_runner.runner: Still creating... [3m50s elapsed]
iterative_cml_runner.runner: Still creating... [4m0s elapsed]
iterative_cml_runner.runner: Still creating... [4m10s elapsed]
iterative_cml_runner.runner: Still creating... [4m20s elapsed]
iterative_cml_runner.runner: Still creating... [4m30s elapsed]
iterative_cml_runner.runner: Still creating... [4m40s elapsed]
iterative_cml_runner.runner: Still creating... [4m50s elapsed]
iterative_cml_runner.runner: Still creating... [5m0s elapsed]
iterative_cml_runner.runner: Still creating... [5m10s elapsed]
iterative_cml_runner.runner: Still creating... [5m20s elapsed]
iterative_cml_runner.runner: Still creating... [5m30s elapsed]
iterative_cml_runner.runner: Still creating... [5m40s elapsed]
iterative_cml_runner.runner: Still creating... [5m50s elapsed]
iterative_cml_runner.runner: Still creating... [6m0s elapsed]
iterative_cml_runner.runner: Still creating... [6m10s elapsed]
iterative_cml_runner.runner: Still creating... [6m20s elapsed]
iterative_cml_runner.runner: Still creating... [6m30s elapsed]
iterative_cml_runner.runner: Still creating... [6m40s elapsed]
iterative_cml_runner.runner: Still creating... [6m50s elapsed]
iterative_cml_runner.runner: Still creating... [7m0s elapsed]
iterative_cml_runner.runner: Still creating... [7m10s elapsed]
iterative_cml_runner.runner: Still creating... [7m20s elapsed]
iterative_cml_runner.runner: Still creating... [7m30s elapsed]
iterative_cml_runner.runner: Still creating... [7m40s elapsed]
iterative_cml_runner.runner: Still creating... [7m50s elapsed]
iterative_cml_runner.runner: Still creating... [8m0s elapsed]
iterative_cml_runner.runner: Still creating... [8m10s elapsed]
iterative_cml_runner.runner: Still creating... [8m20s elapsed]
iterative_cml_runner.runner: Still creating... [8m30s elapsed]
iterative_cml_runner.runner: Still creating... [8m40s elapsed]
iterative_cml_runner.runner: Still creating... [8m50s elapsed]
iterative_cml_runner.runner: Still creating... [9m0s elapsed]
iterative_cml_runner.runner: Still creating... [9m10s elapsed]
iterative_cml_runner.runner: Still creating... [9m20s elapsed]
iterative_cml_runner.runner: Still creating... [9m30s elapsed]
iterative_cml_runner.runner: Still creating... [9m40s elapsed]
iterative_cml_runner.runner: Still creating... [9m50s elapsed]
iterative_cml_runner.runner: Still creating... [10m0s elapsed]
╷
│ Error: Failed disposing the machine: operation error EC2: DescribeInstances, failed to sign request: failed to retrieve credentials: request canceled, context deadline exceeded
│
│ with iterative_cml_runner.runner,
│ on main.tf line 14, in resource "iterative_cml_runner" "runner":
│ 14: resource "iterative_cml_runner" "runner" {
│
╵
╷
│ Error: Error checking the runner status
│
│ with iterative_cml_runner.runner,
│ on main.tf line 14, in resource "iterative_cml_runner" "runner":
│ 14: resource "iterative_cml_runner" "runner" {
│
│
│ dial tcp 18.237.22.141:22: i/o timeout
╵
","stack":"Error: terraform -chdir='/home/runner/.cml/cml-5gk9yn6lk9' apply -auto-approve
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# iterative_cml_runner.runner will be created
+ resource "iterative_cml_runner" "runner" {
+ aws_security_group = "cml-target-rule"
+ cloud = "aws"
+ cml_version = "0.13.0"
+ docker_volumes = []
+ driver = "github"
+ id = (known after apply)
+ idle_timeout = 300
+ instance_hdd_size = 35
+ instance_ip = (known after apply)
+ instance_launch_time = (known after apply)
+ instance_type = "t3.small"
+ labels = "target_vpc"
+ name = "cml-5gk9yn6lk9"
+ region = "us-west-2"
+ repo = "https://github.com/dacbd/cml-pulse"
+ single = true
+ spot = false
+ spot_price = -1
+ ssh_public = (known after apply)
+ startup_script = (sensitive value)
+ token = (sensitive value)
}
Plan: 1 to add, 0 to change, 0 to destroy.
iterative_cml_runner.runner: Creating...
iterative_cml_runner.runner: Still creating... [10s elapsed]
iterative_cml_runner.runner: Still creating... [20s elapsed]
iterative_cml_runner.runner: Still creating... [30s elapsed]
iterative_cml_runner.runner: Still creating... [40s elapsed]
iterative_cml_runner.runner: Still creating... [50s elapsed]
iterative_cml_runner.runner: Still creating... [1m0s elapsed]
iterative_cml_runner.runner: Still creating... [1m10s elapsed]
iterative_cml_runner.runner: Still creating... [1m20s elapsed]
iterative_cml_runner.runner: Still creating... [1m30s elapsed]
iterative_cml_runner.runner: Still creating... [1m40s elapsed]
iterative_cml_runner.runner: Still creating... [1m50s elapsed]
iterative_cml_runner.runner: Still creating... [2m0s elapsed]
iterative_cml_runner.runner: Still creating... [2m10s elapsed]
iterative_cml_runner.runner: Still creating... [2m20s elapsed]
iterative_cml_runner.runner: Still creating... [2m30s elapsed]
iterative_cml_runner.runner: Still creating... [2m40s elapsed]
iterative_cml_runner.runner: Still creating... [2m50s elapsed]
iterative_cml_runner.runner: Still creating... [3m0s elapsed]
iterative_cml_runner.runner: Still creating... [3m10s elapsed]
iterative_cml_runner.runner: Still creating... [3m20s elapsed]
iterative_cml_runner.runner: Still creating... [3m30s elapsed]
iterative_cml_runner.runner: Still creating... [3m40s elapsed]
iterative_cml_runner.runner: Still creating... [3m50s elapsed]
iterative_cml_runner.runner: Still creating... [4m0s elapsed]
iterative_cml_runner.runner: Still creating... [4m10s elapsed]
iterative_cml_runner.runner: Still creating... [4m20s elapsed]
iterative_cml_runner.runner: Still creating... [4m30s elapsed]
iterative_cml_runner.runner: Still creating... [4m40s elapsed]
iterative_cml_runner.runner: Still creating... [4m50s elapsed]
iterative_cml_runner.runner: Still creating... [5m0s elapsed]
iterative_cml_runner.runner: Still creating... [5m10s elapsed]
iterative_cml_runner.runner: Still creating... [5m20s elapsed]
iterative_cml_runner.runner: Still creating... [5m30s elapsed]
iterative_cml_runner.runner: Still creating... [5m40s elapsed]
iterative_cml_runner.runner: Still creating... [5m50s elapsed]
iterative_cml_runner.runner: Still creating... [6m0s elapsed]
iterative_cml_runner.runner: Still creating... [6m10s elapsed]
iterative_cml_runner.runner: Still creating... [6m20s elapsed]
iterative_cml_runner.runner: Still creating... [6m30s elapsed]
iterative_cml_runner.runner: Still creating... [6m40s elapsed]
iterative_cml_runner.runner: Still creating... [6m50s elapsed]
iterative_cml_runner.runner: Still creating... [7m0s elapsed]
iterative_cml_runner.runner: Still creating... [7m10s elapsed]
iterative_cml_runner.runner: Still creating... [7m20s elapsed]
iterative_cml_runner.runner: Still creating... [7m30s elapsed]
iterative_cml_runner.runner: Still creating... [7m40s elapsed]
iterative_cml_runner.runner: Still creating... [7m50s elapsed]
iterative_cml_runner.runner: Still creating... [8m0s elapsed]
iterative_cml_runner.runner: Still creating... [8m10s elapsed]
iterative_cml_runner.runner: Still creating... [8m20s elapsed]
iterative_cml_runner.runner: Still creating... [8m30s elapsed]
iterative_cml_runner.runner: Still creating... [8m40s elapsed]
iterative_cml_runner.runner: Still creating... [8m50s elapsed]
iterative_cml_runner.runner: Still creating... [9m0s elapsed]
iterative_cml_runner.runner: Still creating... [9m10s elapsed]
iterative_cml_runner.runner: Still creating... [9m20s elapsed]
iterative_cml_runner.runner: Still creating... [9m30s elapsed]
iterative_cml_runner.runner: Still creating... [9m40s elapsed]
iterative_cml_runner.runner: Still creating... [9m50s elapsed]
iterative_cml_runner.runner: Still creating... [10m0s elapsed]
╷
│ Error: Failed disposing the machine: operation error EC2: DescribeInstances, failed to sign request: failed to retrieve credentials: request canceled, context deadline exceeded
│
│ with iterative_cml_runner.runner,
│ on main.tf line 14, in resource "iterative_cml_runner" "runner":
│ 14: resource "iterative_cml_runner" "runner" {
│
╵
╷
│ Error: Error checking the runner status
│
│ with iterative_cml_runner.runner,
│ on main.tf line 14, in resource "iterative_cml_runner" "runner":
│ 14: resource "iterative_cml_runner" "runner" {
│
│
│ dial tcp 18.237.22.141:22: i/o timeout
╵
at /usr/local/lib/node_modules/@dvcorg/cml/src/utils.js:20:27
at ChildProcess.exithandler (node:child_process:406:5)
at ChildProcess.emit (node:events:526:28)
at maybeClose (node:internal/child_process:1092:16)
at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)","status":"terminated"}
getting the second version of this bug too
@casperdcl I went back through my commits/gha runs / and combed through the AWS cloud trail log for the time frame to double-check I didn't change the cloud env at all. I'm not seeing anything obvious that I changed to not get that error a second time.
I want to through it up to a "transient network" type of error. Unless someone is able to consistently produce this error I think the only action point is more defensive handling of errors / some kind of retry mechanism for certain error types?
If it does reoccur I think: https://github.com/iterative/cml/pull/1052 will be helpful