cml icon indicating copy to clipboard operation
cml copied to clipboard

`cml runner` failure edge case

Open dacbd opened this issue 4 years ago • 7 comments

Documenting an edge case that happened during a daily pulse check, the cml runner command failed but created an instance that was left on, Let's call it "unattended" or "orphaned"?

ran with

cml-runner \
    --single \
    --log=debug \
    --idle-timeout=1800 \
    --token=*** \
    --cloud=gcp \
    --cloud-region=us-west \
    --cloud-type=e2-standard-2 \
    --cloud-hdd-size=10

de-potatoed log

{"level":"warn","message":"ignoring RUNNER_NAME environment variable, use CML_RUNNER_NAME or --name instead"}
{"level":"info","message":"Preparing workdir /home/runner/.cml/cml-oqbyyfr3qf..."}
{"level":"info","message":"Deploying cloud runner plan..."}
{"level":"info","message":"Terraform apply..."}
{"level":"error","message":"terraform -chdir='/home/runner/.cml/cml-oqbyyfr3qf' apply -auto-approve
	
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # iterative_cml_runner.runner will be created
  + resource "iterative_cml_runner" "runner" {
      + cloud                = "gcp"
      + docker_volumes       = []
      + driver               = "github"
      + id                   = (known after apply)
      + idle_timeout         = 1800
      + instance_hdd_size    = 10
      + instance_ip          = (known after apply)
      + instance_launch_time = (known after apply)
      + instance_type        = "e2-standard-2"
      + labels               = "cml"
      + name                 = "cml-oqbyyfr3qf"
      + region               = "us-west"
      + repo                 = "xxxx"
      + single               = true
      + spot                 = false
      + spot_price           = -1
      + ssh_public           = (known after apply)
      + token                = (sensitive value)
    ***

Plan: 1 to add, 0 to change, 0 to destroy.
iterative_cml_runner.runner: Creating...
iterative_cml_runner.runner: Still creating... [10s elapsed]
iterative_cml_runner.runner: Still creating... [20s elapsed]
iterative_cml_runner.runner: Still creating... [30s elapsed]
iterative_cml_runner.runner: Still creating... [40s elapsed]
iterative_cml_runner.runner: Still creating... [50s elapsed]
iterative_cml_runner.runner: Still creating... [1m0s elapsed]
iterative_cml_runner.runner: Still creating... [1m10s elapsed]
iterative_cml_runner.runner: Still creating... [1m20s elapsed]
iterative_cml_runner.runner: Still creating... [1m30s elapsed]
iterative_cml_runner.runner: Still creating... [1m40s elapsed]
iterative_cml_runner.runner: Still creating... [1m50s elapsed]
iterative_cml_runner.runner: Still creating... [2m0s elapsed]
iterative_cml_runner.runner: Still creating... [2m10s elapsed]
iterative_cml_runner.runner: Still creating... [2m20s elapsed]
iterative_cml_runner.runner: Still creating... [2m30s elapsed]
iterative_cml_runner.runner: Still creating... [2m40s elapsed]
iterative_cml_runner.runner: Still creating... [2m50s elapsed]
iterative_cml_runner.runner: Still creating... [3m0s elapsed]
iterative_cml_runner.runner: Still creating... [3m10s elapsed]
iterative_cml_runner.runner: Still creating... [3m20s elapsed]
iterative_cml_runner.runner: Still creating... [3m30s elapsed]

	╷
│ Error: Error checking the runner status
│ 
│   with iterative_cml_runner.runner,
│   on main.tf line 14, in resource "iterative_cml_runner" "runner":
│   14: resource "iterative_cml_runner" "runner" {
│ 
│ -- Logs begin at Mon 2022-03-07 16:07:02 UTC, end at Mon 2022-03-07
│ 16:10:01 UTC. --
│ Mar 07 16:09:58 cml-oqbyyfr3qf systemd[1]: Started cml.service.
│ Mar 07 16:10:00 cml-oqbyyfr3qf cml.sh[16091]:
│ {"level":"error","message":"terraform
│ -chdir='/tmp/tmp.sJzxlHtykd/.cml/cml-oqbyyfr3qf' init\n\t\nInitializing the
│ backend...\n\nInitializing provider plugins...\n- Finding latest version of
│ iterative/iterative...\n\n\t╷\n│ Error: Failed to install provider\n│ \n│
│ Error while installing iterative/iterative v0.9.14: could not query\n│
│ provider registry for registry.terraform.io/iterative/iterative: failed
│ to\n│ retrieve authentication checksums for provider: 403
│ Forbidden\n╵\n\n","stack":"Error: terraform
│ -chdir='/tmp/tmp.sJzxlHtykd/.cml/cml-oqbyyfr3qf' init\n\t\nInitializing the
│ backend...\n\nInitializing provider plugins...\n- Finding latest version of
│ iterative/iterative...\n\n\t╷\n│ Error: Failed to install provider\n│ \n│
│ Error while installing iterative/iterative v0.9.14: could not query\n│
│ provider registry for registry.terraform.io/iterative/iterative: failed
│ to\n│ retrieve authentication checksums for provider: 403
│ Forbidden\n╵\n\n\n    at
│ /usr/lib/node_modules/@dvcorg/cml/src/utils.js:20:27\n    at
│ ChildProcess.exithandler (child_process.js:3[15](https://github.com/xxx/runs/5451314005?check_suite_focus=true#step:4:15):5)\n    at ChildProcess.emit
│ (events.js:314:20)\n    at maybeClose (internal/child_process.js:1022:[16](xxx/runs/5451314005?check_suite_focus=true#step:4:16))\n
│ at Process.ChildProcess._handle.onexit
│ (internal/child_process.js:287:5)","status":"terminated"}
│ Mar 07 16:10:00 cml-oqbyyfr3qf cml.sh[16091]:
│ {"level":"info","message":"waiting 20 seconds before exiting..."}
│ 
╵
","stack":"Error: terraform -chdir='/home/runner/.cml/cml-oqbyyfr3qf' apply -auto-approve
	
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # iterative_cml_runner.runner will be created
  + resource "iterative_cml_runner" "runner" {
      + cloud                = "gcp"
      + docker_volumes       = []
      + driver               = "github"
      + id                   = (known after apply)
      + idle_timeout         = 1800
      + instance_hdd_size    = 10
      + instance_ip          = (known after apply)
      + instance_launch_time = (known after apply)
      + instance_type        = "e2-standard-2"
      + labels               = "cml"
      + name                 = "cml-oqbyyfr3qf"
      + region               = "us-west"
      + repo                 = "xxx"
      + single               = true
      + spot                 = false
      + spot_price           = -1
      + ssh_public           = (known after apply)
      + token                = (sensitive value)
    }

Plan: 1 to add, 0 to change, 0 to destroy.
iterative_cml_runner.runner: Creating...
iterative_cml_runner.runner: Still creating... [10s elapsed]
iterative_cml_runner.runner: Still creating... [20s elapsed]
iterative_cml_runner.runner: Still creating... [30s elapsed]
iterative_cml_runner.runner: Still creating... [40s elapsed]
iterative_cml_runner.runner: Still creating... [50s elapsed]
iterative_cml_runner.runner: Still creating... [1m0s elapsed]
iterative_cml_runner.runner: Still creating... [1m10s elapsed]
iterative_cml_runner.runner: Still creating... [1m20s elapsed]
iterative_cml_runner.runner: Still creating... [1m30s elapsed]
iterative_cml_runner.runner: Still creating... [1m40s elapsed]
iterative_cml_runner.runner: Still creating... [1m50s elapsed]
iterative_cml_runner.runner: Still creating... [2m0s elapsed]
iterative_cml_runner.runner: Still creating... [2m10s elapsed]
iterative_cml_runner.runner: Still creating... [2m20s elapsed]
iterative_cml_runner.runner: Still creating... [2m30s elapsed]
iterative_cml_runner.runner: Still creating... [2m40s elapsed]
iterative_cml_runner.runner: Still creating... [2m50s elapsed]
iterative_cml_runner.runner: Still creating... [3m0s elapsed]
iterative_cml_runner.runner: Still creating... [3m10s elapsed]
iterative_cml_runner.runner: Still creating... [3m20s elapsed]
iterative_cml_runner.runner: Still creating... [3m30s elapsed]

	╷
│ Error: Error checking the runner status
│ 
│   with iterative_cml_runner.runner,
│   on main.tf line 14, in resource "iterative_cml_runner" "runner":
│   14: resource "iterative_cml_runner" "runner" {
│ 
│ -- Logs begin at Mon 2022-03-07 16:07:02 UTC, end at Mon 2022-03-07
│ 16:10:01 UTC. --
│ Mar 07 16:09:58 cml-oqbyyfr3qf systemd[1]: Started cml.service.
│ Mar 07 16:10:00 cml-oqbyyfr3qf cml.sh[16091]:
│ ***"level":"error","message":"terraform
│ -chdir='/tmp/tmp.sJzxlHtykd/.cml/cml-oqbyyfr3qf' init\n\t\nInitializing the
│ backend...\n\nInitializing provider plugins...\n- Finding latest version of
│ iterative/iterative...\n\n\t╷\n│ Error: Failed to install provider\n│ \n│
│ Error while installing iterative/iterative v0.9.14: could not query\n│
│ provider registry for registry.terraform.io/iterative/iterative: failed
│ to\n│ retrieve authentication checksums for provider: 403
│ Forbidden\n╵\n\n","stack":"Error: terraform
│ -chdir='/tmp/tmp.sJzxlHtykd/.cml/cml-oqbyyfr3qf' init\n\t\nInitializing the
│ backend...\n\nInitializing provider plugins...\n- Finding latest version of
│ iterative/iterative...\n\n\t╷\n│ Error: Failed to install provider\n│ \n│
│ Error while installing iterative/iterative v0.9.14: could not query\n│
│ provider registry for registry.terraform.io/iterative/iterative: failed
│ to\n│ retrieve authentication checksums for provider: 403
│ Forbidden\n╵\n\n\n    at
│ /usr/lib/node_modules/@dvcorg/cml/src/utils.js:20:27\n    at
│ ChildProcess.exithandler (child_process.js:315:5)\n    at ChildProcess.emit
│ (events.js:314:20)\n    at maybeClose (internal/child_process.js:1022:16)\n
│ at Process.ChildProcess._handle.onexit
│ (internal/child_process.js:287:5)","status":"terminated"***
│ Mar 07 16:10:00 cml-oqbyyfr3qf cml.sh[16091]:
│ ***"level":"info","message":"waiting 20 seconds before exiting..."***
│ 
╵

    at /usr/local/lib/node_modules/@dvcorg/cml/src/utils.js:20:27
    at ChildProcess.exithandler (node:child_process:406:5)
    at ChildProcess.emit (node:events:520:28)
    at maybeClose (node:internal/child_process:1092:16)
    at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)","status":"terminated" }

dacbd avatar Mar 07 '22 18:03 dacbd

The instance couldn't install terraform-provider-iterative to destroy itself because of a 403 error retrieving the hashes... 🤔

~@DavidGOrtega released a new version today. Maybe it was a glitch when the Terraform Registry swapped the latest version?~ Update: doesn't seem plausible, because this happened at 16:10 UTC and the new version was released at 17:10 UTC.

0x2b3bfa0 avatar Mar 07 '22 18:03 0x2b3bfa0

@dacbd I have tried a cloud runner and I can confirm that it works. It might be a transient registry error

DavidGOrtega avatar Mar 07 '22 18:03 DavidGOrtega

Sorry, I should have included that I re-ran the job and it was indeed a transient error hence my "edge case". Just something to think about in a backlog 😅

dacbd avatar Mar 07 '22 18:03 dacbd

/save for myself edge case:

  cml runner \
    --single \
    --labels=target_vpc \
    --token=*** \
    --cloud=aws \
    --cloud-region=us-west-2 \
    --cloud-type=t3.small \
    --cloud-aws-security-group=cml-target-rule \

logs

{"level":"error","message":"terraform -chdir='/home/runner/.cml/cml-5gk9yn6lk9' apply -auto-approve
	
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # iterative_cml_runner.runner will be created
  + resource "iterative_cml_runner" "runner" {
      + aws_security_group   = "cml-target-rule"
      + cloud                = "aws"
      + cml_version          = "0.13.0"
      + docker_volumes       = []
      + driver               = "github"
      + id                   = (known after apply)
      + idle_timeout         = 300
      + instance_hdd_size    = 35
      + instance_ip          = (known after apply)
      + instance_launch_time = (known after apply)
      + instance_type        = "t3.small"
      + labels               = "target_vpc"
      + name                 = "cml-5gk9yn6lk9"
      + region               = "us-west-2"
      + repo                 = "https://github.com/dacbd/cml-pulse"
      + single               = true
      + spot                 = false
      + spot_price           = -1
      + ssh_public           = (known after apply)
      + startup_script       = (sensitive value)
      + token                = (sensitive value)
    }

Plan: 1 to add, 0 to change, 0 to destroy.
iterative_cml_runner.runner: Creating...
iterative_cml_runner.runner: Still creating... [10s elapsed]
iterative_cml_runner.runner: Still creating... [20s elapsed]
iterative_cml_runner.runner: Still creating... [30s elapsed]
iterative_cml_runner.runner: Still creating... [40s elapsed]
iterative_cml_runner.runner: Still creating... [50s elapsed]
iterative_cml_runner.runner: Still creating... [1m0s elapsed]
iterative_cml_runner.runner: Still creating... [1m10s elapsed]
iterative_cml_runner.runner: Still creating... [1m20s elapsed]
iterative_cml_runner.runner: Still creating... [1m30s elapsed]
iterative_cml_runner.runner: Still creating... [1m40s elapsed]
iterative_cml_runner.runner: Still creating... [1m50s elapsed]
iterative_cml_runner.runner: Still creating... [2m0s elapsed]
iterative_cml_runner.runner: Still creating... [2m10s elapsed]
iterative_cml_runner.runner: Still creating... [2m20s elapsed]
iterative_cml_runner.runner: Still creating... [2m30s elapsed]
iterative_cml_runner.runner: Still creating... [2m40s elapsed]
iterative_cml_runner.runner: Still creating... [2m50s elapsed]
iterative_cml_runner.runner: Still creating... [3m0s elapsed]
iterative_cml_runner.runner: Still creating... [3m10s elapsed]
iterative_cml_runner.runner: Still creating... [3m20s elapsed]
iterative_cml_runner.runner: Still creating... [3m30s elapsed]
iterative_cml_runner.runner: Still creating... [3m40s elapsed]
iterative_cml_runner.runner: Still creating... [3m50s elapsed]
iterative_cml_runner.runner: Still creating... [4m0s elapsed]
iterative_cml_runner.runner: Still creating... [4m10s elapsed]
iterative_cml_runner.runner: Still creating... [4m20s elapsed]
iterative_cml_runner.runner: Still creating... [4m30s elapsed]
iterative_cml_runner.runner: Still creating... [4m40s elapsed]
iterative_cml_runner.runner: Still creating... [4m50s elapsed]
iterative_cml_runner.runner: Still creating... [5m0s elapsed]
iterative_cml_runner.runner: Still creating... [5m10s elapsed]
iterative_cml_runner.runner: Still creating... [5m20s elapsed]
iterative_cml_runner.runner: Still creating... [5m30s elapsed]
iterative_cml_runner.runner: Still creating... [5m40s elapsed]
iterative_cml_runner.runner: Still creating... [5m50s elapsed]
iterative_cml_runner.runner: Still creating... [6m0s elapsed]
iterative_cml_runner.runner: Still creating... [6m10s elapsed]
iterative_cml_runner.runner: Still creating... [6m20s elapsed]
iterative_cml_runner.runner: Still creating... [6m30s elapsed]
iterative_cml_runner.runner: Still creating... [6m40s elapsed]
iterative_cml_runner.runner: Still creating... [6m50s elapsed]
iterative_cml_runner.runner: Still creating... [7m0s elapsed]
iterative_cml_runner.runner: Still creating... [7m10s elapsed]
iterative_cml_runner.runner: Still creating... [7m20s elapsed]
iterative_cml_runner.runner: Still creating... [7m30s elapsed]
iterative_cml_runner.runner: Still creating... [7m40s elapsed]
iterative_cml_runner.runner: Still creating... [7m50s elapsed]
iterative_cml_runner.runner: Still creating... [8m0s elapsed]
iterative_cml_runner.runner: Still creating... [8m10s elapsed]
iterative_cml_runner.runner: Still creating... [8m20s elapsed]
iterative_cml_runner.runner: Still creating... [8m30s elapsed]
iterative_cml_runner.runner: Still creating... [8m40s elapsed]
iterative_cml_runner.runner: Still creating... [8m50s elapsed]
iterative_cml_runner.runner: Still creating... [9m0s elapsed]
iterative_cml_runner.runner: Still creating... [9m10s elapsed]
iterative_cml_runner.runner: Still creating... [9m20s elapsed]
iterative_cml_runner.runner: Still creating... [9m30s elapsed]
iterative_cml_runner.runner: Still creating... [9m40s elapsed]
iterative_cml_runner.runner: Still creating... [9m50s elapsed]
iterative_cml_runner.runner: Still creating... [10m0s elapsed]

	╷
│ Error: Failed disposing the machine: operation error EC2: DescribeInstances, failed to sign request: failed to retrieve credentials: request canceled, context deadline exceeded
│ 
│   with iterative_cml_runner.runner,
│   on main.tf line 14, in resource "iterative_cml_runner" "runner":
│   14: resource "iterative_cml_runner" "runner" {
│ 
╵
╷
│ Error: Error checking the runner status
│ 
│   with iterative_cml_runner.runner,
│   on main.tf line 14, in resource "iterative_cml_runner" "runner":
│   14: resource "iterative_cml_runner" "runner" {
│ 
│ 
│ dial tcp 18.237.22.141:22: i/o timeout
╵
","stack":"Error: terraform -chdir='/home/runner/.cml/cml-5gk9yn6lk9' apply -auto-approve
	
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # iterative_cml_runner.runner will be created
  + resource "iterative_cml_runner" "runner" {
      + aws_security_group   = "cml-target-rule"
      + cloud                = "aws"
      + cml_version          = "0.13.0"
      + docker_volumes       = []
      + driver               = "github"
      + id                   = (known after apply)
      + idle_timeout         = 300
      + instance_hdd_size    = 35
      + instance_ip          = (known after apply)
      + instance_launch_time = (known after apply)
      + instance_type        = "t3.small"
      + labels               = "target_vpc"
      + name                 = "cml-5gk9yn6lk9"
      + region               = "us-west-2"
      + repo                 = "https://github.com/dacbd/cml-pulse"
      + single               = true
      + spot                 = false
      + spot_price           = -1
      + ssh_public           = (known after apply)
      + startup_script       = (sensitive value)
      + token                = (sensitive value)
    }

Plan: 1 to add, 0 to change, 0 to destroy.
iterative_cml_runner.runner: Creating...
iterative_cml_runner.runner: Still creating... [10s elapsed]
iterative_cml_runner.runner: Still creating... [20s elapsed]
iterative_cml_runner.runner: Still creating... [30s elapsed]
iterative_cml_runner.runner: Still creating... [40s elapsed]
iterative_cml_runner.runner: Still creating... [50s elapsed]
iterative_cml_runner.runner: Still creating... [1m0s elapsed]
iterative_cml_runner.runner: Still creating... [1m10s elapsed]
iterative_cml_runner.runner: Still creating... [1m20s elapsed]
iterative_cml_runner.runner: Still creating... [1m30s elapsed]
iterative_cml_runner.runner: Still creating... [1m40s elapsed]
iterative_cml_runner.runner: Still creating... [1m50s elapsed]
iterative_cml_runner.runner: Still creating... [2m0s elapsed]
iterative_cml_runner.runner: Still creating... [2m10s elapsed]
iterative_cml_runner.runner: Still creating... [2m20s elapsed]
iterative_cml_runner.runner: Still creating... [2m30s elapsed]
iterative_cml_runner.runner: Still creating... [2m40s elapsed]
iterative_cml_runner.runner: Still creating... [2m50s elapsed]
iterative_cml_runner.runner: Still creating... [3m0s elapsed]
iterative_cml_runner.runner: Still creating... [3m10s elapsed]
iterative_cml_runner.runner: Still creating... [3m20s elapsed]
iterative_cml_runner.runner: Still creating... [3m30s elapsed]
iterative_cml_runner.runner: Still creating... [3m40s elapsed]
iterative_cml_runner.runner: Still creating... [3m50s elapsed]
iterative_cml_runner.runner: Still creating... [4m0s elapsed]
iterative_cml_runner.runner: Still creating... [4m10s elapsed]
iterative_cml_runner.runner: Still creating... [4m20s elapsed]
iterative_cml_runner.runner: Still creating... [4m30s elapsed]
iterative_cml_runner.runner: Still creating... [4m40s elapsed]
iterative_cml_runner.runner: Still creating... [4m50s elapsed]
iterative_cml_runner.runner: Still creating... [5m0s elapsed]
iterative_cml_runner.runner: Still creating... [5m10s elapsed]
iterative_cml_runner.runner: Still creating... [5m20s elapsed]
iterative_cml_runner.runner: Still creating... [5m30s elapsed]
iterative_cml_runner.runner: Still creating... [5m40s elapsed]
iterative_cml_runner.runner: Still creating... [5m50s elapsed]
iterative_cml_runner.runner: Still creating... [6m0s elapsed]
iterative_cml_runner.runner: Still creating... [6m10s elapsed]
iterative_cml_runner.runner: Still creating... [6m20s elapsed]
iterative_cml_runner.runner: Still creating... [6m30s elapsed]
iterative_cml_runner.runner: Still creating... [6m40s elapsed]
iterative_cml_runner.runner: Still creating... [6m50s elapsed]
iterative_cml_runner.runner: Still creating... [7m0s elapsed]
iterative_cml_runner.runner: Still creating... [7m10s elapsed]
iterative_cml_runner.runner: Still creating... [7m20s elapsed]
iterative_cml_runner.runner: Still creating... [7m30s elapsed]
iterative_cml_runner.runner: Still creating... [7m40s elapsed]
iterative_cml_runner.runner: Still creating... [7m50s elapsed]
iterative_cml_runner.runner: Still creating... [8m0s elapsed]
iterative_cml_runner.runner: Still creating... [8m10s elapsed]
iterative_cml_runner.runner: Still creating... [8m20s elapsed]
iterative_cml_runner.runner: Still creating... [8m30s elapsed]
iterative_cml_runner.runner: Still creating... [8m40s elapsed]
iterative_cml_runner.runner: Still creating... [8m50s elapsed]
iterative_cml_runner.runner: Still creating... [9m0s elapsed]
iterative_cml_runner.runner: Still creating... [9m10s elapsed]
iterative_cml_runner.runner: Still creating... [9m20s elapsed]
iterative_cml_runner.runner: Still creating... [9m30s elapsed]
iterative_cml_runner.runner: Still creating... [9m40s elapsed]
iterative_cml_runner.runner: Still creating... [9m50s elapsed]
iterative_cml_runner.runner: Still creating... [10m0s elapsed]

	╷
│ Error: Failed disposing the machine: operation error EC2: DescribeInstances, failed to sign request: failed to retrieve credentials: request canceled, context deadline exceeded
│ 
│   with iterative_cml_runner.runner,
│   on main.tf line 14, in resource "iterative_cml_runner" "runner":
│   14: resource "iterative_cml_runner" "runner" {
│ 
╵
╷
│ Error: Error checking the runner status
│ 
│   with iterative_cml_runner.runner,
│   on main.tf line 14, in resource "iterative_cml_runner" "runner":
│   14: resource "iterative_cml_runner" "runner" {
│ 
│ 
│ dial tcp 18.237.22.141:22: i/o timeout
╵

    at /usr/local/lib/node_modules/@dvcorg/cml/src/utils.js:20:27
    at ChildProcess.exithandler (node:child_process:406:5)
    at ChildProcess.emit (node:events:526:28)
    at maybeClose (node:internal/child_process:1092:16)
    at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)","status":"terminated"}

dacbd avatar Apr 12 '22 00:04 dacbd

getting the second version of this bug too

casperdcl avatar May 25 '22 16:05 casperdcl

@casperdcl I went back through my commits/gha runs / and combed through the AWS cloud trail log for the time frame to double-check I didn't change the cloud env at all. I'm not seeing anything obvious that I changed to not get that error a second time.

I want to through it up to a "transient network" type of error. Unless someone is able to consistently produce this error I think the only action point is more defensive handling of errors / some kind of retry mechanism for certain error types?

dacbd avatar Jun 13 '22 16:06 dacbd

If it does reoccur I think: https://github.com/iterative/cml/pull/1052 will be helpful

dacbd avatar Jun 13 '22 16:06 dacbd