
cml runner fails to provision VM on Azure

Open francesco086 opened this issue 3 years ago • 16 comments

I used the following as simple "hello world" for cml runner on GitLab Community Edition [14.10.1]:

deploy-runner:
  image: iterativeai/cml:latest
  script:
    - |
      cml runner \
          --cloud=azure \
          --cloud-region=eu-west \
          --cloud-type=s \
          --cloud-spot \
          --labels=cml-vm

train-model:
  needs: [deploy-runner]
  tags:
    - cml-vm
  image: ubuntu:latest
  script:
    - echo "hello"

I set up the following variables:

  • AZURE_CLIENT_ID
  • AZURE_CLIENT_SECRET
  • AZURE_SUBSCRIPTION_ID
  • AZURE_TENANT_ID
  • REPO_TOKEN as well as PERSONAL_ACCESS_TOKEN, because I found the documentation unclear on this point (https://cml.dev/doc/self-hosted-runners?tab=GitLab#personal-access-token)

Unfortunately this results in:

$ cml runner \ # collapsed multi-line command
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Deploying cloud runner plan..."}
{"level":"info","message":"Terraform apply..."}
{"level":"error","message":"terraform -chdir='/home/runner' apply -auto-approve\n\t\nTerraform used the selected providers to generate the following execution\nplan. Resource actions are indicated with the following symbols:\n  + create\n\nTerraform will perform the following actions:\n\n  # iterative_cml_runner.runner will be created\n  + resource \"iterative_cml_runner\" \"runner\" {\n      + cloud                = \"azure\"\n      + cml_version          = \"0.15.2\"\n      + docker_volumes       = []\n      + driver               = \"gitlab\"\n      + id                   = (known after apply)\n      + idle_timeout         = 300\n      + instance_hdd_size    = 35\n      + instance_ip          = (known after apply)\n      + instance_launch_time = (known after apply)\n      + instance_type        = \"s\"\n      + labels               = \"cml-vm\"\n      + name                 = \"cml-elde6fnyv0\"\n      + region               = \"eu-west\"\n      + repo                 = \"https://git.eon-cds.de/F18771/cml-runner-blobfuse\"\n      + single               = false\n      + spot                 = true\n      + spot_price           = -1\n      + ssh_public           = (known after apply)\n      + token                = (sensitive value)\n    }\n\nPlan: 1 to add, 0 to change, 0 to destroy.\niterative_cml_runner.runner: Creating...\niterative_cml_runner.runner: Still creating... [10s elapsed]\niterative_cml_runner.runner: Still creating... [20s elapsed]\niterative_cml_runner.runner: Still creating... [30s elapsed]\niterative_cml_runner.runner: Still creating... [40s elapsed]\niterative_cml_runner.runner: Still creating... [50s elapsed]\niterative_cml_runner.runner: Still creating... [1m0s elapsed]\niterative_cml_runner.runner: Still creating... [1m10s elapsed]\niterative_cml_runner.runner: Still creating... [1m20s elapsed]\niterative_cml_runner.runner: Still creating... [1m30s elapsed]\niterative_cml_runner.runner: Still creating... 
[1m40s elapsed]\niterative_cml_runner.runner: Still creating... [1m50s elapsed]\niterative_cml_runner.runner: Still creating... [2m0s elapsed]\niterative_cml_runner.runner: Still creating... [2m10s elapsed]\niterative_cml_runner.runner: Still creating... [2m20s elapsed]\niterative_cml_runner.runner: Still creating... [2m30s elapsed]\niterative_cml_runner.runner: Still creating... [2m40s elapsed]\niterative_cml_runner.runner: Still creating... [2m50s elapsed]\niterative_cml_runner.runner: Still creating... [3m0s elapsed]\niterative_cml_runner.runner: Still creating... [3m10s elapsed]\niterative_cml_runner.runner: Still creating... [3m20s elapsed]\niterative_cml_runner.runner: Still creating... [3m30s elapsed]\niterative_cml_runner.runner: Still creating... [3m40s elapsed]\n\n\t╷\n│ Error: Failed creating the machine: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. 
Response body: <?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n│ <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n│ \t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n│ <html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n│  <head>\n│   <title>404 - Not Found</title>\n│  </head>\n│  <body>\n│   <h1>404 - Not Found</h1>\n│  </body>\n│ </html>\n│  Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=[MASKED]&resource=https%3A%2F%2Fmanagement.azure.com%2F\n│ \n│   with iterative_cml_runner.runner,\n│   on main.tf line 8, in resource \"iterative_cml_runner\" \"runner\":\n│    8: resource \"iterative_cml_runner\" \"runner\" {\n│ \n╵\n","stack":"Error: terraform -chdir='/home/runner' apply -auto-approve\n\t\nTerraform used the selected providers to generate the following execution\nplan. Resource actions are indicated with the following symbols:\n  + create\n\nTerraform will perform the following actions:\n\n  # iterative_cml_runner.runner will be created\n  + resource \"iterative_cml_runner\" \"runner\" {\n      + cloud                = \"azure\"\n      + cml_version          = \"0.15.2\"\n      + docker_volumes       = []\n      + driver               = \"gitlab\"\n      + id                   = (known after apply)\n      + idle_timeout         = 300\n      + instance_hdd_size    = 35\n      + instance_ip          = (known after apply)\n      + instance_launch_time = (known after apply)\n      + instance_type        = \"s\"\n      + labels               = \"cml-vm\"\n      + name                 = \"cml-elde6fnyv0\"\n      + region               = \"eu-west\"\n      + repo                 = \"https://git.eon-cds.de/F18771/cml-runner-blobfuse\"\n      + single               = false\n      + spot                 = true\n      + spot_price           = -1\n      + ssh_public           = (known after apply)\n      + token                = (sensitive value)\n    }\n\nPlan: 1 to 
add, 0 to change, 0 to destroy.\niterative_cml_runner.runner: Creating...\niterative_cml_runner.runner: Still creating... [10s elapsed]\niterative_cml_runner.runner: Still creating... [20s elapsed]\niterative_cml_runner.runner: Still creating... [30s elapsed]\niterative_cml_runner.runner: Still creating... [40s elapsed]\niterative_cml_runner.runner: Still creating... [50s elapsed]\niterative_cml_runner.runner: Still creating... [1m0s elapsed]\niterative_cml_runner.runner: Still creating... [1m10s elapsed]\niterative_cml_runner.runner: Still creating... [1m20s elapsed]\niterative_cml_runner.runner: Still creating... [1m30s elapsed]\niterative_cml_runner.runner: Still creating... [1m40s elapsed]\niterative_cml_runner.runner: Still creating... [1m50s elapsed]\niterative_cml_runner.runner: Still creating... [2m0s elapsed]\niterative_cml_runner.runner: Still creating... [2m10s elapsed]\niterative_cml_runner.runner: Still creating... [2m20s elapsed]\niterative_cml_runner.runner: Still creating... [2m30s elapsed]\niterative_cml_runner.runner: Still creating... [2m40s elapsed]\niterative_cml_runner.runner: Still creating... [2m50s elapsed]\niterative_cml_runner.runner: Still creating... [3m0s elapsed]\niterative_cml_runner.runner: Still creating... [3m10s elapsed]\niterative_cml_runner.runner: Still creating... [3m20s elapsed]\niterative_cml_runner.runner: Still creating... [3m30s elapsed]\niterative_cml_runner.runner: Still creating... [3m40s elapsed]\n\n\t╷\n│ Error: Failed creating the machine: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. 
Response body: <?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n│ <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n│ \t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n│ <html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n│  <head>\n│   <title>404 - Not Found</title>\n│  </head>\n│  <body>\n│   <h1>404 - Not Found</h1>\n│  </body>\n│ </html>\n│  Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=[MASKED]&resource=https%3A%2F%2Fmanagement.azure.com%2F\n│ \n│   with iterative_cml_runner.runner,\n│   on main.tf line 8, in resource \"iterative_cml_runner\" \"runner\":\n│    8: resource \"iterative_cml_runner\" \"runner\" {\n│ \n╵\n\n    at /usr/lib/node_modules/@dvcorg/cml/src/utils.js:20:27\n    at ChildProcess.exithandler (node:child_process:406:5)\n    at ChildProcess.emit (node:events:527:28)\n    at maybeClose (node:internal/child_process:1092:16)\n    at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)","status":"terminated"}
{"level":"info","message":"waiting 10 seconds before exiting..."}

My team and I could not work out what the problem is.

Additional info:

  • I also tried cml with the iterativeai/cml:0-dvc2-base1 Docker image
  • I tried Azure-specific values for the instance type and region, without success

Any help would be very much appreciated.

francesco086 avatar May 25 '22 14:05 francesco086
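
A preflight step along the following lines could confirm that the four Azure variables listed in the report are actually visible inside the job before cml runner starts. It is only a sketch: the function names are hypothetical and not part of CML. One assumption worth checking on GitLab is that variables marked as protected are only exposed to pipelines on protected branches and tags, so a variable can exist in the project settings and still be empty in the job.

```shell
# Hypothetical preflight helper (not part of CML): prints the name of every
# required Azure variable that is unset or empty, one name per line.
missing_azure_vars() {
  for name in AZURE_CLIENT_ID AZURE_CLIENT_SECRET AZURE_SUBSCRIPTION_ID AZURE_TENANT_ID; do
    # Indirect lookup in POSIX sh: read the value of the variable named by $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

# Returns non-zero and reports on stderr when any variable is missing.
require_azure_vars() {
  missing="$(missing_azure_vars)"
  if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing or empty variables:" $missing >&2
    return 1
  fi
}
```

Calling `require_azure_vars` in the deploy-runner script before `cml runner` would stop the job early and name the missing variables, without printing any secret values.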

Error message, extracted from the logs above

Error: Failed creating the machine: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: <?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
		 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>
 Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=[MASKED]&resource=https%3A%2F%2Fmanagement.azure.com%2F

0x2b3bfa0 avatar May 25 '22 14:05 0x2b3bfa0
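
One detail in the extracted error is easy to miss: the failing token request went to 169.254.169.254, a link-local address that Azure uses for its Instance Metadata Service, which is where managed identity tokens are requested. That may mean the request was not authenticated with the client secret at all, but this is an assumption and nothing in the thread confirms it. The sketch below only pulls the endpoint host out of a log so it can be checked; both function names are hypothetical.

```shell
# Hypothetical helper: reads log text on stdin and prints the host of the
# first "Endpoint http(s)://..." URL that appears in it.
endpoint_host() {
  sed -n 's#.*Endpoint https\{0,1\}://\([^/ ?]*\).*#\1#p' | head -n 1
}

# Succeeds when the host is the link-local address used by the Azure
# Instance Metadata Service.
is_metadata_endpoint() {
  [ "$1" = "169.254.169.254" ]
}
```

For example, `cml runner ... 2>&1 | endpoint_host` would print the host, and `is_metadata_endpoint` could then flag the case seen in this report.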

Hello, @francesco086! It sounds like an issue with your credentials. How did you get them?

0x2b3bfa0 avatar May 25 '22 14:05 0x2b3bfa0

Mmm... the error message is not very clear imho, it would be nice if it could be more explicit.

I will be off work for the next few days; I will check the credentials again next week and come back with an update. In the meantime, many thanks for the very fast reaction!

francesco086 avatar May 25 '22 18:05 francesco086

> Mmm... the error message is not very clear imho, it would be nice if it could be more explicit.

Unfortunately, CML just relays what Azure reports...

DavidGOrtega avatar May 26 '22 09:05 DavidGOrtega

Am I right in assuming that cml tries to create a resource group dedicated to the VM used for the runner? Is there any way to customize this behavior (i.e. supply an existing resource group)? Because I'm not convinced incorrect credentials are the whole story. Rather I suspect that CML tries to create the resource group, but fails (because the SP doesn't have the necessary role), and then tries to obtain an access token to a resource group (iterative-37d31qzqeb13b in our example) that doesn't exist (status code is 404 after all).

lleiding avatar May 27 '22 08:05 lleiding

> Am I right in assuming that cml tries to create a resource group dedicated to the VM used for the runner?

Yes

> Is there any way to customize this behavior (i.e. supply an existing resource group)?

No, that behavior is hardcoded in cml runner.

> Because I'm not convinced incorrect credentials are the whole story. Rather I suspect that CML tries to create the resource group, but fails (because the SP doesn't have the necessary role), and then tries to obtain an access token to a resource group (iterative-37d31qzqeb13b in our example) that doesn't exist (status code is 404 after all).

Sounds plausible :+1:

0x2b3bfa0 avatar May 27 '22 16:05 0x2b3bfa0

See this guide and the permissions/az directory in the provider repository for a list of required permissions.

0x2b3bfa0 avatar May 27 '22 16:05 0x2b3bfa0
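
The hypothesis above, that the service principal cannot create resource groups, can be tested directly with the Azure CLI before involving cml runner. The following is a minimal sketch, assuming the Azure CLI is installed and already logged in as the same service principal; the function name and the group name passed to it are hypothetical. Passing this probe does not prove that every permission listed in the permissions/az directory is present; it only covers resource group creation and deletion.

```shell
# Hypothetical probe (not part of CML): tries to create and then delete a
# throwaway resource group with the identity the Azure CLI is logged in as.
# $1 = resource group name, $2 = Azure location (both chosen by the caller).
probe_resource_group_permissions() {
  rg_name="$1"
  rg_location="$2"
  if ! az group create --name "$rg_name" --location "$rg_location" --output none; then
    echo "cannot create resource group $rg_name" >&2
    return 1
  fi
  if ! az group delete --name "$rg_name" --yes --output none; then
    echo "created $rg_name but cannot delete it" >&2
    return 2
  fi
  echo "create and delete both succeeded"
}
```

A typical session would first run `az login --service-principal --username "$AZURE_CLIENT_ID" --password "$AZURE_CLIENT_SECRET" --tenant "$AZURE_TENANT_ID"` and `az account set --subscription "$AZURE_SUBSCRIPTION_ID"`, then call the probe with an unused group name and a location such as westeurope.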

@0x2b3bfa0 thank you! Then I guess that is the source of the problem, I think we can close the issue.

For us (@lleiding is a colleague of mine) this may be a bit problematic. In our team we are trying to set things up so that each project has its own resource group. So what lleiding asked could become a feature request: "Is there any way to customize this behavior (i.e. supply an existing resource group)?"

francesco086 avatar May 27 '22 16:05 francesco086

There's --cloud-permission-set and more docs coming soon in https://github.com/iterative/cml.dev/pull/242?

EDIT: not relevant; see below.

casperdcl avatar May 27 '22 17:05 casperdcl

Is this new option for GCP and AWS only? no Azure?

francesco086 avatar May 27 '22 19:05 francesco086

> In our team we are trying to set things up so that each project has its own resource group.

Could you use a separate subscription for every team instead?

> So what lleiding asked could become a feature request: "Is there any way to customize this behavior (i.e. supply an existing resource group)?"

Feel free to open a follow-up issue, although it's unlikely that we will implement it anytime soon. The current functionality relies heavily on being able to delete a whole resource group with a single API call. 😅

0x2b3bfa0 avatar May 27 '22 20:05 0x2b3bfa0
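
To illustrate why a dedicated resource group is convenient for cleanup, here is a rough Azure CLI equivalent of the single-call deletion described above. It is a sketch, not what cml runner actually executes: the function names are hypothetical, and the `iterative-` prefix is an assumption based on the one group name that appears in the log (iterative-37d31qzqeb13b).

```shell
# List resource groups whose name starts with a prefix (default "iterative-",
# an assumption based on the single group name in the log).
list_runner_groups() {
  prefix="${1:-iterative-}"
  az group list --query "[?starts_with(name, '${prefix}')].name" --output tsv
}

# Delete one resource group, and every resource inside it, with one request.
# --no-wait returns immediately instead of blocking until deletion finishes.
delete_runner_group() {
  az group delete --name "$1" --yes --no-wait
}
```

With an existing, shared resource group this shortcut is not available: each VM, disk and network resource would have to be tracked and deleted individually.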

> Is this new option for GCP and AWS only? no Azure?

Yes, still not supported on Azure, but please upvote & consider watching the following issues:

  • https://github.com/iterative/terraform-provider-iterative/issues/235
  • https://github.com/iterative/terraform-provider-iterative/issues/559

Note that --cloud-permission-set is not related to your issue, though: it's just to use managed identities inside your workflows.

0x2b3bfa0 avatar May 27 '22 20:05 0x2b3bfa0

Closing as per https://github.com/iterative/cml/issues/1019#issuecomment-1139778698; @francesco086, feel free to reopen this issue if you deem it opportune.

0x2b3bfa0 avatar May 29 '22 19:05 0x2b3bfa0

To recap, @0x2b3bfa0 / @francesco086: the issue is that Azure doesn't support nested resource groups, and we rely on a dedicated resource group to clean up all resources with a single API call, but here the credentials are only valid for a predefined resource group.

dacbd avatar Jun 13 '22 17:06 dacbd

> nested resource groups aren't supported

In the sense that they aren't even a thing: resource groups are a flat structure by design, and can't be nested.

0x2b3bfa0 avatar Jun 13 '22 18:06 0x2b3bfa0

Possible solutions

  • Allow cml runner to create and delete resource groups at will, as long as their name matches a pattern[^1]
  • Use a subscription to isolate runners instead of a resource group, as suggested on cml#1019 (comment)
  • Use a fixed deployment (i.e. the officially recommended solution); hard to implement

[^1]: Azure role assignment conditions are still in preview 🙃

0x2b3bfa0 avatar Jun 17 '22 02:06 0x2b3bfa0

@iterative/cml, any objections to wontfix for now?

0x2b3bfa0 avatar Oct 12 '22 16:10 0x2b3bfa0