cml runner fails to provision VM on Azure
I used the following as a simple "hello world" for cml runner on GitLab Community Edition [14.10.1]:
deploy-runner:
  image: iterativeai/cml:latest
  script:
    - |
      cml runner \
        --cloud=azure \
        --cloud-region=eu-west \
        --cloud-type=s \
        --cloud-spot \
        --labels=cml-vm
train-model:
  needs: [deploy-runner]
  tags:
    - cml-vm
  image: ubuntu:latest
  script:
    - echo "hello"
I set up the following variables:
- AZURE_CLIENT_ID
- AZURE_CLIENT_SECRET
- AZURE_SUBSCRIPTION_ID
- AZURE_TENANT_ID
- REPO_TOKEN as well as PERSONAL_ACCESS_TOKEN, since I found the documentation unclear about which of the two is needed (https://cml.dev/doc/self-hosted-runners?tab=GitLab#personal-access-token)
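For anyone reproducing this, a quick pre-flight check can rule out a variable that is simply unset or blank before the runner job starts. The sketch below is only an illustration: the variable names come from the list above, while the helper name `missing_variables` and the choice to treat REPO_TOKEN (rather than PERSONAL_ACCESS_TOKEN) as the required token are assumptions.

```python
import os

# Names taken from the list above. Treating exactly these five as required
# is an assumption of this sketch, not something stated in the thread.
REQUIRED_VARIABLES = (
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_TENANT_ID",
    "REPO_TOKEN",
)


def missing_variables(environment=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if environment is None else environment
    return [name for name in REQUIRED_VARIABLES if not env.get(name, "").strip()]
```

Calling `missing_variables()` with no argument inspects the current process environment. It only checks that the values are present, not that Azure accepts them.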
Unfortunately this results in:
$ cml runner \ # collapsed multi-line command
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Deploying cloud runner plan..."}
{"level":"info","message":"Terraform apply..."}
{"level":"error","message":"terraform -chdir='/home/runner' apply -auto-approve\n\t\nTerraform used the selected providers to generate the following execution\nplan. Resource actions are indicated with the following symbols:\n + create\n\nTerraform will perform the following actions:\n\n # iterative_cml_runner.runner will be created\n + resource \"iterative_cml_runner\" \"runner\" {\n + cloud = \"azure\"\n + cml_version = \"0.15.2\"\n + docker_volumes = []\n + driver = \"gitlab\"\n + id = (known after apply)\n + idle_timeout = 300\n + instance_hdd_size = 35\n + instance_ip = (known after apply)\n + instance_launch_time = (known after apply)\n + instance_type = \"s\"\n + labels = \"cml-vm\"\n + name = \"cml-elde6fnyv0\"\n + region = \"eu-west\"\n + repo = \"https://git.eon-cds.de/F18771/cml-runner-blobfuse\"\n + single = false\n + spot = true\n + spot_price = -1\n + ssh_public = (known after apply)\n + token = (sensitive value)\n }\n\nPlan: 1 to add, 0 to change, 0 to destroy.\niterative_cml_runner.runner: Creating...\niterative_cml_runner.runner: Still creating... [10s elapsed]\niterative_cml_runner.runner: Still creating... [20s elapsed]\niterative_cml_runner.runner: Still creating... [30s elapsed]\niterative_cml_runner.runner: Still creating... [40s elapsed]\niterative_cml_runner.runner: Still creating... [50s elapsed]\niterative_cml_runner.runner: Still creating... [1m0s elapsed]\niterative_cml_runner.runner: Still creating... [1m10s elapsed]\niterative_cml_runner.runner: Still creating... [1m20s elapsed]\niterative_cml_runner.runner: Still creating... [1m30s elapsed]\niterative_cml_runner.runner: Still creating... [1m40s elapsed]\niterative_cml_runner.runner: Still creating... [1m50s elapsed]\niterative_cml_runner.runner: Still creating... [2m0s elapsed]\niterative_cml_runner.runner: Still creating... [2m10s elapsed]\niterative_cml_runner.runner: Still creating... [2m20s elapsed]\niterative_cml_runner.runner: Still creating... 
[2m30s elapsed]\niterative_cml_runner.runner: Still creating... [2m40s elapsed]\niterative_cml_runner.runner: Still creating... [2m50s elapsed]\niterative_cml_runner.runner: Still creating... [3m0s elapsed]\niterative_cml_runner.runner: Still creating... [3m10s elapsed]\niterative_cml_runner.runner: Still creating... [3m20s elapsed]\niterative_cml_runner.runner: Still creating... [3m30s elapsed]\niterative_cml_runner.runner: Still creating... [3m40s elapsed]\n\n\t╷\n│ Error: Failed creating the machine: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: <?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n│ <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n│ \t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n│ <html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n│ <head>\n│ <title>404 - Not Found</title>\n│ </head>\n│ <body>\n│ <h1>404 - Not Found</h1>\n│ </body>\n│ </html>\n│ Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=[MASKED]&resource=https%3A%2F%2Fmanagement.azure.com%2F\n│ \n│ with iterative_cml_runner.runner,\n│ on main.tf line 8, in resource \"iterative_cml_runner\" \"runner\":\n│ 8: resource \"iterative_cml_runner\" \"runner\" {\n│ \n╵\n","stack":"Error: terraform -chdir='/home/runner' apply -auto-approve\n\t\nTerraform used the selected providers to generate the following execution\nplan. 
Resource actions are indicated with the following symbols:\n + create\n\nTerraform will perform the following actions:\n\n # iterative_cml_runner.runner will be created\n + resource \"iterative_cml_runner\" \"runner\" {\n + cloud = \"azure\"\n + cml_version = \"0.15.2\"\n + docker_volumes = []\n + driver = \"gitlab\"\n + id = (known after apply)\n + idle_timeout = 300\n + instance_hdd_size = 35\n + instance_ip = (known after apply)\n + instance_launch_time = (known after apply)\n + instance_type = \"s\"\n + labels = \"cml-vm\"\n + name = \"cml-elde6fnyv0\"\n + region = \"eu-west\"\n + repo = \"https://git.eon-cds.de/F18771/cml-runner-blobfuse\"\n + single = false\n + spot = true\n + spot_price = -1\n + ssh_public = (known after apply)\n + token = (sensitive value)\n }\n\nPlan: 1 to add, 0 to change, 0 to destroy.\niterative_cml_runner.runner: Creating...\niterative_cml_runner.runner: Still creating... [10s elapsed]\niterative_cml_runner.runner: Still creating... [20s elapsed]\niterative_cml_runner.runner: Still creating... [30s elapsed]\niterative_cml_runner.runner: Still creating... [40s elapsed]\niterative_cml_runner.runner: Still creating... [50s elapsed]\niterative_cml_runner.runner: Still creating... [1m0s elapsed]\niterative_cml_runner.runner: Still creating... [1m10s elapsed]\niterative_cml_runner.runner: Still creating... [1m20s elapsed]\niterative_cml_runner.runner: Still creating... [1m30s elapsed]\niterative_cml_runner.runner: Still creating... [1m40s elapsed]\niterative_cml_runner.runner: Still creating... [1m50s elapsed]\niterative_cml_runner.runner: Still creating... [2m0s elapsed]\niterative_cml_runner.runner: Still creating... [2m10s elapsed]\niterative_cml_runner.runner: Still creating... [2m20s elapsed]\niterative_cml_runner.runner: Still creating... [2m30s elapsed]\niterative_cml_runner.runner: Still creating... [2m40s elapsed]\niterative_cml_runner.runner: Still creating... [2m50s elapsed]\niterative_cml_runner.runner: Still creating... 
[3m0s elapsed]\niterative_cml_runner.runner: Still creating... [3m10s elapsed]\niterative_cml_runner.runner: Still creating... [3m20s elapsed]\niterative_cml_runner.runner: Still creating... [3m30s elapsed]\niterative_cml_runner.runner: Still creating... [3m40s elapsed]\n\n\t╷\n│ Error: Failed creating the machine: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: <?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n│ <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n│ \t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n│ <html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n│ <head>\n│ <title>404 - Not Found</title>\n│ </head>\n│ <body>\n│ <h1>404 - Not Found</h1>\n│ </body>\n│ </html>\n│ Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=[MASKED]&resource=https%3A%2F%2Fmanagement.azure.com%2F\n│ \n│ with iterative_cml_runner.runner,\n│ on main.tf line 8, in resource \"iterative_cml_runner\" \"runner\":\n│ 8: resource \"iterative_cml_runner\" \"runner\" {\n│ \n╵\n\n at /usr/lib/node_modules/@dvcorg/cml/src/utils.js:20:27\n at ChildProcess.exithandler (node:child_process:406:5)\n at ChildProcess.emit (node:events:527:28)\n at maybeClose (node:internal/child_process:1092:16)\n at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)","status":"terminated"}
{"level":"info","message":"waiting 10 seconds before exiting..."}
My team and I could not work out what the problem is.
Additional info:
- I also tried cml with the iterativeai/cml:0-dvc2-base1 Docker image
- I tried to use an Azure-specific type and region, but without success
Any help would be very much appreciated.
Error message, extracted from the logs above
Error: Failed creating the machine: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: <?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>404 - Not Found</title>
</head>
<body>
<h1>404 - Not Found</h1>
</body>
</html>
Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=[MASKED]&resource=https%3A%2F%2Fmanagement.azure.com%2F
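Pulling the error out of the JSON log by hand is tedious, so here is a rough sketch of how it could be automated. It assumes each log entry is a single JSON object per line with `level` and `message` keys, as in the output above, and that Terraform frames the diagnostic between a `╷` line and a `╵` line. The function name `extract_terraform_error` is made up for this example.

```python
import json
import re


def extract_terraform_error(log_line):
    """Return the text of the first Terraform error box in a JSON log line.

    Returns None if the line is not an error entry or contains no error box.
    """
    entry = json.loads(log_line)
    if entry.get("level") != "error":
        return None
    # Terraform frames diagnostics between a "╷" line and a "╵" line.
    match = re.search(r"╷\n(.*?)\n╵", entry.get("message", ""), re.DOTALL)
    if match is None:
        return None
    # Drop the "│" gutter that prefixes every line inside the box.
    lines = [re.sub(r"^│ ?", "", line) for line in match.group(1).split("\n")]
    return "\n".join(lines).strip()
```

Only the `message` field is inspected; the `stack` field repeats the same text and is ignored.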
Hello, @francesco086! It sounds like an issue with your credentials. How did you get them?
Mmm... the error message is not very clear imho, it would be nice if it could be more explicit.
I will be off of work for the next days, will check the credentials again next week and come back with an update. For the time being, many thanks for the very fast reaction!!
> Mmm... the error message is not very clear imho, it would be nice if it could be more explicit.
Unfortunately, CML just relays what Azure reports...
Am I right in assuming that cml tries to create a resource group dedicated to the VM used for the runner? Is there any way to customize this behavior (i.e. supply an existing resource group)? Because I'm not convinced incorrect credentials are the whole story. Rather I suspect that CML tries to create the resource group, but fails (because the SP doesn't have the necessary role), and then tries to obtain an access token to a resource group (iterative-37d31qzqeb13b in our example) that doesn't exist (status code is 404 after all).
> Am I right in assuming that cml tries to create a resource group dedicated to the VM used for the runner?
Yes
> Is there any way to customize this behavior (i.e. supply an existing resource group)?
No, that's part of the hardcoded behavior of cml runner.
> Because I'm not convinced incorrect credentials are the whole story. Rather I suspect that CML tries to create the resource group, but fails (because the SP doesn't have the necessary role), and then tries to obtain an access token to a resource group (iterative-37d31qzqeb13b in our example) that doesn't exist (status code is 404 after all).
Sounds plausible :+1:
See this guide and the permissions/az directory in the provider repository for a list of required permissions.
@0x2b3bfa0 thank you! Then I guess that is the source of the problem, I think we can close the issue.
For us (@lleiding is a colleague of mine) this may be a bit problematic. In our team we are trying to set things up so that each project has its own resource group. So what @lleiding asked could become a feature request: "Is there any way to customize this behavior (i.e. supply an existing resource group)?"
There's --cloud-permission-set and more docs coming soon in https://github.com/iterative/cml.dev/pull/242?
EDIT: not relevant; see below.
Is this new option for GCP and AWS only? no Azure?
> In our team we are trying to set things up so that each project has its own resource group.
Do you have the possibility of having a separate subscription for every team instead?
> So what @lleiding asked could become a feature request: "Is there any way to customize this behavior (i.e. supply an existing resource group)?"
Feel free to open a follow-up issue, although it's unlikely that we will implement it anytime soon. The current functionality relies heavily on being able to delete a whole resource group with a single API call. 😅
> Is this new option for GCP and AWS only? no Azure?
Yes, still not supported on Azure, but please upvote & consider watching the following issues:
- https://github.com/iterative/terraform-provider-iterative/issues/235
- https://github.com/iterative/terraform-provider-iterative/issues/559
Note that --cloud-permission-set is not related to your issue, though: it's just to use managed identities inside your workflows.
Closing as per https://github.com/iterative/cml/issues/1019#issuecomment-1139778698; @francesco086, feel free to reopen this issue if you deem it opportune.
To recap, @0x2b3bfa0 / @francesco086: the issue is that on Azure nested resource groups aren't supported, and we use a resource group to clean up all resources with a single API call, but here they have credentials that are only valid for a predefined resource group.
> nested resource groups aren't supported
In the sense that they aren't even a thing: resource groups are a flat structure by design, and can't be nested.
Possible solutions
- Allow cml runner to create and delete resource groups at will, as long as their name matches a pattern[^1]
- Use a subscription to isolate runners instead of a resource group, as suggested on cml#1019 (comment)
- Use a fixed deployment (i.e. the officially recommended solution); hard to implement
[^1]: Azure role assignment conditions are still in preview 🙃
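To make the first option a bit more concrete, the name check could look roughly like the sketch below. This is plain Python for illustration only, not an Azure role assignment condition. The pattern is a guess based on the single resource group name visible in this thread (iterative-37d31qzqeb13b), and `is_runner_resource_group` is a hypothetical helper.

```python
import re

# Assumed pattern: the only name visible in this thread is
# "iterative-37d31qzqeb13b", so "iterative-" followed by lowercase letters
# and digits is a guess, not a documented naming rule.
RUNNER_GROUP_PATTERN = re.compile(r"iterative-[a-z0-9]+")


def is_runner_resource_group(name):
    """Return True if the whole name matches the assumed runner pattern."""
    return RUNNER_GROUP_PATTERN.fullmatch(name) is not None
```

With a rule of this shape, the credentials could manage runner resource groups while leaving per-project groups untouched.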
@iterative/cml, any objections to wontfix for now?