a-guide-to-mlops
bug: CML Runner Registration
In chapter 15, CML successfully creates the runner on GCP, but the workflow then hangs at the setup-runner step.
Behaviour
- The CI/CD workflow starts on GitHub
- CML creates the runner on GCP
- The setup-runner step hangs on Terraform waiting: level":"info","message":"iterative_cml_runner.runner: Still creating...
- After 5-7 minutes, the GCP pod auto-terminates
- The GitHub workflow is still hanging with Terraform at the setup-runner step
Below is the output of the runner pod:
> kubectl logs -f cml-bo4s2uhzqs-2qx6z08y-ig1rgwq0-lg67g
Failed to get unit file state for cml.service: No such file or directory
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 84.5M 100 84.5M 0 0 28.4M 0 0:00:02 0:00:02 --:--:-- 37.8M
bash: line 24: lsof: command not found
{"level":"info","message":"POST /repos/leonardcser/mlops-test/actions/runners/registration-token - 201 in 275ms"}
{"level":"info","message":"GET /repos/leonardcser/mlops-test/actions/runners?per_page=100 - 200 in 215ms"}
{"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Launching github runner"}
{"level":"info","message":"Terraform 1.5.4"}
{"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
{"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
{"level":"info","message":"Outputs: 0"}
{"level":"warn","message":"Error connecting to ACPI socket: connect ENOENT /var/run/acpid.socket. The acpid.service helps with instance termination detection."}
{"level":"info","message":"POST /repos/leonardcser/mlops-test/actions/runners/registration-token - 201 in 317ms"}
{"date":"2023-08-03T09:15:06.304Z","level":"info","message":"runner status","repo":"https://github.com/leonardcser/mlops-test","status":"ready"}
{"level":"info","message":"Unregistering runner cml-bo4s2uhzqs-2qx6z08y-ig1rgwq0..."}
{"level":"info","message":"GET /repos/leonardcser/mlops-test/actions/runners?per_page=100 - 200 in 277ms"}
{"level":"info","message":"DELETE /repos/leonardcser/mlops-test/actions/runners/23 - 204 in 360ms"}
{"level":"info","message":"\tSuccess"}
{"level":"info","message":"Waiting 10 seconds to destroy"}
This output is similar to the one reported in this CML issue: https://github.com/iterative/cml/issues/1332
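In case it helps anyone compare their own pod logs, here is a minimal Python sketch that checks for the pattern visible above: the runner reports "ready" and is then unregistered shortly after. The function names (parse_entries, unregistered_after_ready) are hypothetical helpers, not part of CML, and the check only recognizes the message strings that appear in the log from this issue.

```python
import json


def parse_entries(log_text):
    """Return the JSON log entries found in the pod output, in order.

    Non-JSON lines (curl progress, shell errors) are skipped.
    """
    entries = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed line
        if isinstance(entry, dict):
            entries.append(entry)
    return entries


def unregistered_after_ready(entries):
    """True if a 'runner status: ready' entry is later followed by
    an 'Unregistering runner ...' entry."""
    ready_seen = False
    for entry in entries:
        message = str(entry.get("message", ""))
        if message == "runner status" and entry.get("status") == "ready":
            ready_seen = True
        elif ready_seen and message.startswith("Unregistering runner"):
            return True
    return False


# A few lines copied from the pod log in this issue.
SAMPLE = r'''
bash: line 24: lsof: command not found
{"level":"info","message":"Launching github runner"}
{"date":"2023-08-03T09:15:06.304Z","level":"info","message":"runner status","repo":"https://github.com/leonardcser/mlops-test","status":"ready"}
{"level":"info","message":"Unregistering runner cml-bo4s2uhzqs-2qx6z08y-ig1rgwq0..."}
{"level":"info","message":"Waiting 10 seconds to destroy"}
'''

if __name__ == "__main__":
    print(unregistered_after_ready(parse_entries(SAMPLE)))
```

To run it on a real pod, save the output of kubectl logs to a file and pass the file contents to parse_entries instead of SAMPLE.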
I can confirm having the same issue on my side. I don't have a clue why it doesn't work anymore but I'll let you know when I've found something.
@rmarquis, @leonardcser, I have added a new comment to the CML issue I opened last year about this problem. You can find it here: https://github.com/iterative/cml/issues/1415#issuecomment-1969077905.