attach multiple projects/runners to a machine
Dear CML team, I hope you can give me some advice for my use case. I have multiple on-premise machines that can run training pipelines when new data has been collected or when someone comes up with a new set of hyper-parameters. When a machine is training a model for one project, I do not want it to simultaneously train a model for a different project, for obvious reasons (resource contention). However, I don't want to bind a machine to a single project either, because then it would sit idle 90% of the time. I'd like to make all machines (or cml runners) available to all projects. While that is possible by creating multiple cml runners per machine, they do not share their "idle"/"active" state with each other (at least by default).
How can I either:
- Share the idle/active state of a runner with all other runners on the same machine
- Use a single runner for multiple GitHub repositories
- Ensure that a `cml runner` cannot be picked by a GitHub Action when the machine is already at, e.g., 90% GPU load
Thanks in advance!
> Use a single runner for multiple GitHub repositories
As per #271 / #277, you can use `cml runner --repo https://github.com/{organization}` to register a runner for all the repositories in {organization}.
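For example (a sketch; the token environment variable name and labels are illustrative — supply a personal access token with the scopes your setup requires):

```shell
# Register one runner for every repository in the organization
cml runner \
  --repo=https://github.com/{organization} \
  --token="$PERSONAL_ACCESS_TOKEN" \
  --labels=cml,gpu
```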
> Share the idle/active-state of a runner on a machine with all runners on the same machine
This isn't exactly easy to implement: every runner keeps its own internal state. Still, you can use `cml runner --labels=example,one` and `runs-on: [self-hosted, example, one]` in your workflow to select a specific runner.
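On the workflow side, the pairing looks like this (a minimal sketch; the job name and label names are just examples matching the command above):

```yaml
# .github/workflows/train.yml (fragment)
jobs:
  train:
    # Only runners registered with both the "example" and "one" labels
    # (plus the implicit "self-hosted" label) will pick this job up.
    runs-on: [self-hosted, example, one]
    steps:
      - uses: actions/checkout@v3
```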
> Ensure that a `cml runner` cannot be picked by a GitHub Action when the machine is already at, e.g., 90% GPU load
Definitely not possible out of the box, but you can probably script something to implement that behavior. 😅
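One way to script that behavior (a sketch, not part of CML: it parses `nvidia-smi` output and only registers the runner while GPU utilization is below a threshold; the 90% threshold and the runner labels are illustrative):

```python
import subprocess


def max_gpu_utilization(smi_csv: str) -> int:
    """Parse the output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`
    (one integer percentage per GPU, one per line) and return the
    busiest GPU's load."""
    return max(int(line.strip()) for line in smi_csv.strip().splitlines())


def gpu_is_free(threshold: int = 90) -> bool:
    """Return True if every GPU is below the utilization threshold."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return max_gpu_utilization(out) < threshold


if __name__ == "__main__":
    if gpu_is_free():
        # Hand over to the runner; the arguments here are illustrative.
        subprocess.run(["cml", "runner", "--labels=gpu"], check=True)
    else:
        print("GPU busy, not registering a runner")
```

You'd run such a wrapper from cron or a supervisor instead of starting `cml runner` directly; note it only gates registration, so a job that starts while the GPU is free can still overlap with manual work that begins later.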
> Ensure that a `cml runner` cannot be picked by a GitHub Action when the machine is already at, e.g., 90% GPU load

It's an interesting feature, and I have felt this pain; however, it would not be very useful to implement yet, due to idle-job timeouts.
> Ensure that a `cml runner` cannot be picked by a GitHub Action when the machine is already at, e.g., 90% GPU load
>
> It's an interesting feature, and I have felt this pain; however, it would not be very useful to implement yet, due to idle-job timeouts.
Runners only execute jobs one at a time. Theoretically, for the runner to pick up a job while at 90% usage (presumably from processing another job), you would need to install the agent multiple times on the same instance, which is within the realm of possibility. Though I would argue that is outside of what `cml runner` should do. (You could invoke it twice, using labels, for the two jobs you want to run in parallel.)
> you could invoke it twice using labels for your two jobs you want to run in parallel
You could also invoke it N times with the same labels but different limits (e.g. different `CUDA_VISIBLE_DEVICES`, different CPU/RAM restrictions), effectively creating a pool of runners. Parallel CI jobs will auto-pick from said pool.
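For instance (a sketch; the label name and the two-GPU split are illustrative):

```shell
# Two runners on one 2-GPU machine, each pinned to its own GPU.
# Jobs with `runs-on: [self-hosted, gpu-pool]` will pick whichever is idle.
CUDA_VISIBLE_DEVICES=0 cml runner --labels=gpu-pool &
CUDA_VISIBLE_DEVICES=1 cml runner --labels=gpu-pool &
```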
> Ensure that a `cml runner` cannot be picked by a GitHub Action when the machine is already at, e.g., 90% GPU load
>
> It's an interesting feature, and I have felt this pain; however, it would not be very useful to implement yet, due to idle-job timeouts.
>
> Runners only execute jobs one at a time. Theoretically, for the runner to pick up a job while at 90% usage (presumably from processing another job), you would need to install the agent multiple times on the same instance, which is within the realm of possibility. Though I would argue that is outside of what `cml runner` should do. (You could invoke it twice, using labels, for the two jobs you want to run in parallel.)
Dear @dacbd, this also happens when you run experiments on a machine by hand while the machine is registered as a runner. In our case, we do not want to dedicate a machine to pipelines only, so we run both manual experiments and pipelines on the same machine. We could shut down the runner every time we do experiments and start it again once we're finished.
> shut down the runner every time we do experiments and start it again once we're finished
That sounds like the best solution for your use case (on-prem, mixed-use, i.e. local & runner).
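One convenient way to do that is to run the runner under systemd so it can be stopped and started with a single command (a sketch; the unit name, binary path, and labels are illustrative):

```ini
# /etc/systemd/system/cml-runner.service (illustrative)
[Unit]
Description=CML self-hosted runner

[Service]
ExecStart=/usr/local/bin/cml runner --labels=gpu
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Then `sudo systemctl stop cml-runner` before manual experiments and `sudo systemctl start cml-runner` afterwards.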
@dnns92 please re-open this issue if you have any further questions.