attach multiple projects/runners to a machine
Dear CML team, I hope you can give me some advice for my use case. I have multiple on-premise machines that can run training pipelines when new data has been collected or when someone comes up with a new set of hyper-parameters. When a machine is training a model for one project, I do not want it to simultaneously train a model for a different project, for obvious reasons (resource contention). However, I don't want to bind a machine to a single project either, because then it would sit idle 90% of the time. I'd like to make all machines (or cml runners) available to all projects. While that is possible by creating multiple cml runners per machine, they do not share their "idle"/"active" state with each other (at least by default).
How can I either:
- Share the idle/active state of a runner with all other runners on the same machine
- Use a single runner for multiple GitHub repositories
- Ensure that a `cml runner` cannot be picked by a GitHub Action when the machine is already at, e.g., 90% GPU load
Thanks in advance!
> Use a single runner for multiple GitHub repositories
As per #271 / #277, you can use `cml runner --repo https://github.com/{organization}` to register a runner for all the repositories in {organization}.
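For example (a sketch; the token environment variable name and labels are illustrative — supply a personal access token with the scopes your setup requires):

```shell
# Register one runner for every repository in the organization
cml runner \
  --repo=https://github.com/{organization} \
  --token="$PERSONAL_ACCESS_TOKEN" \
  --labels=cml,gpu
```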
> Share the idle/active-state of a runner on a machine with all runners on the same machine
This isn't exactly easy to implement: every runner keeps its own internal state. Still, you can use `cml runner --labels=example,one` and `runs-on: [self-hosted, example, one]` in your workflow to select a specific runner.
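On the workflow side, the pairing looks like this (a minimal sketch; the job name and label names are just examples matching the command above):

```yaml
# .github/workflows/train.yml (fragment)
jobs:
  train:
    # Only runners registered with both the "example" and "one" labels
    # (plus the implicit "self-hosted" label) will pick this job up.
    runs-on: [self-hosted, example, one]
    steps:
      - uses: actions/checkout@v3
```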
> Ensure that a `cml runner` cannot be picked by a GitHub Action when the machine is already at, e.g., 90% GPU load
Definitely not possible out of the box, but you can probably script something to implement that behavior. 😅
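One way to script that behavior (a sketch, not part of CML: it parses `nvidia-smi` output and only registers the runner while GPU utilization is below a threshold; the 90% threshold and the runner labels are illustrative):

```python
import subprocess


def max_gpu_utilization(smi_csv: str) -> int:
    """Parse the output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`
    (one integer percentage per GPU, one per line) and return the
    busiest GPU's load."""
    return max(int(line.strip()) for line in smi_csv.strip().splitlines())


def gpu_is_free(threshold: int = 90) -> bool:
    """Return True if every GPU is below the utilization threshold."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return max_gpu_utilization(out) < threshold


if __name__ == "__main__":
    if gpu_is_free():
        # Hand over to the runner; the arguments here are illustrative.
        subprocess.run(["cml", "runner", "--labels=gpu"], check=True)
    else:
        print("GPU busy, not registering a runner")
```

You'd run such a wrapper from cron or a supervisor instead of starting `cml runner` directly; note it only gates registration, so a job that starts while the GPU is free can still overlap with manual work that begins later.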
> Ensure that a `cml runner` cannot be picked by a GitHub Action when the machine is already at, e.g., 90% GPU load

It's an interesting feature, and I have felt this pain; however, it would not be very useful to implement yet, due to idle-job timeouts.
> Ensure that a `cml runner` cannot be picked by a GitHub Action when the machine is already at, e.g., 90% GPU load
>
> It's an interesting feature, and I have felt this pain; however, it would not be very useful to implement yet, due to idle-job timeouts.
Runners only execute jobs one at a time. Theoretically, for the runner to pick up a job while at 90% usage (presumably from processing another job), you would need to install the agent multiple times on the same instance, which is within the realm of possibility. Though I would argue that is outside of what `cml runner` should do. (You could invoke it twice, using labels, for the two jobs you want to run in parallel.)
> you could invoke it twice using labels for your two jobs you want to run in parallel
You could also invoke it N times with the same labels but different limits (e.g. different `CUDA_VISIBLE_DEVICES`, different CPU/RAM restrictions), effectively creating a pool of runners. Parallel CI jobs will auto-pick from said pool.
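For instance (a sketch; the label name and the two-GPU split are illustrative):

```shell
# Two runners on one 2-GPU machine, each pinned to its own GPU.
# Jobs with `runs-on: [self-hosted, gpu-pool]` will pick whichever is idle.
CUDA_VISIBLE_DEVICES=0 cml runner --labels=gpu-pool &
CUDA_VISIBLE_DEVICES=1 cml runner --labels=gpu-pool &
```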
> Ensure that a `cml runner` cannot be picked by a GitHub Action when the machine is already at, e.g., 90% GPU load
>
> It's an interesting feature, and I have felt this pain; however, it would not be very useful to implement yet, due to idle-job timeouts.
>
> Runners only execute jobs one at a time. Theoretically, for the runner to pick up a job while at 90% usage (presumably from processing another job), you would need to install the agent multiple times on the same instance, which is within the realm of possibility. Though I would argue that is outside of what `cml runner` should do. (You could invoke it twice, using labels, for the two jobs you want to run in parallel.)
Dear @dacbd, this also happens when you run experiments on a machine by hand while the machine is registered as a runner. In our case, we do not want to dedicate a machine to pipelines only, so we run both manual experiments and pipelines on the same machine. We could shut down the runner every time we do experiments and start it again once we're finished.
> shut down the runner every time we do experiments and start it again once we're finished
That sounds like the best solution for your use case (on-prem, mixed-use, i.e. local & runner).
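One convenient way to do that is to run the runner under systemd so it can be stopped and started with a single command (a sketch; the unit name, binary path, and labels are illustrative):

```ini
# /etc/systemd/system/cml-runner.service (illustrative)
[Unit]
Description=CML self-hosted runner

[Service]
ExecStart=/usr/local/bin/cml runner --labels=gpu
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Then `sudo systemctl stop cml-runner` before manual experiments and `sudo systemctl start cml-runner` afterwards.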
@dnns92 please re-open this issue if you have any further questions.