
Add ability to prioritize GitHub Action runners

Open gajus opened this issue 3 years ago • 40 comments

Describe the enhancement

Self-hosted GitHub Actions runners should have an attribute (a weight) that allows prioritizing them, i.e. if there are multiple idle runners with matching labels, the weight attribute would determine which runner is used first, e.g. prioritized in ascending order.

Additional information

For context, the reason this is needed is that the current implementation picks an available runner at random. However, imagine that you are scaling runners up and down depending on how long they have been idle. With a random allocation mechanism, there is no efficient way to determine how long a runner has gone unused. As a result, we have a large portion of VMs running that are not in use most of the time.

Prioritization would allow more efficient resource packing.
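
To make the proposal concrete, here is a rough sketch of the selection rule such a weight would enable (illustrative Go only; no weight attribute like this exists in the Actions service today):

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical runner record; the Actions service exposes no such weight today.
type Runner struct {
	Name   string
	Weight int // lower weight = picked first (ascending order, as proposed)
	Idle   bool
}

// pickRunner returns the idle runner with the lowest weight, or nil if none is idle.
func pickRunner(runners []Runner) *Runner {
	idle := make([]*Runner, 0, len(runners))
	for i := range runners {
		if runners[i].Idle {
			idle = append(idle, &runners[i])
		}
	}
	if len(idle) == 0 {
		return nil
	}
	sort.Slice(idle, func(a, b int) bool { return idle[a].Weight < idle[b].Weight })
	return idle[0]
}

func main() {
	pool := []Runner{
		{Name: "runner-a", Weight: 30, Idle: true},
		{Name: "runner-b", Weight: 10, Idle: true},
		{Name: "runner-c", Weight: 20, Idle: false},
	}
	if r := pickRunner(pool); r != nil {
		fmt.Println("would assign job to", r.Name) // runner-b
	}
}
```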

gajus avatar Feb 07 '22 16:02 gajus

Hi @gajus, thanks for the reported issue and idea. I'm adding a future label so that we can work on developing this in the near future.

ruvceskistefan avatar Feb 08 '22 07:02 ruvceskistefan

@ruvceskistefan Do you have any update on this? This severely impacts our ability to scale GitHub Runners, costing us literally tens of thousands monthly.

gajus avatar Feb 23 '22 23:02 gajus

Hi @gajus - can you tell me more about this? It sounds like you're rolling your own autoscaling solution? Does the ephemeral runner option not give you enough control here?

ethomson avatar Feb 24 '22 15:02 ethomson

The original issue text already includes a description of how we orchestrate runners.

Whether we use the ephemeral option or not, the problem is that there is no way to prioritize which runners will be picked up first. This means that if we have 100 idle runners and 20 jobs, we have no way to say that these particular 20 idle runners should be used first. Without a weight or similar attribute to sort by, a large number of idle runners just sits waiting for jobs, because they keep getting random jobs (which they would otherwise not get if there were an assigned order).

gajus avatar Feb 24 '22 17:02 gajus

What I don't understand yet is whether these 20 runners in your example are different in some meaningful way that you really want to route to these 20? In other words, does it matter which 20 runners are getting jobs, or do you just want to scale down to any 20 runners? You mention "more efficient resource packing" but I don't feel like I have the full picture yet.

ethomson avatar Feb 24 '22 18:02 ethomson

Chiming in that we have a pool of machines that we use as runners, though some machines run significantly faster than others (reducing build time significantly). Ideally we want to be able to prioritize allocating jobs to the faster machines to reduce build times, but keep the slower machines active so they can pick up jobs while the faster ones are busy. Prioritization would be very beneficial here.

al2114 avatar Feb 26 '22 00:02 al2114

Thanks @al2114 - are you running static runners? I can understand the need for some more advanced routing there, but I'm trying to better understand the need for weights when auto scaling, or when using some sort of control plane.

ethomson avatar Feb 26 '22 00:02 ethomson

What I don't understand yet is whether these 20 runners in your example are different in some meaningful way that you really want to route to these 20?

No. All machines are identical.

Here is a simple scenario. You have 100 machines, and 20 jobs that start every minute and complete within a minute.

What happens in the current setup?

Every minute a random 20 machines will get picked from the pool.

Why is that bad?

Machines that have not been used for 10 minutes are automatically removed from the pool. If jobs are assigned randomly, machines that otherwise would not need to be used end up being used. Therefore, you will always have 100 machines running even though 20 would suffice.

What's desired?

A way to prioritize which machines should get picked first. This way, the oldest machines (as an example) will always get used first, and the rest will soon time out and disconnect from the pool.
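
For illustration, a minimal sketch of that interplay between oldest-first assignment and the idle timeout (illustrative Go; this only models the scenario above, not anything GitHub exposes):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Illustrative model of the scenario described above; not a real GitHub API.
type machine struct {
	name     string
	started  time.Time // when the VM joined the pool
	lastUsed time.Time // when it last picked up a job
	busy     bool
}

const idleTimeout = 10 * time.Minute

// assign gives the job to the oldest idle machine, so newer machines stay idle
// long enough to hit the timeout and drop out of the pool.
func assign(pool []*machine, now time.Time) *machine {
	var idle []*machine
	for _, m := range pool {
		if !m.busy {
			idle = append(idle, m)
		}
	}
	if len(idle) == 0 {
		return nil
	}
	sort.Slice(idle, func(a, b int) bool { return idle[a].started.Before(idle[b].started) })
	m := idle[0]
	m.busy = true
	m.lastUsed = now
	return m
}

// reap drops machines that have sat idle longer than the timeout.
func reap(pool []*machine, now time.Time) []*machine {
	var kept []*machine
	for _, m := range pool {
		if m.busy || now.Sub(m.lastUsed) < idleTimeout {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	now := time.Now()
	pool := []*machine{
		{name: "old", started: now.Add(-1 * time.Hour), lastUsed: now.Add(-5 * time.Minute)},
		{name: "new", started: now.Add(-15 * time.Minute), lastUsed: now.Add(-11 * time.Minute)},
	}
	if m := assign(pool, now); m != nil {
		fmt.Println("job goes to", m.name) // "old": the oldest idle machine
	}
	fmt.Println("machines kept:", len(reap(pool, now))) // "new" times out and is removed
}
```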

gajus avatar Feb 28 '22 01:02 gajus

your autoscaling solution is probably a bit too naive.

your assumption that the jobs are assigned is also wrong (I'm pretty sure). jobs are pulled by the runners, and not pushed to them. each runner periodically queries to see if any work is available. the first runner to pull that info after it is available gets it, and does the work. in order for a priority system to be implemented, all runners you host would need to talk to each other to know who should poll next.

the GitHub Actions Runner system, in its entirety, is making the assumption that the runner virtual machines are used a single time, then destroyed and replaced with a fresh VM when needed. solutions which do not keep that assumption in mind are going to have a difficult time adjusting to how GHA works.

I wrote an orchestrator in Go which uses workflow_job payloads exclusively to know when to destroy runner VMs and when to bring more online. everything is very smooth since I accepted that single-use runners (GitHub calls them ephemeral runners) were the correct approach, and stopped fighting it.
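
roughly, that kind of orchestrator looks like this (a trimmed-down sketch, not the real code; the provisioning helpers are placeholders for whatever cloud API you use, and webhook signature validation is omitted):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Minimal shape of the workflow_job webhook payload (only the fields used here).
type workflowJobEvent struct {
	Action      string `json:"action"` // "queued", "in_progress", "completed", ...
	WorkflowJob struct {
		ID     int64    `json:"id"`
		Labels []string `json:"labels"`
	} `json:"workflow_job"`
}

func handleWebhook(w http.ResponseWriter, r *http.Request) {
	var ev workflowJobEvent
	if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	switch ev.Action {
	case "queued":
		// A job is waiting: bring one ephemeral runner online with matching labels.
		go provisionEphemeralRunner(ev.WorkflowJob.Labels) // placeholder helper
	case "completed":
		// The runner that ran this job is single-use: tear its VM down.
		go destroyRunnerForJob(ev.WorkflowJob.ID) // placeholder helper
	}
	w.WriteHeader(http.StatusOK)
}

// Stubs standing in for your hypervisor / cloud provider API.
func provisionEphemeralRunner(labels []string) { log.Println("provisioning runner for", labels) }
func destroyRunnerForJob(jobID int64)          { log.Println("destroying runner VM for job", jobID) }

func main() {
	http.HandleFunc("/webhook", handleWebhook)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```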

naikrovek avatar May 05 '22 15:05 naikrovek

I would like something kind of relevant to this.

I would like to be able to give runners the ability to opt out of polling based on some health check. In my instance, I have a few VMs, each with a few runners running, and I want a runner to be able to recognize that there is, let's say, 95% memory usage and not pick up a job. This would allow a runner on a less congested VM to pick it up. Right now, a congested VM will sometimes still pick up the job and then OOM.

This would not require runners communicating with each other, just polling some endpoint, either through something like curl or a file/socket: if the number is 1, pick up the job; if 0, don't.

Would be super helpful in distributing jobs
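
As a rough sketch of what I mean (nothing in the runner supports this today; the endpoint path, port, and threshold are made up), the polled health check could be as simple as:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"os"
	"strconv"
	"strings"
)

// readMeminfo returns MemTotal and MemAvailable in kB from /proc/meminfo (Linux only).
func readMeminfo() (total, available uint64, err error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) < 2 {
			continue
		}
		v, _ := strconv.ParseUint(fields[1], 10, 64)
		switch fields[0] {
		case "MemTotal:":
			total = v
		case "MemAvailable:":
			available = v
		}
	}
	return total, available, s.Err()
}

func main() {
	http.HandleFunc("/healthy", func(w http.ResponseWriter, r *http.Request) {
		total, available, err := readMeminfo()
		// Answer "1" (safe to take a job) only if less than 95% of memory is in use.
		if err == nil && total > 0 && float64(total-available)/float64(total) < 0.95 {
			fmt.Fprint(w, "1")
			return
		}
		fmt.Fprint(w, "0")
	})
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```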

jharris-tc avatar Jul 21 '22 20:07 jharris-tc

As another example for this, we have M1 and Intel self hosted Mac runners.

The M1s are so much faster that we'd love a way to give them priority over the intel runners and only send jobs to the intel runners if all the M1 runners are busy.

The weight solution would work but really anything that allows us to set precedence would be great.

idyll avatar Aug 10 '22 15:08 idyll

I am also interested in this. This is my generic question which lead me here: https://github.com/orgs/community/discussions/30693

This issue seems like a great place to add more context and make it specific so that you can better determine if this is a legit +1

Our Test Universe workflow is highly variable, but it usually takes ~20 minutes to complete. When we run it locally, it completes within 5 minutes.

When we parallelise, it completes in less than 2 minutes. That is a significant 10x speed-up. We cannot parallelise it in GitHub Actions because we hit the 7GB memory limit (context). We would like to use self-hosted GitHub Runners in order to achieve this 10x speed-up.

If our own self-hosted GitHub runners are not available (busy, offline, etc.), free GitHub Runners should pick up those jobs. Currently, if we were to use jobs.<job_id>.runs-on: self-hosted, we would be excluding the free GitHub Runners.

I have two questions:

  1. Would it make sense to add jobs.<job_id>.runs-on-preferred: [self-hosted, ubuntu-latest] ?
  2. Can you think of a different way of achieving the above without this feature?

Thank you!

gerhard avatar Aug 25 '22 12:08 gerhard

The new larger GitHub Actions hosted runners make my previous comment a non-issue. This new feature has already made a huge positive difference for us: https://github.com/dagger/dagger/pull/3277#issuecomment-1270640486. Great job everyone! 🤘

gerhard avatar Oct 07 '22 12:10 gerhard

Interested in this feature, too! There would probably also need to be some changes on the actions-runner-controller side, like a scale-down hook or relocating runner pods. But I'm curious to see what comes of this feature. 🙂

Thanks to all contributors. 🤘

SPONGE-JL avatar Jan 05 '23 13:01 SPONGE-JL

I'm also interested in this. We are using self-hosted runners to provide different testing hardware environments, and therefore have some runners with few labels and some with many labels. Our issue is that jobs with fewer labels quite often get picked up by runners with many labels, so jobs that need more labels have to queue. If a weight or priority could be put on the runners with the fewest labels, we could send jobs there first and only use the runners with many labels when the rest are full, which is more likely to leave room for jobs that require many labels. We don't want to exclude jobs with only a couple of labels from some runners; we just don't want them to randomly take up capacity when a more suitable place is available.

Martiix avatar Jan 16 '23 12:01 Martiix

I also support this use case. With a big heterogeneous runner pool (100+ runners with 2-64 core CPUs), lots of label variations, and attached HW, it is very hard to utilize all HW efficiently in both high-load and low-load scenarios.

Ideally, the load balancer should use the history of jobs and the history of runners to do dynamic scheduling.

Doing hard-coded weighting, as proposed here, is going to be hard to do correctly at scale and hard to maintain when jobs change characteristics or when new types of runners are added. For this solution to work, I think it must at least be exposed in an API so that it is possible to re-weight all runners programmatically on a schedule.

Still, it would be a more impactful feature if GitHub could schedule for us automatically.

nedrebo avatar Jan 16 '23 13:01 nedrebo

Adding this enhancement would make a lot of sense and help users a lot. For example, we could prioritize runners with lower latency first (labelled based on the location of those VMs) and treat the other runners (VMs placed elsewhere) as second priority.

vallabbharath avatar Jan 19 '23 06:01 vallabbharath

Adding my 2 cents here.

It would be great if we could remove the default labels (self-hosted, linux, x64, for example) attached to runners. In a lot of cases, workflows only include the self-hosted label to target runners. So no matter what custom labels we set and what pools of self-hosted runners we manage, a random runner may pick up the job due to the non-unique label sets defined in the workflow.

For example, if you have a set of runners with custom labels like:

  • gpu
  • medium-memory

And another with:

  • fpga
  • high-memory

And the workflow author defines just self-hosted (because it's easy and the documentation encourages authors to use it to target self hosted runners), you may get a runner with a gpu when in fact you required one with fpga. Yes, org/enterprise members could be encouraged to target runners using only user defined labels, but when you have hundreds of teams, this can become quite tedious.

Runner groups are only available to enterprise users. Removing the default labels would be useful even for single repos with more collaborators and free tier orgs that a lot of open source projects use.

Allowing us to remove the default labels would give us the ability to define unique label sets and thus schedule jobs more efficiently. It would also allow us to better react to the queued workflow webhook and pick the right runner type to spin up, if we have automation tools that let us define multiple pools with different characteristics (like those detailed above). This way we don't need to keep idle runners around; we could just spin one up when a queued event is detected. But we can't do that efficiently if we define multiple runner types and the workflow just targets self-hosted. We potentially end up with the wrong runner type.
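
To illustrate, once label sets are unique, the matching an autoscaler has to do on a queued event becomes trivial (rough Go sketch; the pool names are made up):

```go
package main

import "fmt"

// Each pool is defined by the label set its runners register with.
type pool struct {
	name   string
	labels map[string]bool
}

// pickPool returns the first pool whose label set covers every label the job asked for.
// With the default labels removed, unique label sets make this match unambiguous.
func pickPool(pools []pool, requested []string) (string, bool) {
	for _, p := range pools {
		ok := true
		for _, l := range requested {
			if !p.labels[l] {
				ok = false
				break
			}
		}
		if ok {
			return p.name, true
		}
	}
	return "", false
}

func main() {
	pools := []pool{
		{name: "gpu-pool", labels: map[string]bool{"gpu": true, "medium-memory": true}},
		{name: "fpga-pool", labels: map[string]bool{"fpga": true, "high-memory": true}},
	}
	// Labels taken from a queued workflow_job event.
	name, ok := pickPool(pools, []string{"fpga", "high-memory"})
	fmt.Println(name, ok) // fpga-pool true
}
```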

Hoping this makes sense :smile: .

gabriel-samfira avatar Feb 16 '23 10:02 gabriel-samfira

It would be great if we could remove the default labels (self-hosted, linux, x64 for example) attached to runners.

An unsupported way to remove the default labels is to delete them from the configuration function.

https://github.com/actions/runner/blob/982784d704cd9dafb56457cba5dba5e4986d769f/src/Runner.Listener/Configuration/ConfigurationManager.cs#L531-L533

After you have deleted these 3 lines, compile the actions/runner and use it to configure all your runners.

The last time I checked, this worked just fine as long as you provided your own labels.

You don't have to worry about auto-updates: as long as your runner is already configured, your label change is stored online and won't change.

ChristopherHX avatar Feb 16 '23 13:02 ChristopherHX

An unsupported way to remove the default labels is to delete them from the configuration function.

Yup. I wanted to create a PR that adds a --no-default-labels knob :smile: . We wrote an autoscaler for self-hosted runners which is used by a few folks, and it would help if we could tell them they can use the officially supported runner instead of a fork that may stop working at some point.

gabriel-samfira avatar Feb 16 '23 13:02 gabriel-samfira

if a workflow author is not specific with their requirements via the labels, that is on them, in my mind.

I would set up a webhook which sends all workflow runs to a tool which reads them and files a new issue on every repo which runs actions that only specify self-hosted.

our guidelines are to always specify the OS and CPU architecture in labels at a bare minimum.

naikrovek avatar Feb 17 '23 19:02 naikrovek

if a workflow author is not specific with their requirements via the labels, that is on them, in my mind.

Yes, it is on them, but in the meantime, they may end up needlessly consuming instances that are more expensive/scarce (like GPU enabled instances). It also makes it difficult to spin up the right instance types, on the right hierarchy level (repo vs org vs enterprise), on-demand.

In any case, I opened a PR here: https://github.com/actions/runner/pull/2443

It makes the default labels optional (by default they are added), while still ensuring at least one label is added to the runner.

It feels like better UX to add only the labels you want. If that includes the default labels, great. If not, also great.

gabriel-samfira avatar Feb 17 '23 20:02 gabriel-samfira

any news on the prioritization? seems like a core feature that's missing. sometimes there are many jobs waiting for machines, and many machines waiting for jobs. this shouldn't happen in any well-designed system

xucian avatar Feb 20 '23 22:02 xucian

Yes, PR https://github.com/actions/runner/pull/2443 is a good one. But prioritization is definitely a much-needed feature.

A workflow author should be able to say "use runners with these labels if they are available; if they are not available, use runners with another label". Currently that's not possible. As xucian mentioned in the previous comment, either we set the labels to strictly match the high-resource runners and have to wait for them without utilizing the priority-2 (low-resource) runners, or we set the labels to match both and have to live with the compromise of not utilizing the high-resource runners 50% of the time, even though they might be available.

vallabbharath avatar Feb 21 '23 13:02 vallabbharath

I think it would be even more useful to be able to mark labels as required when configuring/registering a runner, e.g. ./config.sh --labels "ubuntu,large:required" .... With this configuration, the runner should behave as follows (a rough matching sketch follows the list):

  • A Job with runs-on: ['self-hosted', 'ubuntu', 'large'] would be executed on that runner
  • A Job with runs-on: ['self-hosted', 'large'] would be executed on that runner
  • A Job with runs-on: ['self-hosted', 'ubuntu'] would not be executed on that runner
  • A Job with runs-on: ['self-hosted'] would not be executed on that runner
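
A rough sketch of that matching rule (illustrative Go; the :required syntax is only a proposal and none of this exists in the runner today):

```go
package main

import "fmt"

// Illustrative matching rule for the proposed "label:required" syntax.
// A runner accepts a job only if (a) every label the job asks for is on the runner,
// and (b) every label the runner marks as required appears in the job's runs-on list.
func matches(runnerLabels map[string]bool, requiredLabels []string, runsOn []string) bool {
	for _, l := range runsOn {
		if !runnerLabels[l] {
			return false
		}
	}
	jobHas := make(map[string]bool, len(runsOn))
	for _, l := range runsOn {
		jobHas[l] = true
	}
	for _, req := range requiredLabels {
		if !jobHas[req] {
			return false
		}
	}
	return true
}

func main() {
	// ./config.sh --labels "ubuntu,large:required" plus the implicit self-hosted label.
	runner := map[string]bool{"self-hosted": true, "ubuntu": true, "large": true}
	required := []string{"large"}

	fmt.Println(matches(runner, required, []string{"self-hosted", "ubuntu", "large"})) // true
	fmt.Println(matches(runner, required, []string{"self-hosted", "large"}))           // true
	fmt.Println(matches(runner, required, []string{"self-hosted", "ubuntu"}))          // false
	fmt.Println(matches(runner, required, []string{"self-hosted"}))                    // false
}
```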

qoomon avatar Apr 17 '23 16:04 qoomon

I think it would be even more useful to be able to mark labels as required when configuring/registering a runner, e.g. ./config.sh --labels "ubuntu,large:required" .... With this configuration, the runner should behave as follows:

  • A Job with runs-on: ['self-hosted', 'ubuntu', 'large'] would be executed on that runner
  • A Job with runs-on: ['self-hosted', 'large'] would be executed on that runner
  • A Job with runs-on: ['self-hosted', 'ubuntu'] would not be executed on that runner
  • A Job with runs-on: ['self-hosted'] would not be executed on that runner

good idea, finer-grained (optional) control is always welcome. before that, we could just have the positions of the labels implicitly denote priority. I think 99% of the limitations would be solved this way

xucian avatar Apr 18 '23 08:04 xucian

Any update on this ticket? I want to prioritize M2 Pro machines instead of M1 machines (for example).

Kaspik avatar Aug 01 '23 17:08 Kaspik

if there were updates on this issue you would see updates right here.

naikrovek avatar Aug 01 '23 20:08 naikrovek

I'm in the same boat. I'd like to prioritise certain servers over others, as build times can vary as much as 5x depending on the server.

QuixThe2nd avatar Aug 03 '23 20:08 QuixThe2nd

The only way to do this currently is to label your larger runners with different labels than your smaller runners. And once you do this, your users will discover a new way to make you lose the fight, and everyone will choose the larger runners because they're faster.

naikrovek avatar Aug 04 '23 15:08 naikrovek