Distribute jobs more evenly across hosts
Is your feature request related to a problem? Please describe.
We have hosts that can have different numbers of spawned agents. The priority for these is set by the spawn ID with the spawn-with-priority option.
The way priorities work now is that higher-numbered priorities are used first.
If hostA has 1 spawned agent running and hostB has 3 spawned agents running, hostB is going to be running at least 2 or maybe 3 tests while hostA is sitting idle waiting for jobs to be assigned to it.
Assuming all agents are idle, the order that jobs are assigned is:
- hostB agent3
- hostB agent2
- hostA agent1 or hostB agent1
- hostA agent1 or hostB agent1 (whichever was not given a job before)
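The ordering above can be reproduced with a small sketch. It assumes that each spawned agent's priority equals its spawn index (1, 2, 3, ...) and that the scheduler simply picks the highest priority first. The `Agent` class and the helper functions are hypothetical names for illustration, not part of the buildkite-agent codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    host: str
    spawn_index: int  # 1-based index of the spawned agent on its host

    @property
    def priority(self) -> int:
        # Assumption: with spawn-with-priority, priority == spawn index.
        return self.spawn_index


def spawn(host: str, count: int) -> list[Agent]:
    """Create `count` spawned agents for one host."""
    return [Agent(host, i) for i in range(1, count + 1)]


def current_order(agents: list[Agent]) -> list[Agent]:
    # Higher priority first. Agents with equal priority keep their input
    # order here; how the real scheduler breaks ties is not modelled.
    return sorted(agents, key=lambda a: a.priority, reverse=True)


agents = spawn("hostA", 1) + spawn("hostB", 3)
order = current_order(agents)
for agent in order:
    print(f"{agent.host} agent{agent.spawn_index}")
```

The first two jobs both go to hostB, and hostA only receives work once the priority-1 agents are reached.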
Describe the solution you'd like
The next agent would be chosen based on the spawned agent utilisation of each host.
- hostA has 1 spawned agent with 1 job running (100% utilisation)
- hostB has 3 spawned agents with 1 job running (33% utilisation)
- hostC has 5 spawned agents with 1 job running (20% utilisation)

The next host to be assigned work would be hostC because its current utilisation is the lowest. The agent on hostC that is given the work is determined based on the priority.
(Ideally that spawned agent prioritisation could also be flipped so hostC agent1 would be the first to be used instead of hostC agent5. Having that as a configuration option would be ace! I can split that out into a separate feature request if needed.)
- hostA has 1 spawned agent with 1 job running (100% utilisation)
- hostB has 3 spawned agents with 1 job running (33% utilisation)
- hostC has 5 spawned agents with 2 jobs running (40% utilisation)

Now, with hostC utilisation at 40%, the next host to be assigned a job would be hostB.
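As a rough sketch of the selection rule described above: compute each host's utilisation as running jobs divided by spawned agents, skip hosts with no idle agent, and pick the lowest. The function name, the data layout, and the tie-break by host name are assumptions made for illustration. This is not how the Buildkite scheduler is implemented.

```python
from fractions import Fraction
from typing import Optional


def next_host(hosts: dict[str, tuple[int, int]]) -> Optional[str]:
    """Pick the host with the lowest spawned-agent utilisation.

    `hosts` maps host name -> (running_jobs, spawned_agents).
    Hosts with no idle agent are skipped. Ties are broken by host
    name, which is an arbitrary choice for this sketch.
    """
    candidates = [
        (Fraction(running, spawned), name)
        for name, (running, spawned) in hosts.items()
        if running < spawned  # at least one idle agent
    ]
    if not candidates:
        return None
    return min(candidates)[1]


# First scenario from the issue: 100%, 33% and 20% utilisation.
hosts = {"hostA": (1, 1), "hostB": (1, 3), "hostC": (1, 5)}
first = next_host(hosts)

# Assign the job, which takes hostC to 2 of 5 agents busy (40%).
running, spawned = hosts[first]
hosts[first] = (running + 1, spawned)
second = next_host(hosts)

print(first, second)
```

Using `Fraction` avoids floating-point comparisons between values such as 1/3 and 2/5.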
Describe alternatives you've considered
I've spoken with Jarryd from Buildkite about this issue, but there don't appear to be any existing solutions for this use case.
Setting host priority doesn't work for situations where there are, say, two agents on a host. If that host is meant to be used first due to host priority, then the same situation would occur as the original problem, where one host is doing all the work while the other is sitting idle.
Additional context
We set the number of spawn agents in each host's config.
There are a variety of hardware profiles for our hosts, so some can only run one agent at a time, some run 3, and we're about to start trialling hosts that should be able to run 6 or more agents 🤞
Hi @nick-f thanks for your interest in the buildkite-agent! Apologies for taking a long time to get back to you.
The experience of running multiple agents on different-sized hosts is somewhat lacking, as you are finding. In particular, the backend scheduler is not fully aware of the assignment of agents to hosts. From its perspective, there are scheduled jobs and there are agents available to run those jobs, and it assigns the jobs to the agents without knowledge of how the agents are utilising their hosts. This decoupling keeps the scheduler simple, but an unfortunate side effect is situations where hosts are not being fully utilised.
The spawn-with-priority option was an attempt to address this. However, as you are finding, it is not a complete solution. The place to fix this is in the backend scheduler, and it is currently not designed to schedule jobs with this in mind.
It would be a significant redesign of the scheduler to make it more aware of both the hosts and the worker agents, and while this is a pain point for a significant portion of our customers, it is not a problem at all for others. So we are concentrating our efforts at the moment on making the buildkite-agent runnable in Kubernetes clusters. There, agent workers are spun up on demand, and we can take advantage of primitives offered in that ecosystem to bin-pack jobs onto hosts. So hopefully we will soon have a better story to tell in this space.
> This decoupling keeps the scheduler simple, but an unfortunate side effect is situations where hosts are not being fully utilised.
If the priorities were flipped (i.e. agents with spawn priority 1 were used first, etc.) then that would at least give the ability to spread the load across all hosts. For my example situation, the extra agents on the more powerful hosts would be used as overflow, once all the other hosts' agents are in use. The priority as it is now doesn't allow for this.
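To illustrate the overflow behaviour described here, the sketch below compares the two orderings for the hostA/hostB example. It assumes agents are identified by (host, spawn index) and that "flipped" means the lowest spawn index is used first. The function is hypothetical and only models the ordering, not the agent or scheduler internals.

```python
def assignment_order(
    spawned: dict[str, int], lowest_first: bool
) -> list[tuple[str, int]]:
    """Return (host, spawn_index) pairs in the order jobs would be assigned.

    `spawned` maps host name -> number of spawned agents.
    Agents with the same spawn index keep the hosts' input order,
    which is an arbitrary choice for this sketch.
    """
    agents = [
        (host, index)
        for host, count in spawned.items()
        for index in range(1, count + 1)
    ]
    return sorted(agents, key=lambda agent: agent[1], reverse=not lowest_first)


spawned = {"hostA": 1, "hostB": 3}
current = assignment_order(spawned, lowest_first=False)
flipped = assignment_order(spawned, lowest_first=True)

print("current:", current)
print("flipped:", flipped)
```

With the flipped order, the first two jobs land on different hosts, and hostB's agent2 and agent3 only pick up work once every host's agent1 is busy.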
> So we are concentrating our efforts at the moment on making the buildkite-agent runnable in Kubernetes clusters. There, agent workers are spun up on demand, and we can take advantage of primitives offered in that ecosystem to bin-pack jobs onto hosts. So hopefully we will soon have a better story to tell in this space.
Unfortunately that won't help us at all with our use case (we're running iOS tests on physical Mac Minis) and doesn't seem to be related to this issue or a solution to it at all.
If there's somewhere else to submit this feedback to as well I'm happy to do it. Just let me know where it should go.
With v3.45.0 released and the experimental flag enabled, the load is now being spread out across hosts 🎉
I'll leave this open while #1967 is still open, but it's looking good so far. Thanks!
I've closed #1967, but I'm happy to leave this open while we decide how to de-experiment-ify (make descending-spawn-priority the default? or re-use some ideas from #1967?)