llvm-zorg
Accept new requests on SVE builders only if idle
Some of Linaro's SVE builders are suffering from starvation because there is currently no way to limit how many builds a given builder can start simultaneously. As there are 4 SVE builders that may use any of the 4 G3 workers, sometimes some builders use multiple workers while others end up with none available. This is aggravated by the fact that the same builder may use more than one worker several times in a row, causing other builders to starve (clang-aarch64-sve-vls, for instance, was once idle for over 3 days).
The buildbot documentation (2.5.2.8. Prioritizing Builders) says that builds are started on the builder with the oldest pending request, but this does not seem to be working, possibly because the collapse requests feature ends up resetting the submittedAt time.
While there is a way to limit how many builds a single worker can run, there is currently no way to limit how many builds a builder can be running on a pool of workers.
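For context, the worker-level limit already exists in buildbot: a Worker can be given a max_builds argument. A sketch of what that looks like in a master.cfg (the worker names and passwords below are made up for illustration):

```python
# master.cfg fragment (illustrative names): buildbot's Worker accepts
# max_builds, which caps how many builds may run on that worker at once.
from buildbot.plugins import worker

workers = [
    worker.Worker("linaro-g3-1", "secret", max_builds=1),
    worker.Worker("linaro-g3-2", "secret", max_builds=1),
]
# There is, however, no analogous builder-side option capping how many
# of these workers a single builder may occupy simultaneously.
```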
As shown below, this PR aims to enable us to prevent Builder A from building on Worker 1 and Worker 2 at the same time and starving Builder B of resources.
Builder A -->|------------|
| | Worker 1 |
| |------------|
|
Builder B -->|------------|
| Worker 2 |
|------------|
This is accomplished by introducing the max_simultaneous_builds setting for builders. Its value specifies the maximum number of builds that a builder can be running simultaneously. The limit is enforced via the nextBuild hook of BuilderConfig. With it, it was possible to limit Linaro's SVE builders to one build at a time.
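As a rough illustration of the idea (this is a sketch, not the PR's actual code; make_limited_next_build and the attributes used below are assumptions), a nextBuild callback can decline to start a build once the builder's in-progress count reaches the cap:

```python
# Hypothetical sketch of enforcing max_simultaneous_builds via a
# nextBuild-style callback. Buildbot calls nextBuild(builder, requests)
# to pick which pending request to start next; returning None declines.
def make_limited_next_build(max_simultaneous_builds):
    def next_build(builder, requests):
        # builder.building is assumed to hold the builds in progress.
        if len(builder.building) >= max_simultaneous_builds:
            return None  # at the cap: decline, buildbot retries later
        # Otherwise start the oldest pending request.
        return min(requests, key=lambda r: r.submittedAt)
    return next_build
```

In a real master.cfg this would be passed as the nextBuild argument of BuilderConfig for each SVE builder.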
Also, some context that may be missing here (and worth adding to the PR message): as I understand it, there is a way to limit how many builds a single worker can run, but not how many builds a builder can be running on a pool of workers.
Builder A ---->|------------|
| Worker |
Builder B ---->|------------|
Here we are already able to make sure that A and B are not building on the worker at the same time.
Builder A -->|------------|
| | Worker 1 |
| |------------|
|
Builder B -->|------------|
| Worker 2 |
|------------|
This PR aims to enable us to prevent Builder A from building on Worker 1 and Worker 2 at the same time and starving Builder B of resources.
(if the diagrams make any sense, feel free to steal them :) )
Also, I thought I had sent you on a wild goose chase when I remembered there is the collapse requests feature. Then I remembered that we're already using it (it defaults to True) on the SVE builders, so this is not a possible solution.
This looks good for Linaro's intent, but let's see if @gkistanova can tell us if we're reinventing the wheel here.
Do you really want/need to limit the number of builds of a particular configuration running on the pool? I mean, if the build queue for the other builders is empty, there is nothing wrong with using the whole pool of workers for a single build configuration, right? In that case there is no reason to keep some of the workers in the pool idle and make the response time for that build configuration longer.
Maybe instead of limiting, we should reconsider how the build jobs get scheduled on workers, to make sure we actually distribute the available workers fairly?
Let me think about this.
Do you really want/need to limit the number of builds of a particular configuration running on the pool?
It's not really needed, but it's an easy way to keep a particular configuration from using too many workers at the same time.
I mean, if the build queue for the other builders is empty, there is nothing wrong with using the whole pool of workers for a single build configuration, right? In that case there is no reason to keep some of the workers in the pool idle and make the response time for that build configuration longer.
Right.
Maybe instead of limiting, we should reconsider how the build jobs get scheduled on workers, to make sure we actually distribute the available workers fairly?
This would indeed be the best solution. I just don't know how to implement it :)
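For what it's worth, buildbot exposes a master-wide prioritizeBuilders hook that sorts builders before workers are assigned, and sorting by oldest pending request is one way fairness could be approached. The sketch below is hypothetical; it assumes getOldestRequestTime() returns a plain sortable timestamp or None (in real buildbot master code it returns a Deferred that would need resolving first):

```python
# Hypothetical fair-scheduling sketch: order builders so that the one
# with the oldest pending request is offered a worker first.
def prioritize_builders(buildmaster, builders):
    def oldest_request(b):
        t = b.getOldestRequestTime()
        # Builders with no pending requests sort last.
        return t if t is not None else float("inf")
    return sorted(builders, key=oldest_request)
```

In a real master.cfg this would be registered as c['prioritizeBuilders'] = prioritize_builders.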
The buildbot scheduler has been fixed and should now handle this case properly. Could you check how it works for you, please? If everything works as expected, let's close this PR, if you don't mind.
I am afraid Linaro's SVE builders are still not getting a fair share of the available workers. For instance, right now clang-aarch64-sve-vla-2stage is using 3 workers, clang-aarch64-sve-vls-2stage is using 1, and no worker has been allocated to clang-aarch64-sve-vls in the past 12 hours. If we look at its builds, we can see that more than 22 hours passed between builds 115 and 116.