llvm-zorg

Accept new requests on SVE builders only if idle

Open luporl opened this issue 1 year ago • 7 comments

Some of Linaro's SVE builders are suffering from starvation, because there is currently no way to limit how many builds a given builder can start simultaneously. As there are 4 SVE builders that may use any of the 4 G3 workers, sometimes some builders use multiple workers while others end up with none available. This is aggravated by the fact that the same builders may use more than one worker several times in a row, causing other builders to starve (clang-aarch64-sve-vls, for instance, was once idle for over 3 days).

The Buildbot documentation (2.5.2.8. Prioritizing Builders) says that builds are started in the builder with the oldest pending request, but this does not seem to be working, possibly because the collapse requests feature ends up resetting the submittedAt time.

While there is a way to limit how many builds a single worker can run, there is currently no way to limit how many builds a builder can be running on a pool of workers.

As shown below, this PR aims to enable us to prevent Builder A from building on Worker 1 and Worker 2 at the same time, which would starve Builder B of resources.

Builder A -->|------------|
           | | Worker 1   |
           | |------------|
           |
Builder B -->|------------|
             | Worker 2   |
             |------------|

This is accomplished by introducing the max_simultaneous_builds setting for builders. Its value specifies the maximum number of builds that a builder can have simultaneously. This limit is enforced using nextBuild, from BuilderConfig. With it, it was possible to limit Linaro's SVE builders to only one build at a time.
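A minimal sketch of how such a limit could be enforced through nextBuild, not the PR's actual code. It assumes that the nextBuild callable receives the builder and its list of pending requests and returns the request to start (or None to start nothing), and that the builder exposes its running builds as `building`. The helper name `limit_simultaneous_builds` and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def limit_simultaneous_builds(max_simultaneous_builds):
    """Build a nextBuild-style callable that caps a builder's running builds."""

    def next_build(builder, requests):
        # Assumption: builder.building lists the builds currently running.
        if len(builder.building) >= max_simultaneous_builds:
            return None  # start nothing now; the requests stay pending
        # Otherwise hand back the first pending request, if there is one.
        return requests[0] if requests else None

    return next_build


# Stand-ins for Buildbot's Builder and BuildRequest objects (illustrative only).
idle_builder = SimpleNamespace(building=[])
busy_builder = SimpleNamespace(building=["build on Worker 1"])
pending = ["request-1", "request-2"]

one_at_a_time = limit_simultaneous_builds(1)
print(one_at_a_time(idle_builder, pending))  # request-1
print(one_at_a_time(busy_builder, pending))  # None
```

With a limit of 1, a builder that already has a build on Worker 1 is not handed a second request, so Worker 2 stays available for another builder.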

luporl avatar Dec 07 '23 19:12 luporl

Also, some context that may be missing here (and is worth adding to the PR message): as I understand it, there is a way to limit how many builds a single worker can run, but not how many builds a builder can be running on a pool of workers.

Builder A ---->|------------|
               |   Worker   |
Builder B ---->|------------|

Here we are already able to make sure that A and B are not building on the worker at the same time.

Builder A -->|------------|
           | | Worker 1   |   
           | |------------|
           |   
Builder B -->|------------|
             | Worker 2   |   
             |------------|

This PR aims to enable us to prevent Builder A from building on Worker 1 and Worker 2 at the same time, which would starve Builder B of resources.

(if the diagrams make any sense, feel free to steal them :) )

DavidSpickett avatar Dec 08 '23 10:12 DavidSpickett

Also, I thought I had sent you on a wild goose chase when I remembered that there is the collapse requests feature. Then I remembered that we're already using it (it defaults to True) on the SVE builders. So this is not a possible solution.

DavidSpickett avatar Dec 14 '23 17:12 DavidSpickett

This looks good for Linaro's intent, but let's see if @gkistanova can tell us if we're reinventing the wheel here.

DavidSpickett avatar Dec 14 '23 17:12 DavidSpickett

Do you really want/need to limit the number of builds of a particular configuration running on the pool? I mean, if the build queue for the other builders is empty, there is nothing wrong with using the whole pool of workers for a single build configuration, right? In that case there is no reason to keep some of the workers in the pool idle and make the response time for that build configuration longer.

Maybe instead of limiting, we should reconsider how the build jobs get scheduled on workers, to make sure we actually distribute the available workers fairly?

Let me think about this.

gkistanova avatar Dec 18 '23 03:12 gkistanova

> Do you really want/need to limit the number of builds of a particular configuration running on the pool?

It's not really needed, but it's an easy way to prevent a particular configuration from using too many workers at the same time.

> I mean, if the build queue for the other builders is empty, there is nothing wrong with using the whole pool of workers for a single build configuration, right? In that case there is no reason to keep some of the workers in the pool idle and make the response time for that build configuration longer.

Right.

> Maybe instead of limiting, we should reconsider how the build jobs get scheduled on workers, to make sure we actually distribute the available workers fairly?

This would indeed be the best solution. I just don't know how to implement it :)

luporl avatar Jan 03 '24 19:01 luporl

The Buildbot scheduler has been fixed. It should now handle this case properly. Could you check how it works for you, please? If everything works as expected, let's close this PR, if you don't mind.

gkistanova avatar Jun 18 '24 21:06 gkistanova

I am afraid Linaro's SVE builders are still not getting a fair share of the available workers. For instance, right now clang-aarch64-sve-vla-2stage is using 3 workers, clang-aarch64-sve-vls-2stage is using 1, and no worker has been allocated to clang-aarch64-sve-vls in the past 12 hours. If we look at its builds, we can see that more than 22 hours passed between builds 115 and 116.

luporl avatar Jun 19 '24 18:06 luporl