bazel-buildfarm

[Question] What Load Balancing Strategy is Used by Buildfarm?

Open lixin-wei opened this issue 3 years ago • 5 comments

I'm wondering what load balancing strategy buildfarm uses.

How does buildfarm distribute actions when there are far more pending actions than the sum of execute_stage_width across workers?

lixin-wei avatar Oct 26 '21 12:10 lixin-wei

For shard instances, buildfarm implements distributed queues in Redis for its arrival and ready-to-run operations (execute requests for actions).

Workers pull operations from these queues as long as they have capacity in their pipelines (input fetch, execute, and report result stages).
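To make the pull model concrete, here is a minimal in-memory sketch of a worker that takes operations only while its pipeline has room. It is an illustration under stated assumptions, not buildfarm's code: the `Worker` class, its methods, and the `input_fetch_width` and `execute_width` parameters are hypothetical names, the report result stage is omitted, and a plain `deque` stands in for the Redis-backed queue.

```python
from collections import deque


class Worker:
    """Hypothetical worker with a two-stage pipeline (illustration only)."""

    def __init__(self, name, input_fetch_width=1, execute_width=2):
        self.name = name
        self.input_fetch_width = input_fetch_width
        self.execute_width = execute_width
        self.input_fetch = []  # operations whose inputs are being fetched
        self.executing = []    # operations currently running

    def poll(self, ready_queue):
        """Take operations only while the input fetch stage has a free slot."""
        taken = 0
        while ready_queue and len(self.input_fetch) < self.input_fetch_width:
            self.input_fetch.append(ready_queue.popleft())
            taken += 1
        return taken

    def advance(self):
        """Move fetched operations into execute while that stage has room."""
        while self.input_fetch and len(self.executing) < self.execute_width:
            self.executing.append(self.input_fetch.pop(0))


def demo_pull():
    ready = deque(f"op{i}" for i in range(6))
    worker = Worker("w1", input_fetch_width=1, execute_width=2)
    # No operation ever finishes in this demo, so the pipeline fills up
    # and the worker stops pulling from the queue.
    for _ in range(5):
        worker.poll(ready)
        worker.advance()
    return worker.executing, worker.input_fetch, list(ready)
```

In this sketch the worker ends up holding two executing operations and one in input fetch, and the other three stay in the shared queue where any worker with free capacity could take them.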

When max_queue_depth is reached for the ready-to-run queue, transforms into that queue from the arrival queue are halted.

When max_prequeue_depth is reached for the arrival queue, execute requests are rejected with RESOURCE_EXHAUSTED.
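The two depth limits can be sketched as follows. This is a simplified in-memory model rather than buildfarm's Redis implementation: `ShardQueues`, `ResourceExhausted`, and the method names are hypothetical, and the exact boundary comparison (`>=` here) is an assumption.

```python
from collections import deque


class ResourceExhausted(Exception):
    """Stand-in for a RESOURCE_EXHAUSTED rejection (hypothetical type)."""


class ShardQueues:
    """In-memory model of an arrival queue feeding a ready-to-run queue."""

    def __init__(self, max_prequeue_depth, max_queue_depth):
        self.max_prequeue_depth = max_prequeue_depth
        self.max_queue_depth = max_queue_depth
        self.arrival = deque()
        self.ready = deque()

    def submit(self, request):
        """Reject new execute requests once the arrival queue is full."""
        if len(self.arrival) >= self.max_prequeue_depth:
            raise ResourceExhausted("arrival queue is full")
        self.arrival.append(request)

    def transform(self):
        """Move requests to the ready queue until it reaches its limit."""
        moved = 0
        while self.arrival and len(self.ready) < self.max_queue_depth:
            self.ready.append(self.arrival.popleft())
            moved += 1
        return moved

    def take(self):
        """Called by a worker with free capacity; None if nothing is ready."""
        return self.ready.popleft() if self.ready else None


def demo_queues():
    q = ShardQueues(max_prequeue_depth=3, max_queue_depth=2)
    for name in ("a", "b", "c"):
        q.submit(name)
    moved_first = q.transform()    # ready is capped at 2, "c" stays behind
    q.submit("d")
    q.submit("e")
    try:
        q.submit("f")              # arrival already holds c, d, e
        rejected = False
    except ResourceExhausted:
        rejected = True
    moved_blocked = q.transform()  # halted: the ready queue is still full
    taken = q.take()               # a worker frees one ready slot
    moved_after_take = q.transform()
    return (moved_first, rejected, moved_blocked, taken,
            moved_after_take, list(q.arrival), list(q.ready))
```

The point of the sketch is that backpressure propagates from workers to clients: when workers stop taking, the ready queue fills, then the arrival queue fills, and only then are clients rejected.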

Memory instances use a similar queue, but are not distributed or rate limited. We do not recommend use of the memory instance in high performance-requirement situations.

werkt avatar Oct 27 '21 03:10 werkt

@werkt Got it! Thanks! What are the default values of max_queue_depth and max_prequeue_depth? I only found max_queue_depth in RedisShardBackplaneConfig. How do I configure the memory instance's queue size?

lixin-wei avatar Oct 27 '21 04:10 lixin-wei

One more question. Will a worker pull operations from other workers if the server's queue is empty?

We have one operation that takes a long time. If it blocks one worker, all operations queued behind it on that worker are blocked too. Can other workers help this worker if they are idle and there are no operations left in the server's queue?

lixin-wei avatar Oct 27 '21 05:10 lixin-wei

As noted above, memory instances are not rate limited; there are no controls on their queue.

Workers will not pull actions from other workers. Their work is considered allocated to them until their lease expires, and the lease is updated regularly while they remain in the known state of input fetch, waiting for space in the execute stage. There is discussion around making the input fetch stage dynamic based on feedback from the execute stage. If no progress is being made, we may trigger the workers to reduce the size of the input fetch stage, which could return work to the queue and prevent the occupation of the single input fetch slot, and then return the input fetch width to a larger size when the executions cannot be saturated. More investigation needs to take place before that can go into effect.
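The lease behavior can be illustrated with a small sketch. It assumes that an operation whose lease has expired becomes eligible to return to the queue, which the comment above implies but does not spell out. `LeaseTable` and its methods are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque


class LeaseTable:
    """Tracks which worker holds each operation and when its lease ends."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.leases = {}  # operation -> (worker, expires_at)

    def claim(self, operation, worker, now):
        self.leases[operation] = (worker, now + self.lease_seconds)

    def renew(self, operation, worker, now):
        """Extend the lease; only the current holder may renew it."""
        holder, _ = self.leases[operation]
        if holder != worker:
            raise ValueError(f"{operation} is leased to {holder}")
        self.leases[operation] = (worker, now + self.lease_seconds)

    def requeue_expired(self, queue, now):
        """Return operations with lapsed leases to the shared queue."""
        expired = [op for op, (_, expires_at) in self.leases.items()
                   if expires_at <= now]
        for op in expired:
            del self.leases[op]
            queue.append(op)
        return expired


def demo_leases():
    queue = deque()
    table = LeaseTable(lease_seconds=10)
    table.claim("slow-op", "worker-1", now=0)
    table.claim("other-op", "worker-2", now=0)
    table.renew("slow-op", "worker-1", now=5)      # now expires at 15
    first = table.requeue_expired(queue, now=12)   # worker-2 never renewed
    table.renew("slow-op", "worker-1", now=14)     # now expires at 24
    second = table.requeue_expired(queue, now=20)  # slow-op is still held
    return first, second, list(queue), sorted(table.leases)
```

In this model a slow operation stays with its worker for as long as that worker keeps renewing, which is why idle workers cannot take it over; only work whose holder stops renewing comes back to the queue.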

werkt avatar Oct 29 '21 03:10 werkt

I see, thank you for your explanation! I think rebalancing between workers is necessary. It will help the total throughput.

lixin-wei avatar Oct 29 '21 07:10 lixin-wei