bazel-buildfarm
support 'exclusive' test execution
Any test tagged "exclusive" cannot run remotely on Buildfarm. Additionally, Buildfarm has no concept of "exclusive" test execution.
We propose the feature be made available as an execution property, since --experimental_allow_tags_propagation already forwards it.
It could be argued that any test requiring "exclusive" is already a bad test and needs to be fixed, and that is fair. Nonetheless, there is an opportunity to migrate tests from local to remote while preserving their execution requirements.
Currently, we already do this with GPU actions by limiting the execution width on the worker. That case is handled today by modifying the operation queue and matching to specific single-slot workers. However, workers can't dynamically run things exclusively, and not all GPU tests need to run exclusively. So we think we can get better hardware utilization by increasing execution width and allowing exclusive tests to prevent parallelism when needed.
Perhaps when a worker fetches an exclusive test, it stops fetching other tests so the other tests aren't stuck waiting on the worker.
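A minimal sketch of that idea, written in Python for brevity rather than Buildfarm's Java. ExecuteStageGate, try_start, and finish are hypothetical names, not Buildfarm APIs. The assumption is that an exclusive action claims every execution slot, and that once one is waiting the worker stops starting ordinary actions so the slots can drain:

```python
class ExecuteStageGate:
    """Hypothetical gate in front of a worker's execute stage."""

    def __init__(self, width):
        self.width = width      # configured execution width
        self.in_use = 0         # slots currently claimed
        self.draining = False   # an exclusive action is waiting for an idle stage

    def try_start(self, exclusive=False):
        """Return the number of slots claimed, or 0 if the action must wait."""
        if exclusive:
            if self.in_use == 0:
                self.in_use = self.width   # take every slot
                self.draining = False
                return self.width
            self.draining = True           # hold back new ordinary actions
            return 0
        if self.draining or self.in_use >= self.width:
            return 0
        self.in_use += 1
        return 1

    def finish(self, claimed):
        """Release the slots returned by a successful try_start."""
        self.in_use -= claimed


gate = ExecuteStageGate(width=4)
first = gate.try_start()                    # ordinary action takes one slot
blocked = gate.try_start(exclusive=True)    # exclusive must wait for it
gate.finish(first)
claimed = gate.try_start(exclusive=True)    # now claims all four slots
```

This only gates execution. Input fetch and result reporting are left concurrent, which is one of the granularity choices that would need to be decided.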
Thoughts?
I'm not totally sure what you mean here.
The "exclusive" tag on a test currently disables remote execution for its actions (consistently, insofar as I've seen), and only through exec properties does it get limited to the test action itself.
If you're suggesting that a characteristic of an action would be that it could run 'exclusively' as a bit, it would need something like a platform property to service it.
Beyond that, I would want to clarify the following behaviors based on your description. Feel free to correct any that don't apply, or should have additional controls:
- A worker can immediately upon dequeue (in match) determine action exclusivity
- A worker will wait for the entire concurrent set ahead of an exclusive action to be processed before beginning to process it (including input fetch)
- A worker will not match, fetch or execute an action until the exclusive action has exited the ReportResultStage
The "exclusive" tag on a test currently disables remote execution for its actions (consistently, insofar as I've seen), and only through exec properties does it get limited to the test action itself. If you're suggesting that a characteristic of an action would be that it could run 'exclusively' as a bit, it would need something like a platform property to service it.
That's my understanding. A migration for our tests could look like this:
# before
sh_test(
    name = "test",
    srcs = ["test.sh"],
    tags = ["exclusive"],
)
# after
sh_test(
    name = "test",
    srcs = ["test.sh"],
    exec_properties = {"exclusive": "true"},
)
The "exclusive" tag is removed to allow bazel to run the test remotely, and then added back as an exec_property so buildfarm can see it as a platform property and preserve the exclusiveness. I'd prefer if bazel just assumed the remote executor was capable of exclusive execution, since we already have "no-remote"... but I suppose that's a bazel implementation detail and not as relevant for buildfarm.
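On the receiving side, the check could be as small as the sketch below. It assumes the action's platform properties are available as name/value string pairs, which is how the Remote Execution API's Platform message carries them. The helper name is_exclusive and the property name "exclusive" are illustrative, taken from the BUILD example above rather than from any existing Buildfarm code:

```python
def is_exclusive(platform_properties):
    """Return True if an ("exclusive", "true") name/value pair is present.

    platform_properties: iterable of (name, value) string pairs.
    """
    return any(
        name == "exclusive" and value.strip().lower() == "true"
        for name, value in platform_properties
    )


# Properties as they might arrive for the migrated sh_test (illustrative values).
props = [("OSFamily", "linux"), ("exclusive", "true")]
wants_exclusive = is_exclusive(props)
```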
- A worker can immediately upon dequeue (in match) determine action exclusivity
Yes, that would be a good place for the worker to check and prepare accordingly.
In fact, a worker may want to analyze the exclusivity of an action and decide whether to accept/reject it (DequeueMatchEvaluator.shouldKeepOperation comes to mind, which may or may not be what we consider "matching"). But assuming it keeps the action, the worker would know upon dequeue.
- A worker will wait for the entire concurrent set ahead of an exclusive action to be processed before beginning to process it (including input fetch)
That's a good question with regard to granularity. My original intention is purely about the execution stage: we don't want two test actions running in parallel. I wouldn't mind an exclusive test fetching inputs while other tests are running, or the exclusive test running while other fetches and result reports are being performed.
- A worker will not match, fetch or execute an action until the exclusive action has exited the ReportResultStage
Similar to the above statement. I would be fine with either implementation, whether it's locking the execution stage only or locking additional stages in the pipeline as well. There may be good reasons for making other stages exclusive, and it may also be a resource inefficiency to do so. I'd be curious which strategy you think is easier to implement and which one you think is the least misleading.
Agreed, the bazel side changes would be awkward, but maybe not much more awkward than how "exclusive" tags work already...
- A worker can immediately upon dequeue (in match) determine action exclusivity
Yes, that would be a good place for the worker to check and prepare accordingly. In fact, a worker may want to analyze the exclusivity of an action and decide whether to accept/reject it (DequeueMatchEvaluator.shouldKeepOperation comes to mind, which may or may not be what we consider "matching"). But assuming it keeps the action, the worker would know upon dequeue.
There is a sense that some workers might be improper hosts for exclusive execution, and a worker might even advertise either no-exclusive or exclusive-min-cores to couple exclusive actions with action sizes. An additional policy might be a detrimental execution rate: locking up a 96 core machine after it executed 10000 actions in the last minute might make an exclusive action look like a bad idea.
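To make those policies concrete, here is a rough sketch of a keep/reject decision at dequeue. Everything in it is hypothetical: WorkerPolicy, its fields, and should_keep_exclusive are invented for illustration and do not mirror the real DequeueMatchEvaluator. It also reads exclusive-min-cores as "only accept exclusive actions that request at least this many cores", which is one possible interpretation:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerPolicy:
    """Hypothetical per-worker settings for exclusive actions."""
    allow_exclusive: bool = True            # False models "no-exclusive"
    exclusive_min_cores: int = 0            # smallest min-cores worth locking for
    max_recent_rate: Optional[int] = None   # actions/minute above which we refuse


def should_keep_exclusive(policy, exclusive, min_cores, recent_actions_per_minute):
    """Decide whether this worker keeps a dequeued action."""
    if not exclusive:
        return True      # ordinary actions are not affected by this policy
    if not policy.allow_exclusive:
        return False
    if min_cores < policy.exclusive_min_cores:
        return False     # too small to justify idling the machine
    if (policy.max_recent_rate is not None
            and recent_actions_per_minute > policy.max_recent_rate):
        return False     # worker is too productive to lock up right now
    return True


# A busy 96 core worker, using the figures from the comment above.
busy = WorkerPolicy(exclusive_min_cores=8, max_recent_rate=5000)
keep = should_keep_exclusive(busy, exclusive=True, min_cores=16,
                             recent_actions_per_minute=10000)
```

A rejected action would presumably go back to the queue for another worker, so the thresholds trade exclusive-test latency against throughput.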
- A worker will wait for the entire concurrent set ahead of an exclusive action to be processed before beginning to process it (including input fetch)
That's a good question with regard to granularity. My original intention is purely about the execution stage: we don't want two test actions running in parallel. I wouldn't mind an exclusive test fetching inputs while other tests are running, or the exclusive test running while other fetches and result reports are being performed.
Not waiting is fine by me - there's no policy that indicates an input-heavy action, and there is no reliable mechanism to determine what must be downloaded to execute that would make me think "exclusive" should mean as much.
- A worker will not match, fetch or execute an action until the exclusive action has exited the ReportResultStage
Similar to the above statement. I would be fine with either implementation, whether it's locking the execution stage only or locking additional stages in the pipeline as well. There may be good reasons for making other stages exclusive, and it may also be a resource inefficiency to do so. I'd be curious which strategy you think is easier to implement and which one you think is the least misleading.
This I'll caution against. There's a larger discussion of whether a static size for input fetch width is appropriate: fewer things in the input fetch stage is the guard against unnecessary waiting on a worker when another might be able to execute the action more expeditiously. One presumed nature of an exclusive test is that it might require more time as a resource, in addition to a lack of contention, which puts it in a similar boat to our experience with min-cores holding up actions in input fetch. There is essentially no timeout for an action to sit in input fetch with a polling worker while waiting for its cpu slots.
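For illustration only, a bounded wait for slots could look something like the sketch below. claim_or_give_up and its parameters are hypothetical, and what happens after giving up (for example, requeueing the action so another worker can take it) is an assumption, not current Buildfarm behavior:

```python
import time


def claim_or_give_up(try_claim, timeout_s, poll_s=0.01,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll try_claim() until it returns a nonzero slot count or time runs out.

    Returns the claimed slot count, or 0 if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while True:
        claimed = try_claim()
        if claimed:
            return claimed
        if clock() >= deadline:
            return 0      # caller decides what to do, e.g. requeue the action
        sleep(poll_s)


# A claim that succeeds on the third poll.
attempts = iter([0, 0, 2])
got = claim_or_give_up(lambda: next(attempts), timeout_s=1.0, poll_s=0.001)
```

Whether a timeout like this helps depends on how quickly another worker could pick the action up; it is a guard against indefinite waiting, not a scheduling policy.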