bazel Multiplex worker slots exhausted with dynamic execution and worker cancellation enabled

Description of the bug:

We occasionally see our builds stall: Bazel prints lots of [Sched] messages, but nothing is actually being built. For context, we’ve enabled the following relevant Bazel features:

- Persistent workers
- Multiplex workers
- Multiplex sandboxed workers
- Worker cancellation
- Remote build execution
- Dynamic execution

When we take a thread dump of the Bazel server and our workers (both internal workers and workers not written by us, like Javac), we see that Bazel is waiting on responses from the workers and the workers are waiting for requests—there’s essentially a deadlock. I’ve attached a few of these thread dumps for reference.

thread5.txt thread4.txt thread3.txt thread2.txt thread1.txt

Even more interesting is that we see threads like these in the Bazel server:

"AsyncFinish-Worker-12" #15004 [2595775] prio=5 os_prio=0 cpu=0.23ms elapsed=2412.09s tid=0x00007f5a5c03c960 nid=2595775 waiting on condition  [0x00007f56fb55c000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for  <0x00000001b6677180> (a java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park([email protected]/Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly([email protected]/Unknown Source)
    at java.util.concurrent.Semaphore.acquire([email protected]/Unknown Source)
    at com.google.devtools.build.lib.worker.WorkerMultiplexer.getResponse(WorkerMultiplexer.java:382)
    at com.google.devtools.build.lib.worker.WorkerProxy.getResponse(WorkerProxy.java:94)
    at com.google.devtools.build.lib.worker.WorkerSpawnRunner.lambda$finishWorkAsync$0(WorkerSpawnRunner.java:642)
    at com.google.devtools.build.lib.worker.WorkerSpawnRunner$$Lambda/0x000000005cc655b8.run(Unknown Source)
    at java.lang.Thread.runWith([email protected]/Unknown Source)
    at java.lang.Thread.run([email protected]/Unknown Source)

These are the reaper threads responsible for awaiting the responses to work requests cancelled because of a lost dynamic execution race. I should also note that we observe this issue far less frequently when worker cancellation is disabled.

After doing some digging, I think the issue lies in a race condition in Bazel’s code:

1. Bazel executes an action via a persistent worker with dynamic execution
2. One branch beats the other
3. Bazel cancels the losing branch
4. An InterruptedException is thrown in the WorkerMultiplexer and the request is removed from responseChecker, causing the semaphore for it to be lost
5. Bazel receives a response for the work request, but can’t release the semaphore for it because it’s been removed from responseChecker
   - We’re in fact seeing a lot of “Multiplexer for ... found no semaphore” messages in our build logs, which leads me to believe this is the case
6. Bazel spawns a reaper thread for the work request to collect the response
7. Bazel issues a cancellation request
8. Bazel awaits the response, which requires acquiring the semaphore for the request, but that semaphore will never be released because our worker has already sent its response (thereby releasing the old, dropped semaphore)
9. The reaper thread waits indefinitely and holds on to one of the limited instances for the multiplex worker
10. This race condition happens repeatedly, causing builds to slow down and eventually grind to a halt as worker slots are exhausted
    - We’re in fact seeing that all 8 of our allotted worker slots have been consumed when this error occurs
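
To make the sequence above concrete, here is a minimal single-threaded sketch of how I believe the semaphore gets lost. It is a simplified model, not Bazel's code: `LostSemaphoreSketch` and all of its methods are hypothetical, only the `responseChecker` name comes from WorkerMultiplexer, and the assumption that the cancellation request registers a fresh semaphore under the same request ID is my reading of the code. The reaper's wait uses a timeout here so the sketch terminates; the real wait has none.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the race described above; not Bazel's WorkerMultiplexer. */
public class LostSemaphoreSketch {
  // Stand-in for the multiplexer's map from request ID to "response arrived" semaphore.
  static final ConcurrentMap<Integer, Semaphore> responseChecker = new ConcurrentHashMap<>();

  /** Registers a request (assumption: a cancel request registers under the same ID). */
  static void putRequest(int id) {
    responseChecker.put(id, new Semaphore(0));
  }

  /** Models the response reader. Returns false when no semaphore is found. */
  static boolean onResponse(int id) {
    Semaphore semaphore = responseChecker.get(id);
    if (semaphore == null) {
      return false; // corresponds to the "found no semaphore" log message
    }
    semaphore.release();
    return true;
  }

  /** Models the interrupted wait: the entry is removed on the way out. */
  static void interruptedWait(int id) {
    responseChecker.remove(id);
  }

  /** Models the reaper's wait, with a timeout so that this sketch terminates. */
  static boolean reaperWait(int id, long timeoutMs) throws InterruptedException {
    Semaphore semaphore = responseChecker.get(id);
    return semaphore != null && semaphore.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
  }

  /** Replays the sequence and reports whether the response was ever observed. */
  static boolean simulate() throws InterruptedException {
    responseChecker.clear();
    putRequest(1);                      // work request is sent to the worker
    interruptedWait(1);                 // losing branch is interrupted, entry removed
    boolean delivered = onResponse(1);  // response arrives, but nobody is registered
    putRequest(1);                      // reaper's cancel request registers a new semaphore
    boolean reaped = reaperWait(1, 200); // nothing will ever release this semaphore
    return delivered || reaped;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("reaper got response: " + simulate());
  }
}
```

Running `main` prints `reaper got response: false`. Without the timeout, the last wait would park forever, which matches the AsyncFinish-Worker threads in the dump above.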

I’m not very familiar with this code, but I think the simplest solution is to have WorkerMultiplexer remove the work request from responseChecker only if no InterruptedException (or CancellationException?) was thrown.
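
As a rough illustration of that suggestion, the sketch below keeps the map entry when the wait is interrupted, so a late response still finds its semaphore and a later waiter can collect it. Everything here is hypothetical (class name, method names, the timeout, and the use of `putIfAbsent` so a second registration does not replace the existing semaphore); it models the idea only and is not necessarily how an actual fix in Bazel would look.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the suggested fix; not Bazel's WorkerMultiplexer. */
public class KeepSemaphoreOnInterruptSketch {
  static final ConcurrentMap<Integer, Semaphore> responseChecker = new ConcurrentHashMap<>();

  /** Registers a request without replacing a semaphore that is already there. */
  static void putRequest(int id) {
    responseChecker.putIfAbsent(id, new Semaphore(0));
  }

  /** Models the response reader: releases the semaphore if one is registered. */
  static void onResponse(int id) {
    Semaphore semaphore = responseChecker.get(id);
    if (semaphore != null) {
      semaphore.release();
    }
  }

  /** Waits for the response; removes the entry only if the wait was not interrupted. */
  static boolean getResponse(int id, long timeoutMs) throws InterruptedException {
    Semaphore semaphore = responseChecker.get(id);
    if (semaphore == null) {
      return false;
    }
    boolean interrupted = false;
    try {
      return semaphore.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      interrupted = true; // keep the entry so a later waiter can still collect
      throw e;
    } finally {
      if (!interrupted) {
        responseChecker.remove(id);
      }
    }
  }

  /** Replays the race with the entry kept on interrupt. */
  static boolean simulate() throws InterruptedException {
    responseChecker.clear();
    putRequest(1);
    Thread.currentThread().interrupt(); // pretend the losing branch was cancelled
    try {
      getResponse(1, 200);
    } catch (InterruptedException expected) {
      // The entry for request 1 is still registered at this point.
    }
    onResponse(1);              // the late response now finds its semaphore
    return getResponse(1, 200); // the reaper collects it instead of hanging
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("reaper got response: " + simulate());
  }
}
```

Running `main` prints `reaper got response: true`. The open question with this approach is who cleans up the entry if no later waiter ever comes back for it.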

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I don’t have a minimum reproducibility case now, but I’m happy to write one if needed. I should note that we’re able to reproduce the issue much more consistently by having the reaper thread wait after it’s spawned:

diff --git a/src/main/java/com/google/devtools/build/lib/worker/WorkerSpawnRunner.java b/src/main/java/com/google/devtools/build/lib/worker/WorkerSpawnRunner.java
index 7ef16da893b..378ac308633 100644
--- a/src/main/java/com/google/devtools/build/lib/worker/WorkerSpawnRunner.java
+++ b/src/main/java/com/google/devtools/build/lib/worker/WorkerSpawnRunner.java
@@ -635,6 +635,10 @@ final class WorkerSpawnRunner implements SpawnRunner {
             () -> {
               resourceManager.acquireResourceOwnership();
 
+              try {
+                Thread.sleep(10000);
+              } catch (InterruptedException exception) {}
+
               Worker w = worker;
               try {
                 if (canCancel) {

Which operating system are you running Bazel on?

Ubuntu 24.04.2 LTS

What is the output of `bazel info release`?

release 8.2.1

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

No response

What's the output of `git remote get-url origin; git rev-parse HEAD` ?

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

We found issue #25232, but confirmed that setting --noremote_cache_async doesn't fix the issue.

Any other information, logs, or outputs that you want to share?

Nope.

Jun 13 '25 15:06 jadenPete

@bigelephant29 What do you think of the suggested fix?

Even without dynamic execution, I frequently see the warning about a missing semaphore when I interrupt a build that uses multiplex workers.

Jun 13 '25 15:06 fmeum

Thanks for the detailed explanation for the steps to reproduce.

Would you please give it a try and see if 64d89920d655c3408616ddd6b873f2682a744117 fixes the issue?

It would be super helpful if you can write a minimum reproducibility case as you mentioned. 😃

Jun 16 '25 09:06 bigelephant29

@bazel-io fork 8.3.0

Jun 16 '25 12:06 fmeum

Thank you for putting up a fix! I'll test it now. I'll also work on writing a minimum reproducibility case.

Jun 16 '25 21:06 jadenPete

Alrighty, I've assembled a minimum reproducibility case and have confirmed that @bigelephant29's commit fixes the issue! https://github.com/lucidsoftware/worker-slots-exhausted-bug-repro

Thank you all for your help.

Jun 17 '25 15:06 jadenPete

Thanks for the confirmation! I'll initiate a fix from our internal codebase.

Jun 18 '25 11:06 bigelephant29

FYI, I tried to apply the change internally but tests were failing. Fitting the new behavior into the existing tests might not be a trivial change.

It'll take me a while to ship this out. I don't think I can make it in the 8.3.0 release. Please expect a slight delay here 😃

Jun 18 '25 13:06 bigelephant29

This has been merged into 8.4.0. #26478

Please let us know if you're still having issues with the patch.

Jul 08 '25 07:07 bigelephant29

A fix for this issue has been included in Bazel 8.4.0 RC1. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=8.4.0rc1. Thanks!

Aug 21 '25 19:08 iancha1992
