Multiplex worker slots exhausted with dynamic execution and worker cancellation enabled

Open jadenPete opened this issue 6 months ago • 3 comments

Description of the bug:

We’re occasionally seeing our builds stall: Bazel prints lots of [Sched] messages, but nothing is actually being built. For context, we’ve enabled the following relevant Bazel features:

  • Persistent workers
  • Multiplex workers
  • Multiplex sandboxed workers
  • Worker cancellation
  • Remote build execution
  • Dynamic execution

When we take a thread dump of the Bazel server and our workers (both internal workers and workers not written by us, like Javac), we see that Bazel is waiting on responses from the workers and the workers are waiting for requests—there’s essentially a deadlock. I’ve attached a few of these thread dumps for reference.

thread5.txt thread4.txt thread3.txt thread2.txt thread1.txt

Even more interesting is that we see threads like these in the Bazel server:

"AsyncFinish-Worker-12" #15004 [2595775] prio=5 os_prio=0 cpu=0.23ms elapsed=2412.09s tid=0x00007f5a5c03c960 nid=2595775 waiting on condition  [0x00007f56fb55c000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for  <0x00000001b6677180> (a java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park([email protected]/Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly([email protected]/Unknown Source)
    at java.util.concurrent.Semaphore.acquire([email protected]/Unknown Source)
    at com.google.devtools.build.lib.worker.WorkerMultiplexer.getResponse(WorkerMultiplexer.java:382)
    at com.google.devtools.build.lib.worker.WorkerProxy.getResponse(WorkerProxy.java:94)
    at com.google.devtools.build.lib.worker.WorkerSpawnRunner.lambda$finishWorkAsync$0(WorkerSpawnRunner.java:642)
    at com.google.devtools.build.lib.worker.WorkerSpawnRunner$$Lambda/0x000000005cc655b8.run(Unknown Source)
    at java.lang.Thread.runWith([email protected]/Unknown Source)
    at java.lang.Thread.run([email protected]/Unknown Source)

These are the reaper threads responsible for awaiting the responses to work requests cancelled because of a lost dynamic execution race. I should also note that we observe this issue far less frequently when worker cancellation is disabled.

After doing some digging, I think the issue lies in a race condition in Bazel’s code:

  1. Bazel executes an action via persistent worker with dynamic execution
  2. One branch beats the other
  3. Bazel cancels the losing branch
  4. An InterruptedException is thrown in the WorkerMultiplexer and the request is removed from responseChecker, causing the semaphore for it to be lost
  5. Bazel receives a response for the work request, but can’t release the semaphore for it because it’s been removed from responseChecker
    1. We’re in fact seeing a lot of “Multiplexer for ... found no semaphore” messages in our build logs, which leads me to believe this is the case
  6. Bazel spawns a reaper thread for the work request to collect the response
  7. Bazel issues a cancellation request
  8. Bazel awaits the response, which necessitates acquiring the semaphore for the request, but that semaphore will never be released because our worker has already sent its response (the one that was meant for the old, dropped semaphore)
  9. The reaper thread waits indefinitely and holds on to one of the limited instances for the multiplex worker
  10. This race condition happens repeatedly, causing builds to slow down and eventually grind to a halt as worker slots are exhausted
    1. We’re in fact seeing that all 8 of our allotted worker slots have been consumed when this error occurs
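The sequence above can be sketched as a small, deterministic model. This is not Bazel's code: the class and method names are hypothetical, only the responseChecker map name comes from the issue text, and the sketch assumes that the cancellation request registers a fresh semaphore under the same request id (which the steps imply but do not spell out). A timed wait stands in for the reaper's indefinite wait so that the example terminates.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified model of the multiplexer bookkeeping; not Bazel's actual code. */
final class LostSemaphoreSketch {
  // Request id -> semaphore that is released once the response for that id arrives.
  private final ConcurrentMap<Integer, Semaphore> responseChecker = new ConcurrentHashMap<>();

  /** Registers a semaphore for a request (or, by assumption, a cancellation) being sent. */
  public void putRequest(int requestId) {
    responseChecker.put(requestId, new Semaphore(0));
  }

  /** Step 4: the interrupted waiter removes the entry, losing its semaphore. */
  public void dropOnInterrupt(int requestId) {
    responseChecker.remove(requestId);
  }

  /** Step 5: a response arrives. Returns false when no semaphore is registered. */
  public boolean onResponse(int requestId) {
    Semaphore semaphore = responseChecker.get(requestId);
    if (semaphore == null) {
      return false; // the "found no semaphore" situation
    }
    semaphore.release();
    return true;
  }

  /** Step 8: waits for the response, giving up after timeoutMillis in this sketch. */
  public boolean awaitResponse(int requestId, long timeoutMillis) {
    Semaphore semaphore = responseChecker.get(requestId);
    if (semaphore == null) {
      return false;
    }
    try {
      return semaphore.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Replays steps 1 to 8 in order; returns whether the reaper ever sees a response. */
  public static boolean reaperSeesResponse() {
    LostSemaphoreSketch multiplexer = new LostSemaphoreSketch();
    int requestId = 1;
    multiplexer.putRequest(requestId);      // step 1: work request is sent
    multiplexer.dropOnInterrupt(requestId); // step 4: losing branch is interrupted
    multiplexer.onResponse(requestId);      // step 5: response finds no semaphore
    multiplexer.putRequest(requestId);      // step 7: cancellation request (assumed to re-register)
    return multiplexer.awaitResponse(requestId, 100); // step 8: nothing will release this
  }

  public static void main(String[] args) {
    System.out.println("reaper saw a response: " + reaperSeesResponse());
  }
}
```

In this model the reaper's wait can only end through the timeout. With an untimed acquire, as in the thread dump above, it would park forever while still holding its worker slot.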

I’m not very familiar with this code, but I think the simplest solution is to have WorkerMultiplexer only remove the work request from responseChecker if an InterruptedException (or CancellationException?) wasn’t thrown.
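One possible reading of that suggestion, as a minimal sketch rather than the actual patch: the waiter removes its entry only when the wait ends without interruption, so a late response can still release the semaphore and a later waiter can collect it. All names other than responseChecker are hypothetical, the timeout exists only so the example terminates, and the sketch deliberately leaves out how a cancellation request interacts with the retained entry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of "keep the entry on interruption"; not the fix that was merged. */
final class KeepOnInterruptSketch {
  // Request id -> semaphore that is released once the response for that id arrives.
  private final ConcurrentMap<Integer, Semaphore> responseChecker = new ConcurrentHashMap<>();

  public void putRequest(int requestId) {
    responseChecker.put(requestId, new Semaphore(0));
  }

  /** Releases the semaphore for a response; returns false if none is registered. */
  public boolean onResponse(int requestId) {
    Semaphore semaphore = responseChecker.get(requestId);
    if (semaphore == null) {
      return false;
    }
    semaphore.release();
    return true;
  }

  /** Waits for the response; the entry is removed only if the wait was not interrupted. */
  public boolean getResponse(int requestId, long timeoutMillis) {
    Semaphore semaphore = responseChecker.get(requestId);
    if (semaphore == null) {
      return false;
    }
    try {
      boolean acquired = semaphore.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
      responseChecker.remove(requestId); // no interruption: clean up as before
      return acquired;
    } catch (InterruptedException e) {
      // Interrupted: keep the entry so a late response can still be matched.
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Interrupts the first waiter, delivers a late response, then collects it. */
  public static boolean lateResponseIsCollected() {
    KeepOnInterruptSketch multiplexer = new KeepOnInterruptSketch();
    multiplexer.putRequest(1);
    Thread.currentThread().interrupt(); // simulate cancelling the losing branch
    multiplexer.getResponse(1, 100);    // throws internally right away, entry is kept
    Thread.interrupted();               // clear the flag before continuing
    boolean delivered = multiplexer.onResponse(1);       // semaphore is still registered
    boolean collected = multiplexer.getResponse(1, 100); // permit is already available
    return delivered && collected;
  }

  public static void main(String[] args) {
    System.out.println("late response collected: " + lateResponseIsCollected());
  }
}
```

The trade-off in this reading is that entries for interrupted requests stay in the map until something collects them, so whatever reaps the late response becomes responsible for the cleanup.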

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I don’t have a minimum reproducibility case now, but I’m happy to write one if needed. I should note that we’re able to reproduce the issue much more consistently by having the reaper thread wait after it’s spawned:

diff --git a/src/main/java/com/google/devtools/build/lib/worker/WorkerSpawnRunner.java b/src/main/java/com/google/devtools/build/lib/worker/WorkerSpawnRunner.java
index 7ef16da893b..378ac308633 100644
--- a/src/main/java/com/google/devtools/build/lib/worker/WorkerSpawnRunner.java
+++ b/src/main/java/com/google/devtools/build/lib/worker/WorkerSpawnRunner.java
@@ -635,6 +635,10 @@ final class WorkerSpawnRunner implements SpawnRunner {
             () -> {
               resourceManager.acquireResourceOwnership();
 
+              try {
+                Thread.sleep(10000);
+              } catch (InterruptedException exception) {}
+
               Worker w = worker;
               try {
                 if (canCancel) {

Which operating system are you running Bazel on?

Ubuntu 24.04.2 LTS

What is the output of bazel info release?

release 8.2.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?


If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

We found issue #25232, but confirmed that setting --noremote_cache_async doesn't fix the issue.

Any other information, logs, or outputs that you want to share?

Nope.

jadenPete avatar Jun 13 '25 15:06 jadenPete

@bigelephant29 What do you think of the suggested fix?

Even without dynamic execution, I frequently see the warning about a missing semaphore when I interrupt a build that uses multiplex workers.

fmeum avatar Jun 13 '25 15:06 fmeum

Thanks for the detailed explanation for the steps to reproduce.

Would you please give it a try and see if 64d89920d655c3408616ddd6b873f2682a744117 fixes the issue?

It would be super helpful if you can write a minimum reproducibility case as you mentioned. 😃

bigelephant29 avatar Jun 16 '25 09:06 bigelephant29

@bazel-io fork 8.3.0

fmeum avatar Jun 16 '25 12:06 fmeum

Thank you for putting up a fix! I'll test it now. I'll also work on writing a minimum reproducibility case.

jadenPete avatar Jun 16 '25 21:06 jadenPete

Alrighty, I've assembled a minimum reproducibility case and have confirmed that @bigelephant29's commit fixes the issue! https://github.com/lucidsoftware/worker-slots-exhausted-bug-repro

Thank you all for your help.

jadenPete avatar Jun 17 '25 15:06 jadenPete

Thanks for the confirmation! I'll initiate a fix from our internal codebase.

bigelephant29 avatar Jun 18 '25 11:06 bigelephant29

FYI, I tried to apply the change internally but tests were failing. Fitting the new behavior into the existing tests might not be trivial.

It'll take me a while to ship this out. I don't think I can make it in the 8.3.0 release. Please expect a slight delay here 😃

bigelephant29 avatar Jun 18 '25 13:06 bigelephant29

This has been merged into 8.4.0. #26478

Please let us know if you're still having issues with the patch.

bigelephant29 avatar Jul 08 '25 07:07 bigelephant29

A fix for this issue has been included in Bazel 8.4.0 RC1. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=8.4.0rc1. Thanks!

iancha1992 avatar Aug 21 '25 19:08 iancha1992