bazel icon indicating copy to clipboard operation
bazel copied to clipboard

deadlock in worker resource management

Open benjaminp opened this issue 1 year ago • 5 comments

I was running a build with 463f80979127b39bb41c9afde5b5863914a7609d, and it hung. The stacktraces reveal what looks like a hang checking out a worker:

"skyframe-evaluator-execution-6" #8044 [81650] daemon prio=5 os_prio=0 cpu=54.16ms elapsed=273.90s tid=0x00007fa40800e470 nid=81650 waiting for monitor entry  [0x00007fa4600fc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.google.devtools.build.lib.actions.ResourceManager.release(ResourceManager.java:537)
        - waiting to lock <0x00000002904115a0> (a com.google.devtools.build.lib.actions.ResourceManager)
        at com.google.devtools.build.lib.actions.ResourceManager.releaseResources(ResourceManager.java:485)
        at com.google.devtools.build.lib.actions.ResourceManager$ResourceHandle.close(ResourceManager.java:114)
        at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:114)
        at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:471)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:118)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
        at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:251)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1169)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1082)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:166)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:95)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:567)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:900)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:190)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:468)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:426)
        at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec([email protected]/ForkJoinTask.java:1423)
        at java.util.concurrent.ForkJoinTask.doExec([email protected]/ForkJoinTask.java:387)
        at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec([email protected]/ForkJoinPool.java:1312)
        at java.util.concurrent.ForkJoinPool.scan([email protected]/ForkJoinPool.java:1843)
        at java.util.concurrent.ForkJoinPool.runWorker([email protected]/ForkJoinPool.java:1808)
        at java.util.concurrent.ForkJoinWorkerThread.run([email protected]/ForkJoinWorkerThread.java:188)

"skyframe-evaluator-execution-3" #8041 [81644] daemon prio=5 os_prio=0 cpu=40.43ms elapsed=273.91s tid=0x00007fa380025360 nid=81644 waiting on condition  [0x00007fa4626fc000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00000002a20010d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:371)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block([email protected]/AbstractQueuedSynchronizer.java:519)
        at java.util.concurrent.ForkJoinPool.compensatedBlock([email protected]/ForkJoinPool.java:3740)
        at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3723)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/AbstractQueuedSynchronizer.java:1707)
        at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:594)
        at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:355)
        at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:270)
        at com.google.devtools.build.lib.worker.SimpleWorkerPool.borrowObject(SimpleWorkerPool.java:87)
        at com.google.devtools.build.lib.worker.WorkerPoolImplLegacy.borrowObject(WorkerPoolImplLegacy.java:236)
        at com.google.devtools.build.lib.actions.ResourceManager.incrementResources(ResourceManager.java:442)
        - locked <0x00000002904115a0> (a com.google.devtools.build.lib.actions.ResourceManager)
        at com.google.devtools.build.lib.actions.ResourceManager.acquire(ResourceManager.java:508)
        - locked <0x00000002904115a0> (a com.google.devtools.build.lib.actions.ResourceManager)
        at com.google.devtools.build.lib.actions.ResourceManager.acquireResources(ResourceManager.java:385)
        at com.google.devtools.build.lib.worker.WorkerSpawnRunner.execInWorker(WorkerSpawnRunner.java:413)
        at com.google.devtools.build.lib.worker.WorkerSpawnRunner.exec(WorkerSpawnRunner.java:210)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:118)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
        at com.google.devtools.build.lib.rules.java.JavaCompileAction.execute(JavaCompileAction.java:413)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1169)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1082)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:166)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:95)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:567)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:900)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:190)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:468)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:426)
        at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec([email protected]/ForkJoinTask.java:1423)
        at java.util.concurrent.ForkJoinTask.doExec([email protected]/ForkJoinTask.java:387)
        at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec([email protected]/ForkJoinPool.java:1312)
        at java.util.concurrent.ForkJoinPool.scan([email protected]/ForkJoinPool.java:1843)
        at java.util.concurrent.ForkJoinPool.runWorker([email protected]/ForkJoinPool.java:1808)
        at java.util.concurrent.ForkJoinWorkerThread.run([email protected]/ForkJoinWorkerThread.java:188)

benjaminp avatar May 29 '24 01:05 benjaminp

cc @wilwell @zhengwei143

meisterT avatar May 31 '24 07:05 meisterT

It looks like that one thread is waiting for the idle worker in borrowObject and meanwhile another can't return the worker in synchronized release .

By code logic it should not happen, because we should check that we have enough idle workers before grabbing the new one.

I saw that recenlty @zhengwei143 rewrote this code with new method hasAvailableQuota in cl/627316722 (commit 5742e69c6f8d5705c645b94ca8fe745370ef6c92). By the first view it doesn't seem incorrect with race conditions, but the code in ResourceManager and Workers is tricky. Could it be that this change caused the deadlock?

@benjaminp how often does it reproduce?

wilwell avatar Jun 03 '24 08:06 wilwell

This deadlock could be resolved with async return of the worker to the pool, but probably there is an error in the code logic, which triggers this deadlock

Probably we need to add the assert if we are trying to borrow new worker, but there is no free one.

wilwell avatar Jun 03 '24 08:06 wilwell

One thing I noticed is that the build that first the this deadlock printed the UI message about starting a new Javac multiplex worker. So, I'm wondering if the problem is all in the worker creation code being stuck when a worker is (re)created.

benjaminp avatar Jun 03 '24 16:06 benjaminp

@benjaminp could you please attach info log if you catch this behaviour again?

I suspect that functionality could have bugs in with multiplex workers

wilwell avatar Jun 06 '24 13:06 wilwell

I'm seeing this every single time on Bazel 8. https://github.com/bazelbuild/bazel/issues/25076 has a similar jstack and also the requested java.log.

fmeum avatar Jan 27 '25 09:01 fmeum

@bigelephant29 could you have a look at this and #25076? (and de-dup accordingly)

zhengwei143 avatar Jan 27 '25 09:01 zhengwei143

We've proposed a potential solution with specifying --experimental_use_new_worker_pool in #25076. Please give it a try and see if the problem goes away. If not, please let us know and re-open this issue.

I'm marking this issue as duplicate.

bigelephant29 avatar Jan 28 '25 09:01 bigelephant29