Bazel stuck during remote execution
Description of the bug:
Bazel build current state
[65,456 / 65,465] PackageZip ...rcejar.jar; 20129s remote
when there is no such issue observed this run takes less than a minute if not less than 10s.
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
There is no easy way to reproduce it and some noticeable % of our RBE runs are stuck forever without Bazel finishing properly.
Here is the thread dump of Bazel server and there is no activity on RBE side workers at the same time
Which operating system are you running Bazel on?
5.14.0-362.18.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Jan 3 15:54:45 EST 2024 x86_64 x86_64 x86_64 GNU/Linux
What is the output of bazel info release?
release 7.0.1
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse HEAD ?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
We do have the following in .bazelrc
common:remote --jobs=300
common:remote --remote_timeout=3600
scratch_45.txt here is the thread dump of Bazel server when this is happening
I recommend not using --experimental_remote_execution_keepalive.
Let us try out to remove this!
Should it TO even with keep alive though?
Still happening even without keep alive
cat .bazelrc |grep keep
#common:remote --experimental_remote_execution_keepalive
[65,553 / 65,563] .... 1966s remote
bazel_server_dump.txt
another dump with keep_alive off
FYI @werkt this is my colleague and we're using Buildfarm
From both stack dumps, it seems like bazel is waiting for the remote execution to be done. Did it happen for Bazel 6? Do the logs from remote server tell anything?
We are only starting to experiment with RBE so we don't have pre-7.0 baseline, unfortunately.
Regarding remote server doing something: I've checked BuildFarm workers and there were no active tasks there.
Another suspicious stack frame is:
"remote-executor-15" #3372 [22233] prio=5 os_prio=0 cpu=238.42ms elapsed=12851.39s allocated=100633K defined_classes=0 tid=0x00007fb6ac44d050 nid=22233 in Object.wait() [0x00007fb7562ac000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait0([email protected]/Native Method)
- waiting on <no object reference available>
at java.lang.Object.wait([email protected]/Object.java:366)
at java.lang.Object.wait([email protected]/Object.java:339)
at com.google.devtools.build.lib.server.GrpcServerImpl$BlockingStreamObserver.onNext(GrpcServerImpl.java:175)
- locked <0x000000042f67b158> (a com.google.devtools.build.lib.server.GrpcServerImpl$BlockingStreamObserver)
at com.google.devtools.build.lib.server.GrpcServerImpl$RpcOutputStream.write(GrpcServerImpl.java:253)
at com.google.devtools.build.lib.util.io.DelegatingOutErr$DelegatingOutputStream.write(DelegatingOutErr.java:89)
at java.io.BufferedOutputStream.flushBuffer([email protected]/BufferedOutputStream.java:125)
at java.io.BufferedOutputStream.implFlush([email protected]/BufferedOutputStream.java:252)
at java.io.BufferedOutputStream.flush([email protected]/BufferedOutputStream.java:240)
at com.google.devtools.build.lib.runtime.UiEventHandler$FullyBufferedOutputStream.flush(UiEventHandler.java:146)
at com.google.devtools.build.lib.util.io.AnsiTerminal.flush(AnsiTerminal.java:192)
at com.google.devtools.build.lib.runtime.UiEventHandler.handleLocked(UiEventHandler.java:373)
- locked <0x000000042f67b2f0> (a com.google.devtools.build.lib.runtime.UiEventHandler)
at com.google.devtools.build.lib.runtime.UiEventHandler.handleInternal(UiEventHandler.java:443)
at com.google.devtools.build.lib.runtime.UiEventHandler.handle(UiEventHandler.java:471)
at com.google.devtools.build.lib.events.Reporter.handle(Reporter.java:127)
at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.reportUploadError(ByteStreamBuildEventArtifactUploader.java:342)
at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.lambda$uploadLocalFiles$5(ByteStreamBuildEventArtifactUploader.java:371)
at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader$$Lambda/0x00007fb75c9f5568.apply(Unknown Source)
at io.reactivex.rxjava3.internal.operators.single.SingleResumeNext$ResumeMainSingleObserver.onError(SingleResumeNext.java:73)
at io.reactivex.rxjava3.internal.operators.completable.CompletableToSingle$ToSingle.onError(CompletableToSingle.java:73)
at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.tryOnError(CompletableCreate.java:91)
at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.onError(CompletableCreate.java:77)
at com.google.devtools.build.lib.remote.util.RxFutures$OnceCompletableOnSubscribe$1.onFailure(RxFutures.java:102)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1119)
at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1144)
at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:642)
at java.lang.Thread.runWith([email protected]/Thread.java:1596)
at java.lang.Thread.run([email protected]/Thread.java:1583)
It seems like the Bazel server has trouble sending message to the client. Can you share the list of flags your build uses? And maybe try disabling BEP/BES?
BEP is async and this is confirmed to be working properly in an async way starting from 7.0
build --bes_backend=grpcs://...
build --bes_upload_mode=fully_async
build --build_event_text_file_upload_mode=fully_async
build --build_event_binary_file_upload_mode=fully_async
build --build_event_json_file_upload_mode=fully_async
Here is the full list of remote-related configs
common:remote_sfw_uswest2 --remote_executor=grpcs://bazel-buildfarm....
common:remote_sfw_uswest2 --remote_cache=grpcs://bazel-buildfarm....
common:remote_sfw_uswest2 --config=remote
# remote defaults.
common:remote --define=EXECUTOR=remote
common:remote --experimental_remote_cache_compression # https://github.com/buildbuddy-io/buildbuddy/blob/master/.bazelrc#L32
common:remote --jobs=300
common:remote --remote_timeout=3600
common:remote --build_metadata=RBE=True
common:remote --platforms=@rbe_default//config:platform
common:remote --host_platform=@rbe_default//config:platform
common:remote --extra_execution_platforms=@rbe_default//config:platform
common:remote --crosstool_top=@rbe_default//cc:toolchain
common:remote --extra_toolchains=@rbe_default//config:cc-toolchain
common:remote --disk_cache=
common:remote --cpu=k8 --host_cpu=k8 # force tools to use intel x86_64 CPU (in case you are trying on an Apple Mac)
#common:remote --experimental_remote_execution_keepalive
From the stack trace above, Bazel encountered errors when uploading blobs to BES/BEP. The errors were not printed out because Bazel server stuck at sending the message to client.
I suggested to try disabling BES/BEP because by doing so we know whether this (the upload error) is the reason why Bazel stuck.
I'll try to repro it without BES/BEP in next couple of days. The issue still should be fixed though even if this is because of BEP/BES ;) We specifically asking for async.
What I remember from server logs earlier: there were no traces of any uploads in other Bazel server logs.
❯ cat .bazelrc | grep -e bes -e event
#build --bes_backend=grpcs://...
#build --bes_upload_mode=fully_async
#build --build_event_text_file_upload_mode=fully_async
#build --build_event_binary_file_upload_mode=fully_async
#build --build_event_json_file_upload_mode=fully_async
I see
This time RBE server might be not really healthy yet in this case I would expect Bazel to fail after a reasonable timeout.
I'll try to repro it without BES/BEP in next couple of days. The issue still should be fixed though even if this is because of BEP/BES ;) We specifically asking for
async.
Right, we are still investigate what is the root cause and I am trying to remove the variances as much as possible.
From the new stack trace:
skyframe-evaluator-execution-150" #12989 [603421] daemon prio=5 os_prio=0 cpu=3342.37ms elapsed=1629.24s allocated=1920M defined_classes=0 tid=0x00007f7c5c24ee30 nid=603421 waiting on condition [0x00007f7b13432000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for <0x000000056776f8d0> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:269)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:756)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos([email protected]/AbstractQueuedSynchronizer.java:1126)
at java.util.concurrent.CountDownLatch.await([email protected]/CountDownLatch.java:276)
at io.reactivex.rxjava3.internal.observers.BlockingMultiObserver.blockingAwait(BlockingMultiObserver.java:134)
at io.reactivex.rxjava3.core.Completable.blockingAwait(Completable.java:1497)
at com.google.devtools.build.lib.remote.RemoteExecutionCache.ensureInputsPresent(RemoteExecutionCache.java:117)
at com.google.devtools.build.lib.remote.RemoteExecutionService.uploadInputsIfNotPresent(RemoteExecutionService.java:1451)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$2(RemoteSpawnRunner.java:272)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner$$Lambda/0x00007f7d0c88b4e8.call(Unknown Source)
at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:245)
at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:127)
at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:116)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:265)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
...
it seems like Bazel stuck at uploading inputs to remote server. Since you set --remote_timeout=3600, Bazel will only time out after 1 hour. Can you try with a lower value to check whether Bazel can time out?
It seems relate to --experimental_throttle_remote_action_building. Can you try with --noexperimental_throttle_remote_action_building?
As you can see here https://github.com/bazelbuild/bazel/issues/21626#issuecomment-1994457946 we don't have this enabled. I still can turn it off if required though.
Ok... I see it is enabled by default. Let me turn that off
@Option(
name = "experimental_throttle_remote_action_building",
defaultValue = "true",
What to do with BES/BEP and keep_alive that were turned off earlier ?
Let's keep them off for now.
This flag is 8.x only, based on https://github.com/bazelbuild/bazel/commit/294c904c30fe305f60e548a438940d5ab60a15b4
I got
ERROR: --noexperimental_throttle_remote_action_building :: Unrecognized option: --noexperimental_throttle_remote_action_building
for
cat .bazelversion
7.0.1
So I'd say this is definitely not this property.
So I'd say this is definitely not this property.
Although the flag was not available for 7.0.1, the feature is still there and on.
The flag is available in Bazel 7.1.0. Can you use that version instead and try?
It might take a couple of weeks for us to upgrade to 7.1 . We definitely going to do this in either case yet amount of features we currently disabled is someway worrying...
Do you really think this "throttling" is a prime suspect here?
From the stack trace, it's highly likely. Would you mind patching https://github.com/bazelbuild/bazel/commit/294c904c30fe305f60e548a438940d5ab60a15b4 on top of 7.0.1 to give it a try?
Ok, let us focus on testing this particular condition next week then.
We were able to use 7.1 and TO is still there
I have another run stuck in
[65,743 / 65,759] ...; 3777s remote, disk-cache, remote-cache
with
#common:remote --experimental_remote_execution_keepalive
common:remote --noexperimental_throttle_remote_action_building
rbe_stuck_1k_seconds.txt rbe_stuck_2k_seconds.txt rbe_stuck_3and6k_seconds.txt rbe_stuck_3k_seconds.txt
Disabling the async task cache fixes this problem after inspiring up to 100 writes per action on builds. Hangs every time with the task cache enabled, completes every time with it disabled.
diff --git b/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java a/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java
index cd91a72b50..ce52000954 100644
--- b/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java
+++ a/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java
@@ -291,6 +291,8 @@ public final class AsyncTaskCache<KeyT, ValueT> {
Action onAlreadyRunning,
Action onAlreadyFinished,
boolean force) {
+ return task;
+ /*
return Single.create(
emitter -> {
synchronized (lock) {
@@ -342,6 +344,7 @@ public final class AsyncTaskCache<KeyT, ValueT> {
});
}
});
+ */
}
/**
@coeuvre could you please take a look whether new stacktraces and info from @werkt is helpful with identifying the issue?
Thanks for the stacktraces and pointers to the potential root cause. I will try to reproduce it locally now and hopefully come up with a fix.