bazel icon indicating copy to clipboard operation
bazel copied to clipboard

Bazel stuck during remote execution

Open dkashyn-sfdc opened this issue 1 year ago • 49 comments

Description of the bug:

Bazel build current state

[65,456 / 65,465] PackageZip ...rcejar.jar; 20129s remote

when there is no such issue observed this run takes less than a minute if not less than 10s.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

There is no easy way to reproduce it and some noticeable % of our RBE runs are stuck forever without Bazel finishing properly.

Here is the thread dump of Bazel server and there is no activity on RBE side workers at the same time

Which operating system are you running Bazel on?

5.14.0-362.18.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Jan 3 15:54:45 EST 2024 x86_64 x86_64 x86_64 GNU/Linux

What is the output of bazel info release?

release 7.0.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

We do have the following in .bazelrc

common:remote --jobs=300
common:remote --remote_timeout=3600

dkashyn-sfdc avatar Mar 08 '24 23:03 dkashyn-sfdc

scratch_45.txt here is the thread dump of Bazel server when this is happening

dkashyn-sfdc avatar Mar 08 '24 23:03 dkashyn-sfdc

I recommend not using --experimental_remote_execution_keepalive.

benjaminp avatar Mar 08 '24 23:03 benjaminp

Let us try out to remove this!

dkashyn-sfdc avatar Mar 08 '24 23:03 dkashyn-sfdc

Should it TO even with keep alive though?

dkashyn-sfdc avatar Mar 08 '24 23:03 dkashyn-sfdc

Still happening even without keep alive

 cat .bazelrc |grep keep
#common:remote --experimental_remote_execution_keepalive
[65,553 / 65,563] .... 1966s remote

dkashyn-sfdc avatar Mar 11 '24 17:03 dkashyn-sfdc

bazel_server_dump.txt another dump with keep_alive off

dkashyn-sfdc avatar Mar 11 '24 17:03 dkashyn-sfdc

FYI @werkt this is my colleague and we're using Buildfarm

jasonschroeder-sfdc avatar Mar 11 '24 22:03 jasonschroeder-sfdc

From both stack dumps, it seems like bazel is waiting for the remote execution to be done. Did it happen for Bazel 6? Do the logs from remote server tell anything?

coeuvre avatar Mar 12 '24 10:03 coeuvre

We are only starting to experiment with RBE so we don't have pre-7.0 baseline, unfortunately.

Regarding remote server doing something: I've checked BuildFarm workers and there were no active tasks there.

dkashyn-sfdc avatar Mar 12 '24 14:03 dkashyn-sfdc

Another suspicious stack frame is:


"remote-executor-15" #3372 [22233] prio=5 os_prio=0 cpu=238.42ms elapsed=12851.39s allocated=100633K defined_classes=0 tid=0x00007fb6ac44d050 nid=22233 in Object.wait()  [0x00007fb7562ac000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait0([email protected]/Native Method)
	- waiting on <no object reference available>
	at java.lang.Object.wait([email protected]/Object.java:366)
	at java.lang.Object.wait([email protected]/Object.java:339)
	at com.google.devtools.build.lib.server.GrpcServerImpl$BlockingStreamObserver.onNext(GrpcServerImpl.java:175)
	- locked <0x000000042f67b158> (a com.google.devtools.build.lib.server.GrpcServerImpl$BlockingStreamObserver)
	at com.google.devtools.build.lib.server.GrpcServerImpl$RpcOutputStream.write(GrpcServerImpl.java:253)
	at com.google.devtools.build.lib.util.io.DelegatingOutErr$DelegatingOutputStream.write(DelegatingOutErr.java:89)
	at java.io.BufferedOutputStream.flushBuffer([email protected]/BufferedOutputStream.java:125)
	at java.io.BufferedOutputStream.implFlush([email protected]/BufferedOutputStream.java:252)
	at java.io.BufferedOutputStream.flush([email protected]/BufferedOutputStream.java:240)
	at com.google.devtools.build.lib.runtime.UiEventHandler$FullyBufferedOutputStream.flush(UiEventHandler.java:146)
	at com.google.devtools.build.lib.util.io.AnsiTerminal.flush(AnsiTerminal.java:192)
	at com.google.devtools.build.lib.runtime.UiEventHandler.handleLocked(UiEventHandler.java:373)
	- locked <0x000000042f67b2f0> (a com.google.devtools.build.lib.runtime.UiEventHandler)
	at com.google.devtools.build.lib.runtime.UiEventHandler.handleInternal(UiEventHandler.java:443)
	at com.google.devtools.build.lib.runtime.UiEventHandler.handle(UiEventHandler.java:471)
	at com.google.devtools.build.lib.events.Reporter.handle(Reporter.java:127)
	at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.reportUploadError(ByteStreamBuildEventArtifactUploader.java:342)
	at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.lambda$uploadLocalFiles$5(ByteStreamBuildEventArtifactUploader.java:371)
	at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader$$Lambda/0x00007fb75c9f5568.apply(Unknown Source)
	at io.reactivex.rxjava3.internal.operators.single.SingleResumeNext$ResumeMainSingleObserver.onError(SingleResumeNext.java:73)
	at io.reactivex.rxjava3.internal.operators.completable.CompletableToSingle$ToSingle.onError(CompletableToSingle.java:73)
	at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.tryOnError(CompletableCreate.java:91)
	at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.onError(CompletableCreate.java:77)
	at com.google.devtools.build.lib.remote.util.RxFutures$OnceCompletableOnSubscribe$1.onFailure(RxFutures.java:102)
	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1119)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:642)
	at java.lang.Thread.runWith([email protected]/Thread.java:1596)
	at java.lang.Thread.run([email protected]/Thread.java:1583)

It seems like the Bazel server has trouble sending message to the client. Can you share the list of flags your build uses? And maybe try disabling BEP/BES?

coeuvre avatar Mar 13 '24 09:03 coeuvre

BEP is async and this is confirmed to be working properly in an async way starting from 7.0

build --bes_backend=grpcs://...
build --bes_upload_mode=fully_async
build --build_event_text_file_upload_mode=fully_async
build --build_event_binary_file_upload_mode=fully_async
build --build_event_json_file_upload_mode=fully_async

Here is the full list of remote-related configs

common:remote_sfw_uswest2 --remote_executor=grpcs://bazel-buildfarm....
common:remote_sfw_uswest2 --remote_cache=grpcs://bazel-buildfarm....
common:remote_sfw_uswest2 --config=remote

# remote defaults.
common:remote --define=EXECUTOR=remote
common:remote --experimental_remote_cache_compression  # https://github.com/buildbuddy-io/buildbuddy/blob/master/.bazelrc#L32
common:remote --jobs=300
common:remote --remote_timeout=3600
common:remote --build_metadata=RBE=True
common:remote --platforms=@rbe_default//config:platform
common:remote --host_platform=@rbe_default//config:platform
common:remote --extra_execution_platforms=@rbe_default//config:platform
common:remote --crosstool_top=@rbe_default//cc:toolchain
common:remote --extra_toolchains=@rbe_default//config:cc-toolchain
common:remote --disk_cache=
common:remote --cpu=k8 --host_cpu=k8 # force tools to use intel x86_64 CPU (in case you are trying on an Apple Mac)
#common:remote --experimental_remote_execution_keepalive

dkashyn-sfdc avatar Mar 13 '24 13:03 dkashyn-sfdc

From the stack trace above, Bazel encountered errors when uploading blobs to BES/BEP. The errors were not printed out because Bazel server stuck at sending the message to client.

I suggested to try disabling BES/BEP because by doing so we know whether this (the upload error) is the reason why Bazel stuck.

coeuvre avatar Mar 13 '24 13:03 coeuvre

I'll try to repro it without BES/BEP in next couple of days. The issue still should be fixed though even if this is because of BEP/BES ;) We specifically asking for async.

What I remember from server logs earlier: there were no traces of any uploads in other Bazel server logs.

dkashyn-sfdc avatar Mar 13 '24 15:03 dkashyn-sfdc

❯ cat .bazelrc | grep -e bes -e event
#build --bes_backend=grpcs://...
#build --bes_upload_mode=fully_async
#build --build_event_text_file_upload_mode=fully_async
#build --build_event_binary_file_upload_mode=fully_async
#build --build_event_json_file_upload_mode=fully_async

I see image

rbe_another_dump.txt

This time RBE server might be not really healthy yet in this case I would expect Bazel to fail after a reasonable timeout.

dkashyn-sfdc avatar Mar 13 '24 22:03 dkashyn-sfdc

I'll try to repro it without BES/BEP in next couple of days. The issue still should be fixed though even if this is because of BEP/BES ;) We specifically asking for async.

Right, we are still investigate what is the root cause and I am trying to remove the variances as much as possible.

From the new stack trace:

skyframe-evaluator-execution-150" #12989 [603421] daemon prio=5 os_prio=0 cpu=3342.37ms elapsed=1629.24s allocated=1920M defined_classes=0 tid=0x00007f7c5c24ee30 nid=603421 waiting on condition  [0x00007f7b13432000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x000000056776f8d0> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:269)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:756)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos([email protected]/AbstractQueuedSynchronizer.java:1126)
	at java.util.concurrent.CountDownLatch.await([email protected]/CountDownLatch.java:276)
	at io.reactivex.rxjava3.internal.observers.BlockingMultiObserver.blockingAwait(BlockingMultiObserver.java:134)
	at io.reactivex.rxjava3.core.Completable.blockingAwait(Completable.java:1497)
	at com.google.devtools.build.lib.remote.RemoteExecutionCache.ensureInputsPresent(RemoteExecutionCache.java:117)
	at com.google.devtools.build.lib.remote.RemoteExecutionService.uploadInputsIfNotPresent(RemoteExecutionService.java:1451)
	at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$2(RemoteSpawnRunner.java:272)
	at com.google.devtools.build.lib.remote.RemoteSpawnRunner$$Lambda/0x00007f7d0c88b4e8.call(Unknown Source)
	at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:245)
	at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:127)
	at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:116)
	at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:265)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
        ...

it seems like Bazel stuck at uploading inputs to remote server. Since you set --remote_timeout=3600, Bazel will only time out after 1 hour. Can you try with a lower value to check whether Bazel can time out?

coeuvre avatar Mar 14 '24 09:03 coeuvre

It seems relate to --experimental_throttle_remote_action_building. Can you try with --noexperimental_throttle_remote_action_building?

coeuvre avatar Mar 14 '24 12:03 coeuvre

As you can see here https://github.com/bazelbuild/bazel/issues/21626#issuecomment-1994457946 we don't have this enabled. I still can turn it off if required though.

dkashyn-sfdc avatar Mar 14 '24 13:03 dkashyn-sfdc

Ok... I see it is enabled by default. Let me turn that off

  @Option(
      name = "experimental_throttle_remote_action_building",
      defaultValue = "true",

dkashyn-sfdc avatar Mar 14 '24 13:03 dkashyn-sfdc

What to do with BES/BEP and keep_alive that were turned off earlier ?

dkashyn-sfdc avatar Mar 14 '24 13:03 dkashyn-sfdc

Let's keep them off for now.

coeuvre avatar Mar 14 '24 13:03 coeuvre

This flag is 8.x only, based on https://github.com/bazelbuild/bazel/commit/294c904c30fe305f60e548a438940d5ab60a15b4

I got

ERROR: --noexperimental_throttle_remote_action_building :: Unrecognized option: --noexperimental_throttle_remote_action_building

for

cat .bazelversion 
7.0.1

So I'd say this is definitely not this property.

dkashyn-sfdc avatar Mar 14 '24 13:03 dkashyn-sfdc

So I'd say this is definitely not this property.

Although the flag was not available for 7.0.1, the feature is still there and on.

The flag is available in Bazel 7.1.0. Can you use that version instead and try?

coeuvre avatar Mar 14 '24 14:03 coeuvre

It might take a couple of weeks for us to upgrade to 7.1 . We definitely going to do this in either case yet amount of features we currently disabled is someway worrying...

Do you really think this "throttling" is a prime suspect here?

dkashyn-sfdc avatar Mar 14 '24 14:03 dkashyn-sfdc

From the stack trace, it's highly likely. Would you mind patching https://github.com/bazelbuild/bazel/commit/294c904c30fe305f60e548a438940d5ab60a15b4 on top of 7.0.1 to give it a try?

coeuvre avatar Mar 14 '24 14:03 coeuvre

Ok, let us focus on testing this particular condition next week then.

dkashyn-sfdc avatar Mar 14 '24 14:03 dkashyn-sfdc

We were able to use 7.1 and TO is still there

I have another run stuck in

[65,743 / 65,759] ...; 3777s remote, disk-cache, remote-cache

with

#common:remote --experimental_remote_execution_keepalive
common:remote --noexperimental_throttle_remote_action_building

rbe_stuck_1k_seconds.txt rbe_stuck_2k_seconds.txt rbe_stuck_3and6k_seconds.txt rbe_stuck_3k_seconds.txt

dkashyn-sfdc avatar Mar 15 '24 16:03 dkashyn-sfdc

Disabling the async task cache fixes this problem after inspiring up to 100 writes per action on builds. Hangs every time with the task cache enabled, completes every time with it disabled.

diff --git b/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java a/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java
index cd91a72b50..ce52000954 100644
--- b/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java
+++ a/src/main/java/com/google/devtools/build/lib/remote/util/AsyncTaskCache.java
@@ -291,6 +291,8 @@ public final class AsyncTaskCache<KeyT, ValueT> {
       Action onAlreadyRunning,
       Action onAlreadyFinished,
       boolean force) {
+    return task;
+    /*
     return Single.create(
         emitter -> {
           synchronized (lock) {
@@ -342,6 +344,7 @@ public final class AsyncTaskCache<KeyT, ValueT> {
                 });
           }
         });
+        */
   }

   /**

werkt avatar Mar 16 '24 04:03 werkt

@coeuvre could you please take a look whether new stacktraces and info from @werkt is helpful with identifying the issue?

dkashyn-sfdc avatar Mar 18 '24 15:03 dkashyn-sfdc

Thanks for the stacktraces and pointers to the potential root cause. I will try to reproduce it locally now and hopefully come up with a fix.

coeuvre avatar Mar 19 '24 12:03 coeuvre