bazel
bazel copied to clipboard
[6.5.0] Bazel just hangs after successfully completed build for a long time
Description of the bug:
Sometimes after build successfully completed (at least if I look into buildbuddy invocation), bazel just hangs and prints out Build completed successfully for many times. After that bazel process completes successfully.
Piece of CI log here:
...
[2024-03-05T11:26:21.099Z] INFO: Elapsed time: 403.120s, Critical Path: 224.13s
[2024-03-05T11:26:21.099Z] INFO: 4107 processes: 4005 remote cache hit, 54 internal, 48 processwrapper-sandbox.
[2024-03-05T11:26:21.099Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:26:22.465Z] INFO: Streaming build results to: https://build-buddy.local/invocation/312f5fbb-782a-4544-b78a-3b80ead554f3
[2024-03-05T11:26:37.322Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:26:52.194Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:27:10.255Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:27:25.125Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:27:40.009Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:27:54.873Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:28:09.730Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:28:27.793Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:28:39.985Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:28:42.505Z] INFO: Build completed successfully, 4107 total actions
This may take a long time (10 mins or more).... How can I find what realy bazel do at that time?
Which category does this issue belong to?
Core
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
I can't understand under what conditions this behavior happens yet.
Which operating system are you running Bazel on?
Ubuntu 20.04
What is the output of bazel info release?
release 6.5.0
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse HEAD ?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response
Which build flags are you using?
Can you send Bazel a SIGQUIT signal (easiest way is to hit Ctrl-\) while it's hanging, then link the resulting jvm.out file here? That should give us a clue of what Bazel is doing.
[2024-03-05T12:40:33.472Z] INFO: Reading 'startup' options from /home/jenkins/agent/bazel-build/bazelrc/ci.bazelrc: --max_idle_secs=0
[2024-03-05T12:40:33.472Z] INFO: Reading 'startup' options from /home/jenkins/agent/bazel-build/local.bazelrc: --output_user_root=/home/jenkins/agent/bazel-build@tmp/bazel-cache-user-root, --output_base=/home/jenkins/agent/bazel-build@tmp/bazel-cache-user-root/my-workspace-output-base
[2024-03-05T12:40:33.472Z] INFO: Options provided by the client:
[2024-03-05T12:40:33.472Z] Inherited 'common' options: --isatty=0 --terminal_columns=80
[2024-03-05T12:40:33.472Z] INFO: Reading rc options for 'build' from /home/jenkins/agent/bazel-build/.bazelrc:
[2024-03-05T12:40:33.472Z] Inherited 'common' options: --enable_bzlmod --lockfile_mode=error
[2024-03-05T12:40:33.472Z] INFO: Reading rc options for 'build' from /home/jenkins/agent/bazel-build/bazelrc/ci.bazelrc:
[2024-03-05T12:40:33.472Z] Inherited 'common' options: --announce_rc --show_progress_rate_limit=15 --color=yes --verbose_failures --bes_timeout=600s --experimental_remote_build_event_upload=minimal --experimental_remote_cache_compression --nolegacy_important_outputs
[2024-03-05T12:40:33.472Z] INFO: Reading rc options for 'build' from /home/jenkins/agent/bazel-build/bazelrc/debug.bazelrc:
[2024-03-05T12:40:33.472Z] 'build' options: --strip=always
[2024-03-05T12:40:33.472Z] INFO: Reading rc options for 'build' from /home/jenkins/agent/bazel-build/bazelrc/java.bazelrc:
[2024-03-05T12:40:33.472Z] 'build' options: --java_runtime_version=remotejdk_17 --java_language_version=17 --tool_java_language_version=17 --tool_java_runtime_version=remotejdk_17
[2024-03-05T12:40:33.472Z] INFO: Reading rc options for 'build' from /home/jenkins/agent/bazel-build/.bazelrc:
[2024-03-05T12:40:33.472Z] 'build' options: --workspace_status_command=hacks/bazel/workspace_status.sh --stamp --incompatible_enable_cc_toolchain_resolution --experimental_action_cache_store_output_metadata --http_timeout_scaling=5.0
[2024-03-05T12:40:33.472Z] INFO: Reading rc options for 'build' from /home/jenkins/agent/bazel-build/bazelrc/ci.bazelrc:
[2024-03-05T12:40:33.472Z] 'build' options: --remote_download_toplevel --remote_timeout=3600 --grpc_keepalive_time=30s --bes_results_url=https://build-buddy.my-dev.com/invocation/ --bes_backend=grpc://buildbuddy.build.svc.cluster.local:1985 --remote_cache=grpc://buildbuddy.build.svc.cluster.local:1985 --remote_timeout=3600
The problem is I can't understand under what conditions this behavior happens and how to reproduce it.
It could be that the build produced an abnormally large amount of log files that need to be uploaded to the remote cache (per --experimental_remote_build_event_upload=minimal). Next time it happens, in addition to capturing the JVM thread dump, I'd keep an eye on network activity.
We're experiencing the same Bazel issue where it hangs for extended periods (10 minutes to an hour) after successfully completing RBE cross-compilation of JAX targets (macOS x86 host to remote Linux x86 VMs). The "Build completed successfully" message prints repeatedly. Adding the --experimental_remote_build_event_upload=minimal flag did not resolve the issue, preventing us from using this job in presubmits.
It could also be that Bazel was downloading toplevel outputs in the background if you set --remote_download_toplevel (although I have no ideas why there wasn't progress report).
For presubmits, can you try with --remote_download_minimal?
If you have already set --remote_download_minimal, please capture the JVM thread dump while it's printing the message repeatedly as per https://github.com/bazelbuild/bazel/issues/21568#issuecomment-1978702073.
Thanks @coeuvre , using --remote_download_minimal solved the issue with our builds.
Hi! We face the same issue, here is the jstack. It indeeds seems like a stalled grpc. jvm.out.log
We don't use any --remote_download_outputs flags.
Thanks for the stack trace. It looks like the the hanging is caused by --experimental_remote_upload_async. I will work on a fix.
@bazel-io fork 7.2.0
I think I'm seeing something similar on bazel 7.3.1, is there a way to check if it's the same issue?
My symptoms are that after a build succeeds or fails, bazel hangs right after:
Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
INFO: Elapsed time: 135.870s, Critical Path: 74.66s
INFO: 11701 processes: 89 remote cache hit, 5768 internal, 2202 linux-sandbox, 3642 local.
ERROR: Build did NOT complete successfully
we're also using a remote cache hosted on GCP, but without experimental cache flags
@snakethatlovesstaticlibs Can you please share the stack trace when Bazel is hanging?
(you can capture the stack trace by jcmd <pid> Thread.dump_to_file <file>.)
I’m seeing this too, but so far only in a CI (GitHub Actions) environment so work on getting that info.
@coeuvre
Some updates: I used something like this to get jvm.out from our pipeline: $(set -x; sleep 8m; ps aux | grep bazel; kill -3 $(pgrep -f --newest 'bazel'); sleep 20; pkill -9 java) & disown. Another pipeline ran overnight, and it looks like it's not a complete deadlock, but after about 190 minutes, it's able to exit and then continue. jvm.out attached (renamed to jvm.log so that Github will allow it to be uploaded)
In CI, I was able to solve my issue by setting --max_idle_secs=1, but if this ever happens locally, that would be a pretty big hit to incremental build speed