bazel icon indicating copy to clipboard operation
bazel copied to clipboard

[6.5.0] Bazel just hangs after successfully completed build for a long time

Open farcop opened this issue 1 year ago • 8 comments

Description of the bug:

Sometimes after build successfully completed (at least if I look into buildbuddy invocation), bazel just hangs and prints out Build completed successfully for many times. After that bazel process completes successfully.

Piece of CI log here:

...
[2024-03-05T11:26:21.099Z] INFO: Elapsed time: 403.120s, Critical Path: 224.13s

[2024-03-05T11:26:21.099Z] INFO: 4107 processes: 4005 remote cache hit, 54 internal, 48 processwrapper-sandbox.
[2024-03-05T11:26:21.099Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:26:22.465Z] INFO: Streaming build results to: https://build-buddy.local/invocation/312f5fbb-782a-4544-b78a-3b80ead554f3
[2024-03-05T11:26:37.322Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:26:52.194Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:27:10.255Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:27:25.125Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:27:40.009Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:27:54.873Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:28:09.730Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:28:27.793Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:28:39.985Z] INFO: Build completed successfully, 4107 total actions
[2024-03-05T11:28:42.505Z] INFO: Build completed successfully, 4107 total actions

This may take a long time (10 mins or more).... How can I find what realy bazel do at that time?

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I can't understand under what conditions this behavior happens yet.

Which operating system are you running Bazel on?

Ubuntu 20.04

What is the output of bazel info release?

release 6.5.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

farcop avatar Mar 05 '24 11:03 farcop

Which build flags are you using?

Can you send Bazel a SIGQUIT signal (easiest way is to hit Ctrl-\) while it's hanging, then link the resulting jvm.out file here? That should give us a clue of what Bazel is doing.

tjgq avatar Mar 05 '24 12:03 tjgq

[2024-03-05T12:40:33.472Z] INFO: Reading 'startup' options from /home/jenkins/agent/bazel-build/bazelrc/ci.bazelrc: --max_idle_secs=0
[2024-03-05T12:40:33.472Z] INFO: Reading 'startup' options from /home/jenkins/agent/bazel-build/local.bazelrc: --output_user_root=/home/jenkins/agent/bazel-build@tmp/bazel-cache-user-root, --output_base=/home/jenkins/agent/bazel-build@tmp/bazel-cache-user-root/my-workspace-output-base
[2024-03-05T12:40:33.472Z] INFO: Options provided by the client:
[2024-03-05T12:40:33.472Z]   Inherited 'common' options: --isatty=0 --terminal_columns=80
[2024-03-05T12:40:33.472Z] INFO: Reading rc options for 'build' from /home/jenkins/agent/bazel-build/.bazelrc:
[2024-03-05T12:40:33.472Z]   Inherited 'common' options: --enable_bzlmod --lockfile_mode=error
[2024-03-05T12:40:33.472Z] INFO: Reading rc options for 'build' from /home/jenkins/agent/bazel-build/bazelrc/ci.bazelrc:
[2024-03-05T12:40:33.472Z]   Inherited 'common' options: --announce_rc --show_progress_rate_limit=15 --color=yes --verbose_failures --bes_timeout=600s --experimental_remote_build_event_upload=minimal --experimental_remote_cache_compression --nolegacy_important_outputs
[2024-03-05T12:40:33.472Z] INFO: Reading rc options for 'build' from /home/jenkins/agent/bazel-build/bazelrc/debug.bazelrc:
[2024-03-05T12:40:33.472Z]   'build' options: --strip=always
[2024-03-05T12:40:33.472Z] INFO: Reading rc options for 'build' from /home/jenkins/agent/bazel-build/bazelrc/java.bazelrc:
[2024-03-05T12:40:33.472Z]   'build' options: --java_runtime_version=remotejdk_17 --java_language_version=17 --tool_java_language_version=17 --tool_java_runtime_version=remotejdk_17
[2024-03-05T12:40:33.472Z] INFO: Reading rc options for 'build' from /home/jenkins/agent/bazel-build/.bazelrc:
[2024-03-05T12:40:33.472Z]   'build' options: --workspace_status_command=hacks/bazel/workspace_status.sh --stamp --incompatible_enable_cc_toolchain_resolution --experimental_action_cache_store_output_metadata --http_timeout_scaling=5.0
[2024-03-05T12:40:33.472Z] INFO: Reading rc options for 'build' from /home/jenkins/agent/bazel-build/bazelrc/ci.bazelrc:
[2024-03-05T12:40:33.472Z]   'build' options: --remote_download_toplevel --remote_timeout=3600 --grpc_keepalive_time=30s --bes_results_url=https://build-buddy.my-dev.com/invocation/ --bes_backend=grpc://buildbuddy.build.svc.cluster.local:1985 --remote_cache=grpc://buildbuddy.build.svc.cluster.local:1985 --remote_timeout=3600

The problem is I can't understand under what conditions this behavior happens and how to reproduce it.

farcop avatar Mar 05 '24 12:03 farcop

It could be that the build produced an abnormally large amount of log files that need to be uploaded to the remote cache (per --experimental_remote_build_event_upload=minimal). Next time it happens, in addition to capturing the JVM thread dump, I'd keep an eye on network activity.

tjgq avatar Mar 05 '24 13:03 tjgq

We're experiencing the same Bazel issue where it hangs for extended periods (10 minutes to an hour) after successfully completing RBE cross-compilation of JAX targets (macOS x86 host to remote Linux x86 VMs). The "Build completed successfully" message prints repeatedly. Adding the --experimental_remote_build_event_upload=minimal flag did not resolve the issue, preventing us from using this job in presubmits.

kanglant avatar Apr 18 '24 17:04 kanglant

It could also be that Bazel was downloading toplevel outputs in the background if you set --remote_download_toplevel (although I have no ideas why there wasn't progress report).

For presubmits, can you try with --remote_download_minimal?

If you have already set --remote_download_minimal, please capture the JVM thread dump while it's printing the message repeatedly as per https://github.com/bazelbuild/bazel/issues/21568#issuecomment-1978702073.

coeuvre avatar Apr 19 '24 08:04 coeuvre

Thanks @coeuvre , using --remote_download_minimal solved the issue with our builds.

kanglant avatar Apr 22 '24 17:04 kanglant

Hi! We face the same issue, here is the jstack. It indeeds seems like a stalled grpc. jvm.out.log

We don't use any --remote_download_outputs flags.

polac24 avatar May 15 '24 18:05 polac24

Thanks for the stack trace. It looks like the the hanging is caused by --experimental_remote_upload_async. I will work on a fix.

coeuvre avatar May 16 '24 12:05 coeuvre

@bazel-io fork 7.2.0

brentleyjones avatar May 22 '24 15:05 brentleyjones

I think I'm seeing something similar on bazel 7.3.1, is there a way to check if it's the same issue?

My symptoms are that after a build succeeds or fails, bazel hangs right after:

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
INFO: Elapsed time: 135.870s, Critical Path: 74.66s
INFO: 11701 processes: 89 remote cache hit, 5768 internal, 2202 linux-sandbox, 3642 local.
ERROR: Build did NOT complete successfully

we're also using a remote cache hosted on GCP, but without experimental cache flags

@snakethatlovesstaticlibs Can you please share the stack trace when Bazel is hanging?

(you can capture the stack trace by jcmd <pid> Thread.dump_to_file <file>.)

coeuvre avatar Sep 04 '24 08:09 coeuvre

I’m seeing this too, but so far only in a CI (GitHub Actions) environment so work on getting that info.

aaomidi avatar Sep 04 '24 14:09 aaomidi

@coeuvre

Some updates: I used something like this to get jvm.out from our pipeline: $(set -x; sleep 8m; ps aux | grep bazel; kill -3 $(pgrep -f --newest 'bazel'); sleep 20; pkill -9 java) & disown. Another pipeline ran overnight, and it looks like it's not a complete deadlock, but after about 190 minutes, it's able to exit and then continue. jvm.out attached (renamed to jvm.log so that Github will allow it to be uploaded)

jvm.log

In CI, I was able to solve my issue by setting --max_idle_secs=1, but if this ever happens locally, that would be a pretty big hit to incremental build speed