
Bazel 7.4.1: Sporadic Hangs and "Server terminated" Errors When Tagging Tests with Resource Tags

Open Ryang20718 opened this issue 1 year ago • 2 comments

Description of the bug:

We've recently upgraded to Bazel 7.4.1 from 6.5.0, and we've been noticing a lot of flakiness on our CI runners.

Some behaviors we've seen are:

  • The Bazel server hangs when running bazel coverage: no output is logged to stdout for a couple of minutes and then the server is terminated. The gap between the last log output and the server termination ranges from 5-10 seconds to upwards of 5-10 minutes, and Bazel fails to recover (peak memory usage is only ~70% of the memory allocated to the server):
Server terminated abruptly (error code: 14, error message: 'Connection reset by peer', log file: '/home/github_actions/.cache/bazel/_bazel_github_actions/049fd0d9a142b0eee346c643b8cf35e6/server/jvm.out')
  • Bazel tests time out after hitting the test time threshold (peak CPU utilization is ~70% when timing out).

We've also set the following flags to attempt to alleviate the CPU timeouts:

common --experimental_worker_for_repo_fetching=off
common --experimental_sandbox_async_tree_delete_idle_threads=0
test --local_resources=cpu=HOST_CPUS-4

I'm not entirely sure whether these two issues are related, but luckily I was able to catch a thread dump on a worker with a hung bazel test.

It took a while for me to get a stack trace as well:

226872: Unable to access root directory /proc/226872/root of target process 226872
$ sudo jstack 226872;
226872: Unable to open socket file /proc/226872/root/tmp/.java_pid468: target process 226872 doesn't respond within 10500ms or HotSpot VM not loaded
$ sudo jstack 226872;
226872: Unable to open socket file /proc/226872/root/tmp/.java_pid468: target process 226872 doesn't respond within 10500ms or HotSpot VM not loaded

hung.txt strace_hang.txt

I was able to capture an strace as well as a Java thread dump on an instance where the server is hung. We have a monorepo with multiple languages. Specifically, we tag our Python tests with cpu, gpu_memory, and memory resource tags (sketched below) and leave the rest of the tests untagged, since those tests aren't as hefty.

test:ci --local_resources=gpu_memory_mb=15360  --local_resources=memory=HOST_RAM*0.6
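
For context, the tagging looks roughly like the sketch below. This is a minimal illustration, not our actual BUILD file: the target name and amounts are hypothetical, and the "cpu:<n>" / "resources:<name>:<amount>" tags are the test-tag conventions the resource names in the --local_resources flags above are meant to match.

py_test(
    name = "heavy_inference_test",  # hypothetical target
    srcs = ["heavy_inference_test.py"],
    tags = [
        "cpu:4",                         # reserve 4 local CPUs for this test
        "resources:gpu_memory_mb:4096",  # hypothetical amount, scheduled against --local_resources=gpu_memory_mb=15360
        "resources:memory:8192",         # hypothetical amount, scheduled against --local_resources=memory=HOST_RAM*0.6
    ],
)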

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I don't have a minimal repro that I can openly share (my day job has a monorepo).

Which operating system are you running Bazel on?

Ubuntu 20.04

What is the output of bazel info release?

release 7.4.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

Any other information, logs, or outputs that you want to share?

We're using local execution with a remote cache (gRPC).

Our bazelrc flags:

Inherited 'common' options: --experimental_repository_cache_urls_as_default_canonical_id --watchfs --@io_bazel_rules_docker//transitions:enable=false --ui_actions_shown=32 --experimental_remote_cache_eviction_retries=5 --experimental_remote_cache_lease_extension --noexperimental_inmemory_dotd_files --experimental_worker_for_repo_fetching=off --experimental_sandbox_async_tree_delete_idle_threads=0 --incompatible_default_to_explicit_init_py --incompatible_allow_tags_propagation --experimental_cc_shared_library --heap_dump_on_oom 

  Inherited 'build' options: --output_filter=^// --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --compilation_mode=opt --host_compilation_mode=opt --interface_shared_objects --use_top_level_targets_for_symlinks=false --java_runtime_version=remotejdk_11 --@rules_rust//rust/settings:experimental_use_cc_common_link=true --@rules_cuda//cuda:runtime=//third_party:cuda_runtime --@rules_cuda//cuda:archs=compute_61:sm_61;compute_70:sm_70;compute_75:sm_75;compute_80:sm_80,compute_80 --@rules_cuda//cuda:copts=--std=c++17 --incompatible_strict_action_env=true --incompatible_enable_cc_toolchain_resolution --sandbox_base=/dev/shm --sandbox_tmpfs_path=/tmp --workspace_status_command=tools/get_workspace_status --action_env CACHE_EPOCH=1673041430 --flag_alias=python_flag=//rules:python_flags --flag_alias=python_monitor_flag=//rules:python_monitor_flag --flag_alias=use_repo_bridge_binary=//waabi/onboard/bin/bridge:enabled --aspects=@rules_rust//rust:defs.bzl%rust_clippy_aspect --experimental_repository_cache_hardlinks --nobuild

Ryang20718 · Nov 27 '24 08:11

It took a while for me to get a stack trace as well:

226872: Unable to access root directory /proc/226872/root of target process 226872
$ sudo jstack 226872;
226872: Unable to open socket file /proc/226872/root/tmp/.java_pid468: target process 226872 doesn't respond within 10500ms or HotSpot VM not loaded
$ sudo jstack 226872;
226872: Unable to open socket file /proc/226872/root/tmp/.java_pid468: target process 226872 doesn't respond within 10500ms or HotSpot VM not loaded

hung.txt strace_hang.txt

I was able to capture an strace as well as a Java thread dump on an instance where the server is hung.

Ryang20718 · Nov 29 '24 09:11

We are seeing something similar. Using BuildBuddy, we can see the box stops using resources after the tests run and just hangs for up to 30 minutes before completing successfully. This only occurs in CI/CD (GitHub Actions hosted runners).

NishKebab · Jun 12 '25 15:06

We figured out this was related to running on GitHub Actions Enterprise ARM boxes. Switching to x64 boxes resolved the hangs.

NishKebab · Jul 07 '25 13:07