Bazel 7.4.1 Sporadic Hangs and Server Terminated when tagging tests with resource tags
Description of the bug:
We've recently upgraded from Bazel 6.5.0 to 7.4.1 and we've been noticing a lot of flakiness on our CI runners.
Some behaviors we've seen:
- Bazel server hangs when running bazel coverage: no output is logged to stdout for a couple of minutes and then the server is terminated (the gap between the last log output and the server termination ranges from 5-10 seconds to upwards of 5-10 minutes). Bazel fails to recover, even though peak memory usage is only 70% of the server's total allocated memory:
Server terminated abruptly (error code: 14, error message: 'Connection reset by peer', log file: '/home/github_actions/.cache/bazel/_bazel_github_actions/049fd0d9a142b0eee346c643b8cf35e6/server/jvm.out')
- Bazel tests time out after hitting the test timeout threshold (peak CPU utilization is ~70% when they time out)
We've also set the following flags to attempt to alleviate these CPU timeouts:
common --experimental_worker_for_repo_fetching=off
common --experimental_sandbox_async_tree_delete_idle_threads=0
test --local_resources=cpu=HOST_CPUS-4
I'm not entirely sure if these two issues are related, but luckily I was able to catch a thread dump on a worker with a hung bazel test. It took a while for me to get a stack as well:
226872: Unable to access root directory /proc/226872/root of target process 226872
$ sudo jstack 226872;
226872: Unable to open socket file /proc/226872/root/tmp/.java_pid468: target process 226872 doesn't respond within 10500ms or HotSpot VM not loaded
$ sudo jstack 226872;
226872: Unable to open socket file /proc/226872/root/tmp/.java_pid468: target process 226872 doesn't respond within 10500ms or HotSpot VM not loaded
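For reference, a minimal sketch of alternative ways to grab a thread dump when jstack can't attach (this assumes the hung process is the Bazel server JVM; only jstack was actually run here, the rest is a suggestion):
$ SERVER_PID=$(bazel info server_pid)   # reports the server JVM's PID; may block if the server itself is wedged
$ sudo jcmd "$SERVER_PID" Thread.print  # alternative attach path to jstack for the same thread dump
$ sudo kill -QUIT "$SERVER_PID"         # asks HotSpot to append a thread dump to the server's jvm.out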
I was able to capture an strace as well as a Java thread dump on an instance where it's hung. We have a monorepo with multiple languages; specifically, we tag our Python tests with cpu, gpu_memory, and memory resource tags and leave the rest of the tests untagged (since those tests aren't as hefty):
test:ci --local_resources=gpu_memory_mb=15360 --local_resources=memory=HOST_RAM*0.6
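For context, a minimal sketch of what the tagged tests look like (target name, sources, and amounts are illustrative, not from the actual repo). As far as I understand, Bazel's resources:<name>:<amount> test tags reserve the resources declared via --local_resources above, alongside the standard cpu:<n> tag:
py_test(
    name = "gpu_inference_test",          # hypothetical heavy Python test
    srcs = ["gpu_inference_test.py"],
    tags = [
        "cpu:4",                          # reserve 4 CPU cores while this test runs
        "resources:memory:8192",          # reserve 8192 MB of the memory resource
        "resources:gpu_memory_mb:4096",   # reserve 4096 MB of the custom gpu_memory_mb resource
    ],
)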
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
I don't have a minimal repro that I can openly share (it's my day job's monorepo).
Which operating system are you running Bazel on?
Ubuntu 20.04
What is the output of bazel info release?
release 7.4.1
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
Any other information, logs, or outputs that you want to share?
We're using local execution with a remote cache (gRPC).
Bazel rc flags:
Inherited 'common' options: --experimental_repository_cache_urls_as_default_canonical_id --watchfs --@io_bazel_rules_docker//transitions:enable=false --ui_actions_shown=32 --experimental_remote_cache_eviction_retries=5 --experimental_remote_cache_lease_extension --noexperimental_inmemory_dotd_files --experimental_worker_for_repo_fetching=off --experimental_sandbox_async_tree_delete_idle_threads=0 --incompatible_default_to_explicit_init_py --incompatible_allow_tags_propagation --experimental_cc_shared_library --heap_dump_on_oom
Inherited 'build' options: --output_filter=^// --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --compilation_mode=opt --host_compilation_mode=opt --interface_shared_objects --use_top_level_targets_for_symlinks=false --java_runtime_version=remotejdk_11 --@rules_rust//rust/settings:experimental_use_cc_common_link=true --@rules_cuda//cuda:runtime=//third_party:cuda_runtime --@rules_cuda//cuda:archs=compute_61:sm_61;compute_70:sm_70;compute_75:sm_75;compute_80:sm_80,compute_80 --@rules_cuda//cuda:copts=--std=c++17 --incompatible_strict_action_env=true --incompatible_enable_cc_toolchain_resolution --sandbox_base=/dev/shm --sandbox_tmpfs_path=/tmp --workspace_status_command=tools/get_workspace_status --action_env CACHE_EPOCH=1673041430 --flag_alias=python_flag=//rules:python_flags --flag_alias=python_monitor_flag=//rules:python_monitor_flag --flag_alias=use_repo_bridge_binary=//waabi/onboard/bin/bridge:enabled --aspects=@rules_rust//rust:defs.bzl%rust_clippy_aspect --experimental_repository_cache_hardlinks --nobuild
We are seeing something similar. Using BuildBuddy we can see the box stops using resources after the tests run and just hangs for up to 30 minutes before completing successfully. This only occurs in CI/CD (GitHub Actions hosted runners).
We figured out this was related to running on GitHub Actions Enterprise ARM runners. Switching to x64 runners resolved the hangs.