Bazel clean --expunge or Bazel shutdown unable to kill stale bazel processes
Description of the problem:
bazel build or bazel query leaves a stale bazel process behind even after the build/query has completed. This prevents future invocations of other bazel commands.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
In a Tekton task we run the following commands:
- bazel query //... (or a list of targets)
- Once the query has completed, we still see a bazel process and its child processes running:
jenkins 2064 1 47 04:49 ? 00:03:07 bazel(directory) -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8 --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/embedded_tools/jdk/lib/jli:/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/embedded_tools/jdk/lib:/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/embedded_tools/jdk/lib/server:/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/ -Dfile.encoding=ISO-8859-1 -jar /home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/A-server.jar --max_idle_secs=10800 --noshutdown_on_low_sys_mem --connect_timeout_secs=120 --output_user_root=/home/jenkins/.cache/bazel/_bazel_jenkins --install_base=/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b --install_md5=ba7765e6f39a679257358196b530585b --output_base=/home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8 --workspace_directory=/home/jenkins/13518/directory --default_system_javabase=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64 --failure_detail_out=/home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/failure_detail.rawproto --deep_execroot --expand_configs_in_place --idle_server_tasks --write_command_log --nowatchfs --nofatal_event_bus_exceptions --nowindows_enable_symlinks --client_debug=false --product_name=Bazel --noincompatible_enable_execution_transition 
--option_sources=connect_Utimeout_Usecs:/home/jenkins/13518/directory/.bazelrc:max_Uidle_Usecs:/home/jenkins/13518/directory/.bazelrc
jenkins 11863 2064 62 04:52 ? 00:02:13 /home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/execroot/com_ibm_monorepo/external/remotejdk11_linux/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --patch-module=java.compiler=external/remote_java_tools_linux/java_tools/java_compiler.jar --patch-module=jdk.compiler=external/remote_java_tools_linux/java_tools/jdk_compiler.jar --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -jar external/remote_java_tools_linux/java_tools/JavaBuilder_deploy.jar --persistent_worker
jenkins 11866 2064 51 04:52 ? 00:01:50 /home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/execroot/com_ibm_monorepo/external/remotejdk11_linux/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --patch-module=java.compiler=external/remote_java_tools_linux/java_tools/java_compiler.jar --patch-module=jdk.compiler=external/remote_java_tools_linux/java_tools/jdk_compiler.jar --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -jar external/remote_java_tools_linux/java_tools/JavaBuilder_deploy.jar --persistent_worker
jenkins 11877 2064 66 04:52 ? 00:02:21 /home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/execroot/com_ibm_monorepo/external/remotejdk11_linux/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --patch-module=java.compiler=external/remote_java_tools_linux/java_tools/java_compiler.jar --patch-module=jdk.compiler=external/remote_java_tools_linux/java_tools/jdk_compiler.jar --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -jar external/remote_java_tools_linux/java_tools/JavaBuilder_deploy.jar --persistent_worker
jenkins 11879 2064 59 04:52 ? 00:02:07 /home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/execroot/com_ibm_monorepo/external/remotejdk11_linux/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --patch-module=java.compiler=external/remote_java_tools_linux/java_tools/java_compiler.jar --patch-module=jdk.compiler=external/remote_java_tools_linux/java_tools/jdk_compiler.jar --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -jar external/remote_java_tools_linux/java_tools/JavaBuilder_deploy.jar --persistent_worker
jenkins 16288 1993 0 04:56 ? 00:00:00 grep bazel
We are unable to stop these processes. As per this thread, we added a
bazel shutdown
step, but it didn't shut down any of them. We got this error:
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
INFO: Waited 60 seconds for server process (pid=2064) to terminate.
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
INFO: Waited 10 seconds for server process (pid=2064) to terminate.
FATAL: Attempted to kill stale server process (pid=2064) using SIGKILL, but it did not die in a timely fashion.
The bazel clean --expunge also shows the same error.
What operating system are you running Bazel on?
Red Hat 7.9 Docker container running in a K8s pod (as a Tekton task)
What's the output of bazel info release?
Extracting Bazel installation... Starting local Bazel server and connecting to it... release 3.2.0
If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.
NA
What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?
Is it required?
Have you found anything relevant by searching the web
Followed this thread and included the bazel shutdown command, but it didn't stop the existing bazel processes.
Any other information, logs, or outputs that you want to share?
Will share further if required.
Do you have an easy way to reproduce this?
No other way I could see.
I am hitting the same problem. @uajith @meisterT
@clemente0420 in a different environment?
A CentOS Docker container on an Ubuntu 16.04 x86 host; it happens after running bazelisk clean.
@uajith @meisterT Found the problem: check for zombie processes on your host.
Ok, it works.. Looks like the zombie process is causing the problem.
Can you confirm whether you're still seeing real (non-zombie) Bazel processes that you can't get rid of?
If these are zombie processes, I think it would help to run an init process inside your container (with Docker we like to use docker run --init for this purpose).
I just ran into the same issue, also inside a container env that we use for all kinds of build processes, including bitbake, which has a client-server structure as well. Simply issuing bazel and then bazel shutdown exposes the issue; adding the otherwise unneeded --init to the container env works around it.
What makes only bazel stumble here? Can't this be resolved differently?
What does the server log say during the shutdown (use bazel info | grep server_log before shutdown to find the log)?
This is with bazel-bootstrap from Debian bullseye: java.log.90be0d0794f5.builder.log.java.20220110-172052.2484.txt
I don't find messages being added when shutdown is invoked, though.
It does indicate that it finishes the shutdown command within a fraction of a second. So it's not that something within Bazel itself is waiting forever during the shutdown command, but something about the state it ends up in.
I am having the same issue. Bazel doesn't exit after build
If you experience this please run jstack <pid> where <pid> is the PID of the Bazel server.
I am having the same issue in a container environment. I am trying to narrow down the repro steps. I am using the following steps:
- run bazel inside the container, with a customized output_base
- run bazel shutdown
- rerun step 1
- now I see logs like (ignore the stderr | prefix):
stderr | WARNING: Running Bazel server needs to be killed, because the startup options are different.
stderr | WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
stderr | WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
stderr | WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
stderr | INFO: Waited 60 seconds for server process (pid=3292) to terminate.
stderr | WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
stderr | WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
stderr | INFO: Waited 10 seconds for server process (pid=3292) to terminate.
stderr | FATAL: Attempted to kill stale server process (pid=3292) using SIGKILL, but it did not die in a timely fashion.
The process 3292 is in defunct state and I cannot use jstack to dump the stacktrace:
root 3292 143 0.0 0 0 ? Zs 16:22 0:44 [java] <defunct>
Any hints for this behavior?
I am using 5.1.1
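To tell a defunct process apart from a live server that is genuinely stuck, the state field in /proc/<pid>/stat can be read directly. The following is a minimal Python sketch, assuming a Linux /proc filesystem; the helper names state_from_stat and is_zombie are made up for this illustration and are not part of Bazel.

```python
from pathlib import Path


def state_from_stat(stat_text):
    """Return the one-letter process state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain spaces
    # or ')', so cut after the last ')' instead of splitting on whitespace.
    after_comm = stat_text[stat_text.rfind(")") + 1:]
    return after_comm.split()[0]


def is_zombie(pid):
    """True if pid is in state Z (defunct); False if it is alive or already gone."""
    try:
        text = Path(f"/proc/{pid}/stat").read_text()
    except (FileNotFoundError, ProcessLookupError):
        return False  # no such process (or no /proc on this system)
    return state_from_stat(text) == "Z"
```

With the PID from the log above, is_zombie(3292) would be the check to run. A process in state Z has already exited and only its process-table entry remains, which is why jstack has nothing to attach to.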
It looks like my case (and probably any case where bazel runs inside a container) is related to this init process issue: https://github.com/kubernetes/kubernetes/issues/84210
Just hit the same problem in our CI/CD pipeline. The problem was indeed the lack of an init process / child reaper.
What happens:
- bazel shutdown or any bazel command that requires killing/restarting the bazel daemon will use kill($serverPid) to terminate the server.
- In a container, be it k8s or plain docker, if PID 1 is not a process that will reap children (eg, waitpid for any child that dies), the bazel daemon with $serverPid will remain as a zombie once killed. From the OS point of view, the process with $serverPid will keep existing, both as a PID and as a file in /proc/$serverPid, until a parent waitpids on it.
- As per the code in src/main/cpp/blaze_util_posix.cc, the bazel command trying to kill the bazel server keeps sending kill -TERM $serverPid or kill -9 $serverPid until the pid goes away from /proc/$serverPid or until kill($serverPid, 0) returns an error (depending on platform).
- Given that there is no child reaper and no init process, the zombie sticks around forever, the pid never goes away, and the command trying to kill bazel thinks the process is still running until it eventually times out with the error in this bug.
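The zombie behaviour itself is easy to reproduce outside of Bazel. The sketch below is not Bazel's code; it is a small Python illustration, assuming Linux, of why a "does the PID still exist" check keeps succeeding for a dead child until its parent calls waitpid. The names pid_exists and demo_zombie are hypothetical.

```python
import os


def pid_exists(pid):
    """True if signal 0 can be delivered, i.e. the PID still has a process-table entry."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def demo_zombie():
    """Fork a child that exits at once; report pid_exists before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    # Block until the child has exited, but leave it unreaped (WNOWAIT),
    # so at this point it is a zombie.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    before = pid_exists(pid)  # the zombie still occupies its PID
    os.waitpid(pid, 0)        # the parent reaps the child
    after = pid_exists(pid)   # now the PID is gone
    return before, after
```

Calling demo_zombie() should show the PID as existing before the waitpid call and as gone afterwards. In a container without a reaping PID 1, nothing ever performs that waitpid for the killed Bazel server, so the check never flips.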
Solution/fix: in your container, use an entrypoint that does child reaping. Eg, have PID 1 be /bin/docker-init, /sbin/init, or custom code. Alternatively, run something in the container that does child reaping via PR_SET_CHILD_SUBREAPER, like /bin/docker-init -s.
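For the "custom code" option, the following is a rough Python sketch of a wrapper that registers itself as a child subreaper and reaps whatever gets reparented to it while a command runs. It assumes Linux >= 3.4 and a libc reachable through ctypes; the constant value 36 comes from linux/prctl.h, and the function names (become_subreaper, exit_code_from, run_and_reap) are invented for this example. It does not forward signals or handle the other duties of a real init, so treat it as an illustration of the mechanism rather than a replacement for an existing init.

```python
import ctypes
import os

PR_SET_CHILD_SUBREAPER = 36  # from <linux/prctl.h>, available on Linux >= 3.4


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1),
                    ctypes.c_ulong(0), ctypes.c_ulong(0), ctypes.c_ulong(0))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def exit_code_from(status):
    """Translate a wait status into a shell-style exit code."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return 128 + os.WTERMSIG(status)


def run_and_reap(argv):
    """Run argv and reap every child, including adopted orphans, until argv exits."""
    become_subreaper()
    main_pid = os.spawnvp(os.P_NOWAIT, argv[0], argv)
    while True:
        pid, status = os.wait()  # blocks until any child or adopted orphan exits
        if pid == main_pid:
            code = exit_code_from(status)
            break
    # Sweep zombies that are already waiting, without blocking on live orphans.
    while True:
        try:
            pid, _ = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # the remaining children are still running
    return code
```

A hypothetical entry script could end with sys.exit(run_and_reap(sys.argv[1:])), so that a daemonized Bazel server killed by a later bazel shutdown is reaped by this wrapper instead of lingering as a zombie.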
Is it possible to run a "sleep" and kill any pending bazel processes ?
So how do we solve this problem? My CI/CD system uses Jenkins with the Kubernetes plugin.
The process of my jenkins agent looks like this
I ended up getting the following error.
+ make build-release
bin/bazel clean --expunge
(07:04:55) INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
(07:04:55) INFO: Clean command is running, shutting down worker pool...
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
INFO: Waited 60 seconds for server process (pid=44) to terminate.
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
INFO: Waited 10 seconds for server process (pid=44) to terminate.
FATAL: Attempted to kill stale server process (pid=44) using SIGKILL, but it did not die in a timely fashion.
make: *** [build-release] Error 36
We've got the same issue while using Bazel in GitHub Actions to build/test a large C++ code base: we sometimes see the "Stop containers" step taking minutes or even >1h / running into a timeout after both bazel build and bazel test have already completed successfully. Usually "Stop containers" takes a few seconds, and this only happens maybe every 100 or 200 runs.
After finding this issue we've added '--init' to the 'docker run' invocation. That seems to fix the problem. However it's unclear what and why this was happening in the first place. Our legacy CMake build never caused the "Stop containers" step to hang.
See the explanation in https://github.com/bazelbuild/bazel/issues/13823#issuecomment-1247177037 above. TL;DR: bazel shutdown waits for the PID to disappear, but on any Unix system, if there is no child reaper (eg, init, or something doing waitpid on the dead daemon), the PID keeps existing so that the status code, error state, etc. are not lost. From the documentation, it looks like --init in docker starts an init, which does child reaping for zombies.
We are having the same issue running inside a Jenkins Docker container with docker exec on Amazon Linux 2023. I don't think we can change the Jenkins agent configuration or configure docker inside the Jenkins pipeline (Jenkinsfile) to add --init. Unfortunately the Python package tink uses bazel, and with this issue open for the past 3 years it is really hard to build wheels ourselves. It should be possible to run bazel without server/daemon mode, or it should at least work with Docker as of 2024 out of the box.
running build_ext
bazel clean --expunge
Starting local Bazel server and connecting to it...
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
INFO: Waited 60 seconds for server process (pid=292) to terminate.
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
INFO: Waited 10 seconds for server process (pid=292) to terminate.
FATAL: Attempted to kill stale server process (pid=292) using SIGKILL, but it did not die in a timely fashion.
error: command '/usr/bin/bazel' failed with exit code 36
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for tink
Running setup.py clean for tink
Failed to build tink
ERROR: Failed to build one or more wheels
[Pipeline] }
I am not sure whether this should be Bazel's business.
Besides docker --init there are other ways to spawn a lightweight init in the container, e.g. https://github.com/phusion/baseimage-docker/blob/rel-0.9.16/image/bin/my_init or https://github.com/Yelp/dumb-init.
I don't know, but with other build systems we don't have this problem, so it seems to be specific to how bazel works. I managed to work around it by running with --init, which meant jumping through some hoops because I needed to involve the infra team, and now it works with the provided workaround.
https://docs.docker.com/config/containers/multi-service_container/
The container's main process is responsible for managing all processes that it starts.
In some cases, the main process isn't well-designed, and doesn't handle "reaping"
(stopping) child processes gracefully when the container exits. If your process falls
into this category, you can use the --init option when you run the container.
So it would be nice to have this link in the Bazel Docker docs, or to fix the issue and handle child processes better.
Just a note: if you're trying to use [tini](https://github.com/krallin/tini) as the init process (Docker uses it for their own official images), make sure to run tini with the -s flag ("-s: Register as a process subreaper (requires Linux >= 3.4)."); otherwise just running tini -- won't be enough for this Bazel issue.
/usr/bin/tini -s -- your_ci_or_bazel_command
Also note that an ENTRYPOINT set in the Dockerfile gets overridden by the Kubernetes command, so make sure to wrap the command with the init manager in such a case.
Hope this saves someone's time :)