Bazel clean --expunge or Bazel shutdown unable to kill stale bazel processes

Open uajith opened this issue 4 years ago • 24 comments

Description of the problem:

A bazel build or bazel query leaves a stale bazel process running even after the build/query has completed. This prevents future invocations of other bazel commands.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

In a Tekton task, we run the following commands:

  1. bazel query //... (or a list of targets)
  2. Once the query has completed, a bazel process and its child processes are still running:
    jenkins    2064      1 47 04:49 ?        00:03:07 bazel(directory) -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8 --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -Xverify:none -Djava.util.logging.config.file=/home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/javalog.properties -Dcom.google.devtools.build.lib.util.LogHandlerQuerier.class=com.google.devtools.build.lib.util.SimpleLogHandler$HandlerQuerier -XX:-MaxFDLimit -Djava.library.path=/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/embedded_tools/jdk/lib/jli:/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/embedded_tools/jdk/lib:/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/embedded_tools/jdk/lib/server:/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/ -Dfile.encoding=ISO-8859-1 -jar /home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b/A-server.jar --max_idle_secs=10800 --noshutdown_on_low_sys_mem --connect_timeout_secs=120 --output_user_root=/home/jenkins/.cache/bazel/_bazel_jenkins --install_base=/home/jenkins/.cache/bazel/_bazel_jenkins/install/ba7765e6f39a679257358196b530585b --install_md5=ba7765e6f39a679257358196b530585b --output_base=/home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8 --workspace_directory=/home/jenkins/13518/directory --default_system_javabase=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64 --failure_detail_out=/home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/failure_detail.rawproto --deep_execroot --expand_configs_in_place --idle_server_tasks --write_command_log --nowatchfs --nofatal_event_bus_exceptions --nowindows_enable_symlinks --client_debug=false --product_name=Bazel --noincompatible_enable_execution_transition 
--option_sources=connect_Utimeout_Usecs:/home/jenkins/13518/directory/.bazelrc:max_Uidle_Usecs:/home/jenkins/13518/directory/.bazelrc
jenkins   11863   2064 62 04:52 ?        00:02:13 /home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/execroot/com_ibm_monorepo/external/remotejdk11_linux/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --patch-module=java.compiler=external/remote_java_tools_linux/java_tools/java_compiler.jar --patch-module=jdk.compiler=external/remote_java_tools_linux/java_tools/jdk_compiler.jar --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -jar external/remote_java_tools_linux/java_tools/JavaBuilder_deploy.jar --persistent_worker
jenkins   11866   2064 51 04:52 ?        00:01:50 /home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/execroot/com_ibm_monorepo/external/remotejdk11_linux/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --patch-module=java.compiler=external/remote_java_tools_linux/java_tools/java_compiler.jar --patch-module=jdk.compiler=external/remote_java_tools_linux/java_tools/jdk_compiler.jar --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -jar external/remote_java_tools_linux/java_tools/JavaBuilder_deploy.jar --persistent_worker
jenkins   11877   2064 66 04:52 ?        00:02:21 /home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/execroot/com_ibm_monorepo/external/remotejdk11_linux/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --patch-module=java.compiler=external/remote_java_tools_linux/java_tools/java_compiler.jar --patch-module=jdk.compiler=external/remote_java_tools_linux/java_tools/jdk_compiler.jar --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -jar external/remote_java_tools_linux/java_tools/JavaBuilder_deploy.jar --persistent_worker
jenkins   11879   2064 59 04:52 ?        00:02:07 /home/jenkins/.cache/bazel/_bazel_jenkins/41b4626fb6512837d24f630cb1632ba8/execroot/com_ibm_monorepo/external/remotejdk11_linux/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings --add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED --add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --patch-module=java.compiler=external/remote_java_tools_linux/java_tools/java_compiler.jar --patch-module=jdk.compiler=external/remote_java_tools_linux/java_tools/jdk_compiler.jar --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED -jar external/remote_java_tools_linux/java_tools/JavaBuilder_deploy.jar --persistent_worker
jenkins   16288   1993  0 04:56 ?        00:00:00 grep bazel

We are unable to stop these processes. As per this, we added a

bazel shutdown

That didn't shut down any of them. We got this error:

WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
INFO: Waited 60 seconds for server process (pid=2064) to terminate.
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
INFO: Waited 10 seconds for server process (pid=2064) to terminate.
FATAL: Attempted to kill stale server process (pid=2064) using SIGKILL, but it did not die in a timely fashion.

The bazel clean --expunge also shows the same error.

What operating system are you running Bazel on?

Redhat 7.9 Docker container running in K8S pod (as a Tekton task)

What's the output of bazel info release?

Extracting Bazel installation... Starting local Bazel server and connecting to it... release 3.2.0

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

NA

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

Is it required?

Have you found anything relevant by searching the web

Followed this thread. Included the

bazel shutdown  

command, but it didn't stop the existing bazel processes.

Any other information, logs, or outputs that you want to share?

Will share further if required.

uajith avatar Aug 10 '21 05:08 uajith

Do you have an easy way to reproduce this?

meisterT avatar Aug 31 '21 11:08 meisterT

No other way I could see.

uajith avatar Aug 31 '21 12:08 uajith

I've run into the same problem @uajith @meisterT

clemente0731 avatar Aug 31 '21 16:08 clemente0731

@clemente0420 in a different environment?

meisterT avatar Aug 31 '21 16:08 meisterT

@clemente0420 in a different environment?

A CentOS Docker container on an Ubuntu 16.04 x86 host. It happens after running bazelisk clean.

clemente0731 avatar Sep 01 '21 01:09 clemente0731

@uajith @meisterT found the problem: check for zombie processes on your host.

clemente0731 avatar Sep 01 '21 03:09 clemente0731

Ok, it works.. Looks like the zombie process is causing the problem.

uajith avatar Sep 01 '21 03:09 uajith

Can you confirm whether you're still seeing real (non-zombie) Bazel processes that you can't get rid of?

If these are zombie processes, I think it would help to run an init process inside your container (with Docker we like to use docker run --init for this purpose).

philwo avatar Sep 09 '21 12:09 philwo

I just ran into the same issue, also inside a container env that we use for all kinds of build processes, including bitbake, which has a client-server structure as well. Simply issuing bazel and then bazel shutdown exposes the issue; adding the otherwise unneeded --init to the container env works around it.

What makes only bazel stumble here? Can't this be resolved differently?

jan-kiszka avatar Jan 04 '22 18:01 jan-kiszka

What does the server log say during the shutdown (use bazel info | grep server_log before shutdown to find the log)?
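For anyone scripting this check, here is a minimal Python sketch of looking up the log path. It assumes bazel is on PATH and that bazel info prints one "key: value" pair per line; parse_bazel_info and server_log_path are hypothetical helper names, not part of Bazel.

```python
import subprocess


def parse_bazel_info(text):
    """Turn `bazel info` output ("key: value" per line) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not key/value pairs
            info[key.strip()] = value.strip()
    return info


def server_log_path(bazel="bazel"):
    """Ask the running Bazel server for its log path (call this before shutdown)."""
    result = subprocess.run(
        [bazel, "info"], check=True, capture_output=True, text=True
    )
    return parse_bazel_info(result.stdout).get("server_log")
```

The lookup has to happen before bazel shutdown, because running bazel info afterwards would start a fresh server with a new log.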

larsrc-google avatar Jan 10 '22 17:01 larsrc-google

This is with bazel-bootstrap from Debian bullseye: java.log.90be0d0794f5.builder.log.java.20220110-172052.2484.txt

I don't find messages being added when shutdown is invoked, though.

jan-kiszka avatar Jan 10 '22 17:01 jan-kiszka

It does indicate that it finishes the shutdown command within a fraction of a second. So it's not that something within Bazel itself is waiting forever during the shutdown command, but something about the state it ends up in.

larsrc-google avatar Jan 10 '22 17:01 larsrc-google

I am having the same issue. Bazel doesn't exit after build

aminya avatar Mar 14 '22 22:03 aminya

If you experience this please run jstack <pid> where <pid> is the PID of the Bazel server.

meisterT avatar Mar 15 '22 06:03 meisterT

I am having the same issue in a container environment. I am trying to narrow down the repro steps. I am using the following steps:

  1. run bazel inside the container, with customized output_base
  2. run bazel shutdown
  3. rerun step 1
  4. now I see logs like (ignore the stderr | prefix):
stderr |  WARNING: Running Bazel server needs to be killed, because the startup options are different.
stderr |  WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
stderr |  WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
stderr |  WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
stderr |  INFO: Waited 60 seconds for server process (pid=3292) to terminate.
stderr |  WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
stderr |  WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
stderr |  INFO: Waited 10 seconds for server process (pid=3292) to terminate.
stderr |  FATAL: Attempted to kill stale server process (pid=3292) using SIGKILL, but it did not die in a timely fashion.

The process 3292 is in defunct state and I cannot use jstack to dump the stacktrace:

root      3292  143  0.0      0     0 ?        Zs   16:22   0:44 [java] <defunct>

Any hints for this behavior?

I am using 5.1.1
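A process shown as Zs / <defunct> has already exited, so there is no JVM left for jstack to attach to. As a rough sketch (Linux only, reading /proc), the state can be checked from a script like this; process_state, is_defunct and spawn_unreaped_child are hypothetical helper names used only for illustration.

```python
import os


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ")", so split on the last ")" before reading the state field.
    return stat.rsplit(")", 1)[1].split()[0]


def is_defunct(pid):
    """True if the process has exited but has not been reaped by its parent."""
    return process_state(pid) == "Z"


def spawn_unreaped_child():
    """Fork a child that exits at once and wait, without reaping, until it is a zombie."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    # WNOWAIT blocks until the child has exited but leaves it unreaped.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    return pid
```

Calling is_defunct on the server PID from the FATAL message tells you whether you are looking at a zombie (a missing reaper) or at a live server that is genuinely stuck.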

bcho avatar Apr 10 '22 16:04 bcho

Looks like my case (or any case where you are running bazel inside a container) is related to this init process issue: https://github.com/kubernetes/kubernetes/issues/84210

bcho avatar Apr 10 '22 16:04 bcho

Just hit the same problem in our CI/CD pipeline. The problem was indeed the lack of an init process / child reaper.

What happens:

  1. bazel shutdown, or any bazel command that requires killing/restarting the bazel daemon, will use kill($serverPid) to terminate the server.
  2. In a container, be it k8s or plain Docker, if PID 1 is not a process that reaps children (e.g., calls waitpid for any child that dies), the bazel daemon with $serverPid will remain as a zombie once killed. From the OS point of view, the process with $serverPid keeps existing, both as a PID and as a file in /proc/$serverPid, until a parent calls waitpid on it.
  3. As per the code in src/main/cpp/blaze_util_posix.cc, the bazel command trying to kill the bazel server keeps sending kill -TERM $serverPid or kill -9 $serverPid until the PID goes away from /proc/$serverPid or until kill($serverPid, 0) returns an error (depending on platform).
  4. Given that there is no child reaper and no init process, the zombie sticks around forever, the PID never goes away, and the command trying to kill bazel thinks the process is still running until it eventually times out with the error in this bug.
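The effect described in the steps above can be reproduced without Bazel. The following is a minimal Python sketch, not Bazel's actual code: pid_exists and demo_zombie are hypothetical names, and the check uses signal 0 in the same spirit as kill($serverPid, 0).

```python
import os
import signal


def pid_exists(pid):
    """Signal 0 delivers nothing; it only checks that the PID is still allocated."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the PID exists but belongs to another user
    return True


def demo_zombie():
    """Return (exists_before_reap, exists_after_reap) for a killed child."""
    pid = os.fork()
    if pid == 0:
        signal.pause()  # child: block until a signal arrives
        os._exit(0)
    os.kill(pid, signal.SIGKILL)
    # Wait until the child is dead, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    os.kill(pid, signal.SIGKILL)  # a second SIGKILL changes nothing for a zombie
    before = pid_exists(pid)
    os.waitpid(pid, 0)  # this is the step a child reaper performs
    after = pid_exists(pid)
    return before, after
```

In demo_zombie, the first value reports the PID as still present while the killed child is an unreaped zombie, and the second reports it as gone once waitpid has collected it. No amount of extra SIGKILLs replaces that waitpid call.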

Solution/fix: in your container, use an entrypoint that does child reaping. Eg, have PID 1 be /bin/docker-init, /sbin/init, or custom code. Alternatively, run something in the container that does child reaping via PR_SET_CHILD_SUBREAPER, like /bin/docker-init -s.
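To illustrate what the subreaper option buys you, here is a hedged Python sketch (Linux >= 3.4 only). It is not how docker-init is implemented; become_subreaper and demo_subreaper are hypothetical names, and the constant 36 is the value of PR_SET_CHILD_SUBREAPER in <linux/prctl.h>. The demo double-forks, so the grandchild is orphaned the way a daemonized server would be, and then shows that the subreaper can collect it.

```python
import ctypes
import os

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>


def become_subreaper():
    """Mark this process as the reaper for orphaned descendants."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl.restype = ctypes.c_int
    libc.prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
    if libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def demo_subreaper():
    """Orphan a grandchild via double fork and reap it from this process."""
    become_subreaper()
    r, w = os.pipe()
    child = os.fork()
    if child == 0:
        os.close(r)
        grandchild = os.fork()
        if grandchild == 0:
            os.close(w)
            os._exit(0)  # grandchild: exits right away
        os.write(w, str(grandchild).encode())
        os._exit(0)  # intermediate child exits, orphaning the grandchild
    os.close(w)
    grandchild = int(os.read(r, 64))
    os.close(r)
    os.waitpid(child, 0)  # once this returns, the orphan has been reparented to us
    reaped, _ = os.waitpid(grandchild, 0)  # ChildProcessError without subreaper status
    return reaped == grandchild
```

Without the prctl call, the orphaned grandchild would be reparented to PID 1 instead, and if PID 1 never calls waitpid, the zombie stays.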

ccontavalli avatar Sep 14 '22 18:09 ccontavalli

Is it possible to run a "sleep" and then kill any pending bazel processes?

uajith avatar Oct 11 '22 08:10 uajith

So how do we solve this problem? I'm using a CI/CD system built on Jenkins with the Kubernetes plugin.

The processes of my Jenkins agent look like this:

image

I ended up getting the following error.

+ make build-release
bin/bazel clean --expunge
(07:04:55) INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
(07:04:55) INFO: Clean command is running, shutting down worker pool...
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
INFO: Waited 60 seconds for server process (pid=44) to terminate.
WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
INFO: Waited 10 seconds for server process (pid=44) to terminate.
FATAL: Attempted to kill stale server process (pid=44) using SIGKILL, but it did not die in a timely fashion.
make: *** [build-release] Error 36

JokerDevops avatar Oct 12 '23 09:10 JokerDevops

We've got the same issue while using Bazel in GitHub Actions to build/test a large C++ code base: we sometimes see the "Stop containers" step taking minutes or even >1h / running into a timeout after both bazel build and bazel test have already completed successfully. Usually "Stop containers" takes a few seconds, and this only happens maybe every 100 or 200 runs.

After finding this issue we've added '--init' to the 'docker run' invocation. That seems to fix the problem. However it's unclear what and why this was happening in the first place. Our legacy CMake build never caused the "Stop containers" step to hang.

nagelp-bosch avatar Nov 08 '23 16:11 nagelp-bosch

After finding this issue we've added '--init' to the 'docker run' invocation. That seems to fix the problem. However it's unclear what and why this was happening in the first place. Our legacy CMake build never caused the "Stop containers" step to hang.

See the explanation in https://github.com/bazelbuild/bazel/issues/13823#issuecomment-1247177037 above. TL;DR: bazel shutdown waits for the PID to disappear, but on any Unix system, if there is no "child reaper" (e.g., init, or something calling waitpid on the dead daemon), the PID keeps existing (so that the status code, error state, etc. are not lost). From the documentation, it looks like --init in Docker starts an init process, which reaps zombie children.

ccontavalli avatar Nov 08 '23 17:11 ccontavalli

We are having the same issue running inside a Jenkins Docker container with docker exec on Amazon Linux 2023. I don't think we can change the Jenkins agent configuration or configure Docker inside the Jenkins pipeline (Jenkinsfile) to add --init. Unfortunately, the Python package tink uses bazel, and with this issue open for the past 3 years it is really hard to build wheels ourselves. It should be possible to run bazel without server/daemon mode, or it should at least be compatible with Docker in 2024 out of the box.

      running build_ext
      bazel clean --expunge
      Starting local Bazel server and connecting to it...
      INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
      WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 60)
      WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 60)
      WARNING: Waiting for server process to terminate (waited 30 seconds, waiting at most 60)
      INFO: Waited 60 seconds for server process (pid=292) to terminate.
      WARNING: Waiting for server process to terminate (waited 5 seconds, waiting at most 10)
      WARNING: Waiting for server process to terminate (waited 10 seconds, waiting at most 10)
      INFO: Waited 10 seconds for server process (pid=292) to terminate.
      FATAL: Attempted to kill stale server process (pid=292) using SIGKILL, but it did not die in a timely fashion.
      error: command '/usr/bin/bazel' failed with exit code 36
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tink
  Running setup.py clean for tink
Failed to build tink
ERROR: Failed to build one or more wheels
[Pipeline] }

matejsp avatar May 12 '24 17:05 matejsp

I am not sure whether this should be Bazel's business.

Next to docker --init there are other ways to spawn a lightweight init in the container, e.g. https://github.com/phusion/baseimage-docker/blob/rel-0.9.16/image/bin/my_init or https://github.com/Yelp/dumb-init.

meisterT avatar May 13 '24 06:05 meisterT

I don't know, but with other build systems we don't have this problem, so it seems to be specific to how bazel works. I managed to work around it with --init, though I had to jump through some hoops because I needed to involve the infra team. It now works with the provided workaround.

https://docs.docker.com/config/containers/multi-service_container/

The container's main process is responsible for managing all processes that it starts. 
In some cases, the main process isn't well-designed, and doesn't handle "reaping"
(stopping) child processes gracefully when the container exits. If your process falls 
into this category, you can use the --init option when you run the container.

So it would be nice to have this link in the bazel Docker docs, or to fix the issue and handle child processes better.

matejsp avatar May 13 '24 08:05 matejsp

Just a note: if you're trying to use [tini](https://github.com/krallin/tini) as the init process (Docker uses it for their own official images), make sure to run tini with the -s flag ("-s: Register as a process subreaper (requires Linux >= 3.4)."). Otherwise, just running tini -- won't be enough for this Bazel issue.

/usr/bin/tini -s -- your_ci_or_bazel_command

Also note that an ENTRYPOINT set in the Dockerfile gets overridden by the Kubernetes command, so make sure to wrap the command with the init manager in that case.
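If the pod spec is generated from code, the wrapping can be done in one place. This is only a sketch: wrap_with_tini and container_spec are hypothetical helpers, the /usr/bin/tini path is taken from the command above, and the dict mirrors the name, image and command fields of a Kubernetes container entry.

```python
def wrap_with_tini(command, tini_path="/usr/bin/tini"):
    """Prefix a command with tini in subreaper mode (-s)."""
    if not command:
        raise ValueError("command must not be empty")
    return [tini_path, "-s", "--", *command]


def container_spec(name, image, command):
    """Build a container entry whose command runs under tini."""
    return {"name": name, "image": image, "command": wrap_with_tini(command)}
```

Keeping the wrapper in a helper means every step that overrides the container command still runs under a reaper.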

Hope this saves someone's time :)

artem-zinnatullin avatar Oct 10 '24 17:10 artem-zinnatullin