tilt-extensions
tilt-extensions copied to clipboard
`restart_process` restart-helper leaks zombie processes
root@api-5dd8c8769-nwdv7:/# ps aux wwwf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 49 0.0 0.0 4188 3332 pts/0 Ss 14:30 0:00 bash
root 80 0.0 0.0 8088 3940 pts/0 R+ 14:31 0:00 \_ ps aux wwwf
root 1 0.0 0.0 703588 1664 ? Ssl 05:37 0:00 /tilt-restart-wrapper --watch_file=/tmp/.restart-proc sh -c /api-server
root 11 0.0 0.0 1220 104 ? S 05:37 0:00 /entr -rz sh -c /api-server
root 70 0.0 0.0 2576 860 ? S 14:30 0:00 \_ sh -c /api-server
root 71 0.1 0.4 1287044 72528 ? Sl 14:30 0:00 \_ /api-server
root 13 0.0 0.0 0 0 ? Z 05:37 0:02 [api-server] <defunct>
root 38 0.0 0.0 0 0 ? Z 14:29 0:00 [api-server] <defunct>
Seems like it might need a minimal init (tini, dumb-init, etc).
I noticed this because entr also doesn't reliably kill the process before starting a new one... perhaps this is because of sh, I'm not sure. That's a different issue.
hmmm...i was not able to reproduce this problem. here are the steps i tried:
- cd into tilt-extensions/restart_process/test
- Run
tilt up -f custom_deploy.Tiltfile - Clicked
test_updatea bunch of times - Exec'd into the pod and ran
ps aux wwwf - Saw that there was exactly one
start.sh, as i expected.
when i poked around in the entr repo, i found this issue - https://github.com/eradman/entr/pull/38 - which seems to indicate to me that it's at least expected for entr to kill the process
I'm running into a sort of similar situation where a process can be left around hanging as a zombie process.
I tried to create a reproducable testcase and came up with something which creates a zombie process. This is not supposed to be a useful or optimized usecase for anything, just commands put together to create a zombie process.
diff --git a/restart_process/test/Dockerfile.test b/restart_process/test/Dockerfile.test
index cfbeb9f..89e11d3 100644
--- a/restart_process/test/Dockerfile.test
+++ b/restart_process/test/Dockerfile.test
@@ -1,5 +1,7 @@
FROM alpine
+RUN apk add nginx
+
RUN echo 0 > restart_count.txt
ADD start.sh /
diff --git a/restart_process/test/start.sh b/restart_process/test/start.sh
index c9bc981..6277158 100755
--- a/restart_process/test/start.sh
+++ b/restart_process/test/start.sh
@@ -12,8 +12,13 @@ handle_sigterm() {
trap handle_sigterm SIGTERM
-while true
-do
- echo running
- sleep 5
+while pgrep -f nginx >/dev/null ; do
+ echo "waiting for nginx to shut down"
+ set +e
+ pkill -f nginx
+ set -e
+ sleep 3
done
+nginx
+
+tail -F /var/log/nginx/*
- Run
tilt up -f custom_deploy.Tiltfile - The restart should happen right away with the first start.
- The new run of the entrypoint script is left waiting for the previous instance of nginx to shut down, since it will remain there as a zombie process because tilt-restart-wrapper as pid 1 does not call waitpid on it.
Going into the container and running ps faux:
ps faux
PID USER TIME COMMAND
1 root 0:00 /tilt-restart-wrapper --watch_file=/tmp/.restart-proc sh -c /start.sh
17 root 0:00 /entr -rz sh -c /start.sh
25 root 0:00 [nginx]
62 root 0:00 {start.sh} /bin/sh /start.sh
78 root 0:00 sleep 3
79 root 0:00 sh
85 root 0:00 ps faux
The pid 25 is there as a zombie.
I assume the issue is tilt-restart-wrapper not reaping zombies, which should be a job of a process running as PID 1.
Thanks for the reproduction @samuliy. Until this is fixed, there is a simple workaround if this doesn't cause issues with your workload or sidecars:
{{- if .Values.tilt }}
# Fix zombie processes not stopping under Tilt
shareProcessNamespace: true
{{- end }}
@nicks can you remove the needs repro case label?