conmon leaves main container process in zombie state

Open Leo1003 opened this issue 3 months ago • 5 comments

I observed some strange behavior. We run Prometheus on our Kubernetes cluster, and it usually gets stuck when Kubernetes restarts the container.

CRI-O logs:

Mar 27 22:35:38 k8s-sys-w1 crio[1364]: time="2024-03-27 22:35:38.600423205+08:00" level=info msg="Stopping container: 03d1a6981d1eaa40c200f4555cd41f66e2567fc7bd7573dc979af67e22ce3bc6 (timeout: 600s)" id=c60a252b-7eb3-4337-96aa-398fb115db16 name=/runtime.v1.RuntimeService/StopContainer
Mar 27 22:45:38 k8s-sys-w1 crio[1364]: time="2024-03-27 22:45:38.615546831+08:00" level=warning msg="Stopping container 03d1a6981d1eaa40c200f4555cd41f66e2567fc7bd7573dc979af67e22ce3bc6 with stop signal timed out: timeout reached after 600 seconds waiting for container process to exit" id=c60a252b-7eb3-4337-96aa-398fb115db16 name=/runtime.v1.RuntimeService/StopContainer
Mar 27 22:47:38 k8s-sys-w1 crio[1364]: time="2024-03-27 22:47:38.945321438+08:00" level=info msg="Stopping container: 03d1a6981d1eaa40c200f4555cd41f66e2567fc7bd7573dc979af67e22ce3bc6 (timeout: 600s)" id=eb4ce11f-99d2-40d4-8048-68a9f918a943 name=/runtime.v1.RuntimeService/StopContainer
Mar 27 22:57:38 k8s-sys-w1 crio[1364]: time="2024-03-27 22:57:38.959805291+08:00" level=warning msg="Stopping container 03d1a6981d1eaa40c200f4555cd41f66e2567fc7bd7573dc979af67e22ce3bc6 with stop signal timed out: timeout reached after 600 seconds waiting for container process to exit" id=eb4ce11f-99d2-40d4-8048-68a9f918a943 name=/runtime.v1.RuntimeService/StopContainer

The prometheus process should be PID 1 in its PID namespace, so after it dies, the kernel should kill every remaining process in that namespace.

However, conmon leaves the prometheus process in a zombie state, so the container gets stuck.

$ sudo pstree -plTS 1859180
conmon(1859180)───prometheus(1859182,pid)
$ ps ufS 1859180 1859182
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     1859180  0.0  0.0  70152  2284 ?        Ss   Mar27   0:00 /usr/bin/conmon <omitted...>
ansible  1859182  0.0  0.0      0     0 ?        Zsl  Mar27   0:00  \_ [prometheus] <defunct>
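
The STAT value Zsl marks a zombie (Z) that is a session leader (s) and multi-threaded (l). One possibility consistent with this, though not confirmed here, is that the main thread has exited while other threads of the process are still alive; Linux does not let the parent reap a thread-group leader until every other thread is gone. A minimal sketch for checking this, assuming the per-thread files under /proc/<pid>/task are readable (the helper name thread_states and its proc_root parameter are made up for illustration):

```python
import os


def thread_states(pid, proc_root="/proc"):
    """Map each thread ID of `pid` to its one-letter state (R, S, D, Z, ...)."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    states = {}
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # the thread exited while we were scanning
        # Format is "tid (comm) state ..."; comm may contain spaces or ')',
        # so split on the last ')' rather than on whitespace.
        fields = stat.rpartition(")")[2].split()
        if fields:
            states[int(tid)] = fields[0]
    return states
```

Calling thread_states(1859182) on the affected node would show whether any thread other than the leader is in a non-Z state such as D (uninterruptible sleep).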

I used strace to see what conmon was doing while I sent a few SIGCHLD signals to it from another terminal.

$ sudo strace -p 1859180
strace: Process 1859180 attached
restart_syscall(<... resuming interrupted restart_syscall ...>) = 1
write(5, "\1\0\0\0\0\0\0\0", 8)         = 8
read(17, "\21\0\0\0\0\0\0\0\0\0\0\0\\\352(\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
wait4(-1, 0x7ffec7059b70, WNOHANG, NULL) = 0
write(5, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=5, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}], 7, -1) = 1 ([{fd=5, revents=POLLIN}])
read(5, "\2\0\0\0\0\0\0\0", 16)         = 8
poll([{fd=5, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}], 7, -1) = 1 ([{fd=17, revents=POLLIN}])
write(5, "\1\0\0\0\0\0\0\0", 8)         = 8
read(17, "\21\0\0\0\0\0\0\0\0\0\0\0>\3)\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
wait4(-1, 0x7ffec7059b70, WNOHANG, NULL) = 0
write(5, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=5, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}], 7, -1) = 1 ([{fd=5, revents=POLLIN}])
read(5, "\2\0\0\0\0\0\0\0", 16)         = 8
poll([{fd=5, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}], 7, -1
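
The 128-byte reads on fd 17 match the size of a Linux signalfd_siginfo record, so fd 17 is presumably conmon's signalfd (an assumption; the trace does not show how the fd was created). Under that assumption, the first buffer can be decoded with a short sketch like this one, which reads only the leading fields of the struct on a little-endian machine:

```python
import struct

# First 32 bytes of the first read(17, ...) above, copied from the strace
# output (strace prints non-printable bytes as octal escapes).
RAW = b"\21\0\0\0\0\0\0\0\0\0\0\0\\\352(\0" + b"\0" * 16


def decode_siginfo_prefix(buf):
    """Decode ssi_signo, ssi_errno, ssi_code, ssi_pid and ssi_uid."""
    signo, errno_, code, pid, uid = struct.unpack_from("<IiiII", buf)
    return {"signo": signo, "errno": errno_, "code": code, "pid": pid, "uid": uid}
```

decode_siginfo_prefix(RAW) gives signal 17 (SIGCHLD on x86_64 Linux), code 0 (SI_USER, a signal sent with kill), sender PID 2681436 and UID 0. That fits the manually sent signals rather than a kernel notification about the child.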

conmon does call wait4(), but the kernel returns 0. With WNOHANG, that means child processes exist but none of them is ready to be reaped.
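
To illustrate that return value in isolation, here is a minimal sketch that has nothing to do with conmon's actual code. It starts a sleeping child, polls it once with WNOHANG, then kills and reaps it. The function names are made up, and it waits on the specific child PID rather than -1 so that it cannot reap unrelated children:

```python
import os
import subprocess
import sys


def poll_child_once(pid):
    """Non-blocking wait: returns 0 if the child exists but is not waitable yet."""
    reaped_pid, _status, _rusage = os.wait4(pid, os.WNOHANG)
    return reaped_pid


def demo():
    # Child that stays alive long enough to be polled while still running.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        before = poll_child_once(child.pid)  # child still running
    finally:
        child.kill()
    returncode = child.wait()  # blocking wait reaps the killed child
    return before, returncode
```

The first element returned by demo() is 0 while the child is still running, the same result conmon gets in the trace above even though ps already shows prometheus as defunct.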

Some system information:

$ crun --version
crun version 1.9.2
commit: 35274d346d2e9ffeacb22cc11590b0266a23d634
rundir: /run/user/17247/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
$ conmon --version
conmon version 2.1.8
commit: 97585789fa36b1cfaf71f2d23b8add1a41f95f50
$ crictl -v
crictl version v1.26.0
$ uname -a
Linux k8s-sys-w1 4.18.0-526.el8.x86_64 #1 SMP Sat Nov 18 00:54:11 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Leo1003, Mar 29 '24 10:03