cri-dockerd
Defunct processes in containers that use exec probe
I'm hitting a problem similar to https://github.com/kubernetes/kubernetes/issues/81042. Exec liveness probes sometimes fail and leave behind defunct processes, which prevents further probes from running and makes commands such as `docker stats` hang.
I'm asking here because according to https://github.com/kubernetes/kubernetes/issues/81042#issuecomment-840057397, this is a dockershim-specific issue so Kubernetes decided not to fix it. Is there any chance of it being fixed here?
Output from `ps -faux` for a test container:
root 2851 0.0 0.1 712640 6556 ? Sl Oct12 1:14 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 04443ab3a9f762d864ca0090bb4f793034adc42a834556abc36755b961075a94 -address /run/containerd/containerd.sock
7347 2944 0.0 0.0 11692 1464 ? Ss Oct12 0:36 \_ bash /usr/bin/logger_script
7347 11644 0.0 0.0 4368 652 ? S 09:17 0:00 | \_ sleep 1
root 5949 0.0 0.1 852800 5792 ? Sl Oct13 0:00 \_ runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/04443ab3a9f762d864ca0090bb4f793034adc42a834556abc36755b961075a94/log.json --log-format json exec --process /tmp/runc-process62819214 --detach --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/04443ab3a9f762d864ca0090bb4f793034adc42a834556abc36755b961075a94/ce20bfc804d4d73bf888b8eaca552b89152143cfb748ce286dd13e97b94ad9e8.pid 04443ab3a9f762d864ca0090bb4f793034adc42a834556abc36755b961075a94
7347 5960 0.0 0.0 0 0 ? Zs Oct13 0:00 \_ [ls] <defunct>
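To check for the same symptom on other nodes without reading `ps` output by hand, something like the following could work. It is only a minimal sketch: it reads the state field from `/proc/<pid>/stat` and lists processes in state `Z`. The names `procStat`, `parseStat` and `findZombies` are made up for this example and are not part of cri-dockerd or Kubernetes.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// procStat holds the few fields of /proc/<pid>/stat that matter here.
// (Hypothetical type, for illustration only.)
type procStat struct {
	PID   int
	Comm  string
	State byte
	PPID  int
}

// parseStat parses one /proc/<pid>/stat line. The command name is wrapped in
// parentheses and may itself contain spaces or parentheses, so we split on
// the last ')' instead of splitting the whole line on whitespace.
func parseStat(line string) (procStat, error) {
	open := strings.IndexByte(line, '(')
	closeIdx := strings.LastIndexByte(line, ')')
	if open < 0 || closeIdx < open {
		return procStat{}, fmt.Errorf("malformed stat line: %q", line)
	}
	pid, err := strconv.Atoi(strings.TrimSpace(line[:open]))
	if err != nil {
		return procStat{}, err
	}
	// After the command name come the state and then the parent PID.
	rest := strings.Fields(line[closeIdx+1:])
	if len(rest) < 2 {
		return procStat{}, fmt.Errorf("truncated stat line: %q", line)
	}
	ppid, err := strconv.Atoi(rest[1])
	if err != nil {
		return procStat{}, err
	}
	return procStat{PID: pid, Comm: line[open+1 : closeIdx], State: rest[0][0], PPID: ppid}, nil
}

// findZombies scans procRoot (normally "/proc") for processes in state Z.
func findZombies(procRoot string) ([]procStat, error) {
	paths, err := filepath.Glob(filepath.Join(procRoot, "[0-9]*", "stat"))
	if err != nil {
		return nil, err
	}
	var zombies []procStat
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // the process may have exited between the glob and the read
		}
		st, err := parseStat(string(data))
		if err == nil && st.State == 'Z' {
			zombies = append(zombies, st)
		}
	}
	return zombies, nil
}

func main() {
	zombies, err := findZombies("/proc")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, z := range zombies {
		fmt.Printf("pid=%d ppid=%d comm=%s\n", z.PID, z.PPID, z.Comm)
	}
}
```

Run on the host, this should list an entry like the `[ls] <defunct>` process above together with its parent PID, which makes it easier to see which `runc exec` is stuck.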
Error message from `kubectl describe`:
Warning Unhealthy 3m54s (x320 over 31h) kubelet Liveness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Note that I've seen the same problem even in containers which have `tini` as their entrypoint. Also, this doesn't always happen; it seems to be more likely when there's more load on the VM, which supports my belief that this is related to https://github.com/containerd/containerd/issues/4255.
Thanks in advance!
Environment:
- CentOS 7 (kernel 5.4.217-1.el7.elrepo.x86_64)
- Kubernetes 1.25.2
- Docker 20.10.18
- containerd 1.6.8
- runc 1.1.4
- cri-dockerd 0.2.6
- network plugin: none (I only care about host networking for my application)
Let me read the commits that fixed this in other CRIs to get an idea of the effort required. The essential problem with all CRIs, really, is that nothing reaps zombies, and doing so means implementing a fairly significant part of what PID 1 normally handles.