
Defunct processes in containers that use exec probe

Open • matthewtorr-msft opened this issue 2 years ago • 1 comment

I'm hitting a problem similar to https://github.com/kubernetes/kubernetes/issues/81042. Exec liveness probes are liable to fail and leave behind defunct processes, which prevent further probes from running and make commands such as docker stats hang.

I'm asking here because, according to https://github.com/kubernetes/kubernetes/issues/81042#issuecomment-840057397, this is a dockershim-specific issue, so Kubernetes decided not to fix it. Is there any chance of it being fixed here?

Output from ps -faux for a test container:

root      2851  0.0  0.1 712640  6556 ?        Sl   Oct12   1:14 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 04443ab3a9f762d864ca0090bb4f793034adc42a834556abc36755b961075a94 -address /run/containerd/containerd.sock
7347      2944  0.0  0.0  11692  1464 ?        Ss   Oct12   0:36  \_ bash /usr/bin/logger_script
7347     11644  0.0  0.0   4368   652 ?        S    09:17   0:00  |   \_ sleep 1
root      5949  0.0  0.1 852800  5792 ?        Sl   Oct13   0:00  \_ runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/04443ab3a9f762d864ca0090bb4f793034adc42a834556abc36755b961075a94/log.json --log-format json exec --process /tmp/runc-process62819214 --detach --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/04443ab3a9f762d864ca0090bb4f793034adc42a834556abc36755b961075a94/ce20bfc804d4d73bf888b8eaca552b89152143cfb748ce286dd13e97b94ad9e8.pid 04443ab3a9f762d864ca0090bb4f793034adc42a834556abc36755b961075a94
7347      5960  0.0  0.0      0     0 ?        Zs   Oct13   0:00      \_ [ls] <defunct>
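
For anyone trying to confirm the same symptom, the defunct entries can be picked out of this kind of listing mechanically. The following is a minimal sketch, assuming the standard `ps aux` column order (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND) and that a STAT value beginning with `Z` marks a zombie; the function name `find_defunct` is hypothetical and not part of any tool mentioned in this issue.

```python
import re


def find_defunct(ps_output):
    """Return (pid, user, command) for every zombie line in `ps aux`-style text."""
    zombies = []
    for line in ps_output.splitlines():
        # Eleven columns; the last one (COMMAND) may itself contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line, blank line, or anything malformed
        user, pid, stat, command = fields[0], int(fields[1]), fields[7], fields[10]
        if stat.startswith("Z"):
            # Strip the tree-drawing prefix that `ps f` adds ("|   \_ ").
            command = re.sub(r"^[\s|]*\\_\s*", "", command)
            zombies.append((pid, user, command))
    return zombies
```

Fed the listing above, this would report only the `[ls] <defunct>` entry with PID 5960. It only detects zombies; it does nothing to remove them.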

Error message from kubectl describe:

Warning  Unhealthy  3m54s (x320 over 31h)  kubelet  Liveness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

Note that I've seen the same problem even in containers which have tini as their entrypoint. Also, this doesn't always happen; it seems to be more likely when there's more load on the VM, which supports my belief that this is related to https://github.com/containerd/containerd/issues/4255. Thanks in advance!


Environment:

  • CentOS 7 (kernel 5.4.217-1.el7.elrepo.x86_64)
  • Kubernetes 1.25.2
  • Docker 20.10.18
  • containerd 1.6.8
  • runc 1.1.4
  • cri-dockerd 0.2.6
    • network plugin: none (I only care about host networking for my application)

matthewtorr-msft, Oct 14 '22 11:10

Let me read the commits that fixed this in other CRIs to get an idea of the effort required. The essential problem with all CRIs, really, is that nothing reaps zombies, and doing so means implementing a fairly significant part of what pid 1 normally handles.

evol262, Oct 18 '22 10:10
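
As background on what "reaping" involves, here is a minimal sketch of the primitive an init-like process typically runs: a non-blocking `waitpid` loop that collects the exit status of every child that has already exited. This only illustrates the general technique under the assumption that the caller is the parent (or adopted parent) of the exited processes; it is not the fix used by cri-dockerd or any other CRI, and the name `reap_zombies` is hypothetical.

```python
import os


def reap_zombies():
    """Collect all already-exited children without blocking.

    Returns a list of (pid, raw_wait_status) pairs, empty if nothing was reaped.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG returns immediately if none has exited.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none is waiting to be reaped
        reaped.append((pid, status))
    return reaped
```

An init process would usually call something like this whenever it receives SIGCHLD. A process can only reap its own children, which is why a zombie whose parent never calls `wait` lingers until that parent exits and the zombie is reparented.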