Pod cannot be deleted due to missing container startup command when using crun

Open Bevisy opened this issue 1 year ago • 8 comments

What happened?

Using pod-config.json and container-config-nginx.json to create a pod:

# cat pod-config.json
{
    "metadata": {
        "name": "nginx-sandbox",
        "namespace": "default",
        "attempt": 1,
        "uid": "hdishd83djaidwnduwk28bcsb"
    },
    "log_directory": "/tmp",
    "linux": {
    }
}

# cat container-config-nginx.json
{
  "metadata": {
      "name": "nginx-0"
  },
  "image":{
      "image": "docker.io/library/nginx:latest"
  },
  "command": [
      "top"
  ],
  "linux": {
  }
}

Creating the container then fails:

# crictl run container-config-nginx.json pod-config.json
FATA[0012] running container: creating container failed: rpc error: code = Unknown desc = create container: create result: internal/proto/conmon.capnp:Conmon.createContainer: Failed: child command exited with: 1: executable file `top` not found in $PATH: No such file or directory

At this point, the container process on the node becomes a zombie process, and the pod cannot be deleted.

      1   15487   15486    2552 pts/1      11037 Sl       0   0:00 /usr/bin/crio-conmonrs --runtime /usr/bin/crio-crun --runtime-dir /var/lib/containers/storage/overlay-containers/7d46c4f2908be02f02465923ca1aca87295e8872231dae236287fe69209fdec9/userdata --runtime-root /run/crun --log-level info --log-driver systemd --cgroup-manager systemd
  15487   15496   15496   15496 ?             -1 Ss       0   0:00  \_ /pause
  15487   15509   15486    2552 pts/1      11037 Z        0   0:00  \_ [3] <defunct>
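The `Z` in the STAT column above is the process state that the kernel also exposes in /proc/<pid>/stat. As a minimal sketch for checking this programmatically (not code from CRI-O, conmon-rs, or crun; the function names `proc_state` and `exited_child_is_zombie_until_reaped` are made up for illustration), the following C snippet reads that state field and shows that an exited child stays in state `Z` until its parent waits for it:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Return the state character from /proc/<pid>/stat ('Z' means zombie),
 * or 0 if the file cannot be read or parsed.
 */
char proc_state(pid_t pid)
{
    char path[64], buf[512];

    assert(pid > 0);
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return 0;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* Layout is "pid (comm) state ..."; comm may contain ')', so search from the end. */
    char *end = strrchr(buf, ')');
    if (end == NULL || end[1] != ' ' || end[2] == '\0')
        return 0;
    return end[2];
}

/*
 * Fork a child that exits immediately and check that it is reported as 'Z'
 * until the parent waits for it. Returns 1 if so, 0 if not, -1 on error.
 */
int exited_child_is_zombie_until_reaped(void)
{
    siginfo_t info;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(1); /* mimic a process that fails right at startup */

    /* WNOWAIT blocks until the child has exited but leaves it unreaped. */
    if (waitid(P_PID, (id_t)child, &info, WEXITED | WNOWAIT) != 0)
        return -1;
    char before = proc_state(child);

    /* Reaping the child is what removes the zombie entry. */
    if (waitpid(child, NULL, 0) != child)
        return -1;
    return before == 'Z';
}
```

Calling `proc_state(15509)` on the node from the listing above would be expected to return `'Z'` for as long as nobody waits for that PID.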

However, this issue does not occur when using runc:

      1    9191    9190    2127 pts/1       9081 Sl       0   0:00 /usr/bin/crio-conmonrs --runtime /usr/bin/crio-runc --runtime-dir /var/lib/containers/storage/overlay-containers/408c6d69af793e8a90489a61b250741f0b47d6cce8a140e28b4b604e06cae0f0/userdata --runtime-root /run/runc --log-level info --log-driver systemd --cgroup-manager systemd
   9191    9209    9209    9209 ?             -1 Ss       0   0:00  \_ /pause

So, what could be the reason for this?

What did you expect to happen?

Expect the container process to exit normally instead of becoming a zombie process.

How can we reproduce it (as minimally and precisely as possible)?

See "What happened?" above.

Anything else we need to know?

No response

CRI-O and Kubernetes version

$ crio --version
crio version 1.31.0
Version:        1.31.0
GitCommit:      a51dfb336a1d3847415dfa871e81d003e4ef79ae
GitCommitDate:  2024-05-21T07:18:21Z
GitTreeState:   dirty
GoVersion:      go1.22.3
Compiler:       gc
Platform:       linux/amd64
Linkmode:       dynamic
BuildTags:
  containers_image_ostree_stub
  libdm_no_deferred_remove
  seccomp
  selinux
LDFlags:          unknown
SeccompEnabled:   true
AppArmorEnabled:  false
$ crio-conmonrs
v0.6.3

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux lima-crio 6.1.0-21-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.90-1 (2024-05-03) x86_64 GNU/Linux

Additional environment details (AWS, VirtualBox, physical, etc.)

nothing else

Bevisy avatar Jun 11 '24 06:06 Bevisy

The fact that it works with runc and conmon-rs makes me wonder what crun does differently.

OTOH the above use case works pretty well when using:

  • crun and conmon
  • runc and conmon
  • runc and conmon-rs

saschagrunert avatar Jun 11 '24 12:06 saschagrunert

what is the command line used by conmon-rs to run crun?

giuseppe avatar Jun 12 '24 15:06 giuseppe

I added some debug statements to conmon-rs and it runs:

crun \
    --root=/run/runc \
    --systemd-cgroup \
    create \
    --bundle /run/containers/storage/overlay-containers/dc31d87ed3b6530f23411374f44d4d84b4da8812d8af9b0e90258e04eb2ad03f/userdata \
    --pid-file /run/containers/storage/overlay-containers/dc31d87ed3b6530f23411374f44d4d84b4da8812d8af9b0e90258e04eb2ad03f/userdata/pidfile \
    dc31d87ed3b6530f23411374f44d4d84b4da8812d8af9b0e90258e04eb2ad03f

saschagrunert avatar Jun 13 '24 06:06 saschagrunert

thanks. The command line looks correct

giuseppe avatar Jun 13 '24 12:06 giuseppe

Hm, I assume that this issue is related to the hanging AppArmor tests in critest. The container creation fails and we're not able to clean up the pod afterwards.

saschagrunert avatar Aug 12 '24 13:08 saschagrunert

@giuseppe thinking about this: an issue could be that conmon-rs is capable of running multiple containers under a single binary compared to conmon.

In conmon-rs, we:

  1. run the sandbox container (which succeeds)
  2. create the actual container workload using the same conmon-rs instance (errors directly on startup)

When using runc, conmon-rs exits the pause container on waitpid: https://github.com/containers/conmon-rs/blob/70f39ef25/conmon-rs/server/src/child_reaper.rs#L411-L416

That's not the case for crun, which hangs and keeps the failed process around as a zombie. Is it possible that some re-parenting does not work as expected?

saschagrunert avatar Aug 14 '24 07:08 saschagrunert

Side note: the problem does not occur when the container runs with the pid namespace option set to 1:

Container:

{
  "metadata": {
    "name": "podsandbox1-redis"
  },
  "image": {
    "image": "quay.io/crio/fedora-crio-ci:latest"
  },
  "command": ["wrong"],
  "linux": {
    "security_context": {
      "namespace_options": {
        "pid": 1
      }
    }
  }
}

Sandbox:

{
  "metadata": {
    "name": "podsandbox1",
    "uid": "redhat-test-crio",
    "namespace": "redhat.test.crio",
    "attempt": 1
  }
}

It seems that the code paths diverge there: https://github.com/containers/crun/blob/26c7687b4a555187666c5a1afda049be185c1225/src/libcrun/container.c#L1718-L1742

saschagrunert avatar Aug 14 '24 07:08 saschagrunert

cc @kolyshkin

saschagrunert avatar Aug 14 '24 18:08 saschagrunert

is the pid namespace disabled for the container and is it expected to join the same one as the pause process or do both use the host pid namespace?

giuseppe avatar Sep 03 '24 13:09 giuseppe

we could address this specific case in crun, but to me it looks more like a problem in conmon-rs. It calls prctl::set_child_subreaper(true), but apparently it never calls waitpid(-1, &wstatus, 0). The call could also be non-blocking with WNOHANG, so the cleanup runs occasionally and no zombies are left around
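To make the suggestion concrete, here is a minimal C sketch of such a non-blocking reap loop. It is an illustration only, not code from conmon, conmon-rs, or crun, and the function name `reap_exited_children` is hypothetical:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Reap every child that has already exited, without blocking.
 * Returns the number of children reaped, or -1 on an unexpected error.
 */
int reap_exited_children(void)
{
    int reaped = 0;

    for (;;) {
        int wstatus;
        pid_t pid = waitpid(-1, &wstatus, WNOHANG);

        if (pid > 0) {
            reaped++;      /* one zombie collected; look for more */
            continue;
        }
        if (pid == 0)
            return reaped; /* children exist, but none has exited yet */

        assert(pid == -1); /* the only remaining return value of waitpid */
        if (errno == EINTR)
            continue;      /* interrupted by a signal; retry */
        if (errno == ECHILD)
            return reaped; /* no children left at all */
        return -1;
    }
}
```

A subreaper could call a function like this from a SIGCHLD handler path or on a timer. Because `waitpid(-1, ...)` matches any child, it also collects processes that were reparented to the caller and that it never spawned itself.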

giuseppe avatar Sep 09 '24 11:09 giuseppe

@giuseppe after some more investigation I don't think it's related to the waitpid itself. In conmon-rs, we:

  • Spawn the crun process like this:
Running: "/home/sascha/git/crun/crun" ["--log=journald:5a0349019be0804088f20bb29c97cc3cd683d81604f952098163a64a262b1ba1", "--log-level=debug", "--root=/run/crun", "--systemd-cgroup", "--root=/run/crun", "--systemd-cgroup", "create", "--bundle", "/run/containers/storage/overlay-containers/5a0349019be0804088f20bb29c97cc3cd683d81604f952098163a64a262b1ba1/userdata", "--pid-file", "/run/containers/storage/overlay-containers/5a0349019be0804088f20bb29c97cc3cd683d81604f952098163a64a262b1ba1/userdata/pidfile", "5a0349019be0804088f20bb29c97cc3cd683d81604f952098163a64a262b1ba1"]

https://github.com/containers/conmon-rs/blob/f4f22d081b024467466acb3717bc0856b6b3d37d/conmon-rs/server/src/child_reaper.rs#L96

  • Then we evaluate the error (executable not found in $PATH) and return it correctly from conmon-rs to CRI-O: https://github.com/containers/conmon-rs/blob/f4f22d081b024467466acb3717bc0856b6b3d37d/conmon-rs/server/src/child_reaper.rs#L140

After that, conmon-rs is still running because it still holds the pause process, which is intentional.

The issue is that we still have crun ([3]) around as a zombie process:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     1112014  0.0  0.0      0     0 ?        Zs   12:13   0:00 [3] <defunct>

Interestingly, 1112014 is not the PID returned by cmd.spawn() in conmon-rs; that one is 1112012. So I'm not sure how to even handle that process in conmon-rs. The crun logs also indicate that 1112014 is the actual container PID, which is already unavailable in conmon-rs because the init command failed.

I feel that we should clean up that PID in crun. Do you have more insights into that?

saschagrunert avatar Sep 10 '24 10:09 saschagrunert

when a process (conmon-rs in this case) marks itself as a child subreaper via set_child_subreaper(), every process in its subtree that becomes orphaned is reparented to it. When such a process exits, it stays a zombie until the subreaper waits for it.

We could fix this specific case in crun, but there is the risk that other processes still won't be cleaned up. This can happen, for example, if you run a container without a PID namespace: each zombie in the container is then reparented to conmon-rs.

conmon implements this logic here: https://github.com/containers/conmon/blob/main/src/ctr_exit.c#L44
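The reparenting described above can be reproduced in isolation. The following C sketch is an illustration under stated assumptions, not code from conmon, conmon-rs, or crun; the function name `orphan_is_reparented_to_subreaper` is made up. It marks the caller as a subreaper, forks a child that forks a grandchild and exits, and then checks that the grandchild's parent PID has become the caller's PID, even though the caller never spawned that process directly:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Become a child subreaper, then fork a child that forks a grandchild and
 * exits right away. Returns 1 if the orphaned grandchild was reparented to
 * the calling process, 0 if it was not, and -1 on error.
 */
int orphan_is_reparented_to_subreaper(void)
{
    int go[2], report[2], status;
    pid_t self = getpid(), new_parent = 0;

    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0)
        return -1;
    if (pipe(go) != 0 || pipe(report) != 0)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        pid_t grandchild = fork();
        if (grandchild == 0) {
            char byte;
            /* Wait until the subreaper has seen the intermediate child exit. */
            if (read(go[0], &byte, 1) != 1)
                _exit(1);
            pid_t ppid = getppid();
            if (write(report[1], &ppid, sizeof ppid) != (ssize_t)sizeof ppid)
                _exit(1);
            _exit(0);
        }
        _exit(grandchild < 0 ? 1 : 0); /* exiting here orphans the grandchild */
    }

    close(report[1]); /* so read() below sees EOF if the grandchild dies early */

    /* Once the intermediate child is reaped, the grandchild has a new parent. */
    if (waitpid(child, &status, 0) != child)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    if (write(go[1], "x", 1) != 1)
        return -1;
    ssize_t n = read(report[0], &new_parent, sizeof new_parent);
    close(go[0]);
    close(go[1]);
    close(report[0]);

    if (n != (ssize_t)sizeof new_parent)
        return -1;
    if (new_parent != self)
        return 0;

    /* We never forked the grandchild ourselves, yet only we can reap it. */
    return waitpid(-1, &status, 0) > 0 ? 1 : -1;
}
```

The final `waitpid(-1, ...)` is the step that matters for this issue: without a wait that matches any child, the reparented process would remain a zombie under the subreaper, much like PID 1112014 in the listing above.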

giuseppe avatar Sep 10 '24 10:09 giuseppe