Pod cannot be deleted due to missing container startup command when using crun
What happened?
Using pod-config.json and container-config-nginx.json to create a pod:
# cat pod-config.json
{
  "metadata": {
    "name": "nginx-sandbox",
    "namespace": "default",
    "attempt": 1,
    "uid": "hdishd83djaidwnduwk28bcsb"
  },
  "log_directory": "/tmp",
  "linux": {}
}
# cat container-config-nginx.json
{
  "metadata": {
    "name": "nginx-0"
  },
  "image": {
    "image": "docker.io/library/nginx:latest"
  },
  "command": [
    "top"
  ],
  "linux": {}
}
Then we can see that creating the container failed:
# crictl run container-config-nginx.json pod-config.json
FATA[0012] running container: creating container failed: rpc error: code = Unknown desc = create container: create result: internal/proto/conmon.capnp:Conmon.createContainer: Failed: child command exited with: 1: executable file `top` not found in $PATH: No such file or directory
At this point, the container process on the node becomes a zombie process, and the pod cannot be deleted.
1 15487 15486 2552 pts/1 11037 Sl 0 0:00 /usr/bin/crio-conmonrs --runtime /usr/bin/crio-crun --runtime-dir /var/lib/containers/storage/overlay-containers/7d46c4f2908be02f02465923ca1aca87295e8872231dae236287fe69209fdec9/userdata --runtime-root /run/crun --log-level info --log-driver systemd --cgroup-manager systemd
15487 15496 15496 15496 ? -1 Ss 0 0:00 \_ /pause
15487 15509 15486 2552 pts/1 11037 Z 0 0:00 \_ [3] <defunct>
However, this issue does not occur when using runc:
1 9191 9190 2127 pts/1 9081 Sl 0 0:00 /usr/bin/crio-conmonrs --runtime /usr/bin/crio-runc --runtime-dir /var/lib/containers/storage/overlay-containers/408c6d69af793e8a90489a61b250741f0b47d6cce8a140e28b4b604e06cae0f0/userdata --runtime-root /run/runc --log-level info --log-driver systemd --cgroup-manager systemd
9191 9209 9209 9209 ? -1 Ss 0 0:00 \_ /pause
So, what could be the reason for this?
What did you expect to happen?
Expect the container process to exit normally instead of becoming a zombie process.
How can we reproduce it (as minimally and precisely as possible)?
See what happened.
Anything else we need to know?
No response
CRI-O and Kubernetes version
$ crio --version
crio version 1.31.0
Version: 1.31.0
GitCommit: a51dfb336a1d3847415dfa871e81d003e4ef79ae
GitCommitDate: 2024-05-21T07:18:21Z
GitTreeState: dirty
GoVersion: go1.22.3
Compiler: gc
Platform: linux/amd64
Linkmode: dynamic
BuildTags:
containers_image_ostree_stub
libdm_no_deferred_remove
seccomp
selinux
LDFlags: unknown
SeccompEnabled: true
AppArmorEnabled: false
$ crio-conmonrs
v0.6.3
OS version
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux lima-crio 6.1.0-21-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.90-1 (2024-05-03) x86_64 GNU/Linux
Additional environment details (AWS, VirtualBox, physical, etc.)
The fact that it works with runc and conmon-rs makes me wonder what crun does differently.
OTOH the above use case works pretty well when using:
- crun and conmon
- runc and conmon
- runc and conmon-rs
what is the command line used by conmon-rs to run crun?
I added some debug statement to conmon-rs and it runs:
crun \
--root=/run/runc \
--systemd-cgroup \
create \
--bundle /run/containers/storage/overlay-containers/dc31d87ed3b6530f23411374f44d4d84b4da8812d8af9b0e90258e04eb2ad03f/userdata \
--pid-file /run/containers/storage/overlay-containers/dc31d87ed3b6530f23411374f44d4d84b4da8812d8af9b0e90258e04eb2ad03f/userdata/pidfile \
dc31d87ed3b6530f23411374f44d4d84b4da8812d8af9b0e90258e04eb2ad03f
thanks. The command line looks correct
Hm, I assume that this issue is related to the hanging AppArmor tests in critest. The container creation fails and we're not able to cleanup the pod afterwards.
@giuseppe thinking about this: an issue could be that conmon-rs is capable of running multiple containers within a single process, unlike conmon.
In conmon-rs, we:
- run the sandbox container (which succeeds)
- create the actual container workload using the same conmon-rs instance (errors directly on startup)
When using runc, conmon-rs exits the pause container on waitpid: https://github.com/containers/conmon-rs/blob/70f39ef25/conmon-rs/server/src/child_reaper.rs#L411-L416
That's not the case for crun, which hangs and carries the failed process as zombie. Is it possible that some re-parenting does not work as expected?
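For illustration, a wait that targets one specific PID only ever collects that PID; a zombie that was reparented to the subreaper under a different PID is left alone. A minimal sketch assuming the nix crate (illustrative only, not the actual conmon-rs code):

use nix::sys::wait::{waitpid, WaitStatus};
use nix::unistd::Pid;

// Illustrative sketch, not conmon-rs code: blocks until the given child changes
// state and reaps exactly that PID. A zombie reparented to us under a *different*
// PID (e.g. a failed container init) is not collected by this call and stays
// <defunct> until someone does a catch-all wait.
fn wait_for_child(child: Pid) -> nix::Result<WaitStatus> {
    waitpid(child, None)
}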
Side effect: the problem does not occur when the container runs with the pid namespace option set to 1:
Container:
{
  "metadata": {
    "name": "podsandbox1-redis"
  },
  "image": {
    "image": "quay.io/crio/fedora-crio-ci:latest"
  },
  "command": ["wrong"],
  "linux": {
    "security_context": {
      "namespace_options": {
        "pid": 1
      }
    }
  }
}
Sandbox:
{
  "metadata": {
    "name": "podsandbox1",
    "uid": "redhat-test-crio",
    "namespace": "redhat.test.crio",
    "attempt": 1
  }
}
It seems that the code paths diverge there: https://github.com/containers/crun/blob/26c7687b4a555187666c5a1afda049be185c1225/src/libcrun/container.c#L1718-L1742
cc @kolyshkin
Is the pid namespace disabled for the container? Is it expected to join the same one as the pause process, or do both use the host pid namespace?
We could address this specific case in crun, but to me it looks more like a problem in conmon-rs. It sets prctl::set_child_subreaper(true), but apparently it doesn't do waitpid(-1, &wstatus, 0);. The call could also be made non-blocking with WNOHANG so you can do the cleanup occasionally and make sure zombies are not left around.
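A minimal sketch of that kind of cleanup pass, assuming the nix crate; reap_zombies is a hypothetical helper, not existing conmon-rs code:

use nix::errno::Errno;
use nix::sys::wait::{waitpid, WaitPidFlag, WaitStatus};

// Hypothetical helper: reap every child (including reparented orphans) that has
// already exited, no matter which PID it ended up with. It never blocks, so it
// can be run periodically or on SIGCHLD.
fn reap_zombies() -> nix::Result<()> {
    loop {
        // waitpid(-1, ..., WNOHANG): collect any exited child without blocking.
        match waitpid(None, Some(WaitPidFlag::WNOHANG)) {
            Ok(WaitStatus::StillAlive) => return Ok(()), // children exist, none exited yet
            Err(Errno::ECHILD) => return Ok(()),         // no children left at all
            Ok(status) => eprintln!("reaped child: {status:?}"),
            Err(e) => return Err(e),
        }
    }
}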
@giuseppe after some more investigation I don't think it's related to the waitpid itself. In conmon-rs, we:
- Spawn the crun process like this:
Running: "/home/sascha/git/crun/crun" ["--log=journald:5a0349019be0804088f20bb29c97cc3cd683d81604f952098163a64a262b1ba1", "--log-level=debug", "--root=/run/crun", "--systemd-cgroup", "--root=/run/crun", "--systemd-cgroup", "create", "--bundle", "/run/containers/storage/overlay-containers/5a0349019be0804088f20bb29c97cc3cd683d81604f952098163a64a262b1ba1/userdata", "--pid-file", "/run/containers/storage/overlay-containers/5a0349019be0804088f20bb29c97cc3cd683d81604f952098163a64a262b1ba1/userdata/pidfile", "5a0349019be0804088f20bb29c97cc3cd683d81604f952098163a64a262b1ba1"]
https://github.com/containers/conmon-rs/blob/f4f22d081b024467466acb3717bc0856b6b3d37d/conmon-rs/server/src/child_reaper.rs#L96
- Then we evaluate the error (executable file not found in $PATH) and return it correctly from conmon-rs to CRI-O: https://github.com/containers/conmon-rs/blob/f4f22d081b024467466acb3717bc0856b6b3d37d/conmon-rs/server/src/child_reaper.rs#L140
After that, conmon-rs is still running because it still holds the pause process, which is intentional.
The issue is that we still have crun ([3]) running as a zombie process:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1112014 0.0 0.0 0 0 ? Zs 12:13 0:00 [3] <defunct>
Interestingly, 1112014 is not the PID from cmd.spawn() in conmon-rs; that one is 1112012. So I'm not sure how to even handle that process in conmon-rs. The crun logs also indicate that 1112014 is the actual container PID, which is already unavailable in conmon-rs because the init command failed.
I feel that we should clean up that PID in crun; do you have more insight into that?
When a process (conmon-rs in this case) marks itself as a child subreaper with set_child_subreaper(), every orphaned process in its subtree is reparented to that subreaper instead of to init, so any zombies among them have to be reaped there.
We could fix this specific case, but there is the risk that other processes won't be cleaned up; this can happen, for example, when you run a container without a PID namespace, where each zombie in the container is reparented to conmon-rs.
conmon implements this logic here: https://github.com/containers/conmon/blob/main/src/ctr_exit.c#L44
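To make the mechanism concrete, here is a self-contained sketch of the reparenting behaviour described above, assuming the nix crate; it only demonstrates what the kernel does for a subreaper and is not conmon or conmon-rs code:

use nix::errno::Errno;
use nix::sys::prctl::set_child_subreaper;
use nix::sys::wait::waitpid;
use nix::unistd::{fork, ForkResult};
use std::process::exit;

fn main() -> nix::Result<()> {
    // Mark ourselves as a subreaper: orphans in our subtree reparent to us, not to init.
    set_child_subreaper(true)?;

    match unsafe { fork() }? {
        ForkResult::Child => {
            // Intermediate child: fork a grandchild and exit right away, orphaning it
            // (roughly what happens when the runtime's create process goes away).
            if let ForkResult::Child = unsafe { fork() }.unwrap() {
                exit(1); // grandchild fails "on startup" and exits
            }
            exit(0)
        }
        ForkResult::Parent { child } => {
            // Reap the intermediate child we spawned directly.
            waitpid(child, None)?;

            // The grandchild has been reparented to us by now. Without a catch-all
            // wait like this it would remain a <defunct> zombie under our PID.
            loop {
                match waitpid(None, None) {
                    Err(Errno::ECHILD) => break, // nothing left to reap
                    Ok(status) => println!("reaped reparented orphan: {status:?}"),
                    Err(e) => return Err(e),
                }
            }
            Ok(())
        }
    }
}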