crun
Discrepancy between crun and runc when disallowing access by default to devices with cgroups v1
Hello, thank you for developing crun!
I use Docker containers as CI environments for developing container tools, so I often use OCI runtimes within privileged Docker containers.
I noticed that on systems with cgroups v1, when the bundle's config.json is set to disallow access to all devices by default, crun apparently allows access to all devices inside the container, while runc abides by the config (aside from the essential special devices it sets up on its own).
For example, within a Fedora 39 Docker container:
[root@39f2b2db9bb6 /]# runc --version
runc version 1.1.12
spec: 1.0.2-dev
go: go1.21.6
libseccomp: 2.5.3
[root@39f2b2db9bb6 /]# crun --version
crun version 1.14.4
commit: a220ca661ce078f2c37b38c92e66cf66c012d9c1
rundir: /run/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
[root@39f2b2db9bb6 /]# cat /sys/fs/cgroup/devices/devices.list
a *:* rwm
# cd to an OCI bundle with a Ubuntu rootfs
[root@39f2b2db9bb6 /]# cd oci-bundle/
[root@39f2b2db9bb6 oci-bundle]# ls -l
total 4
-rw-r--r-- 1 1000 users 2700 Mar 13 18:54 config.json
drwxr-xr-x 1 1000 users 154 Mar 13 16:24 rootfs
[root@39f2b2db9bb6 oci-bundle]# runc run test
docker@39f2b2db9bb6:/$ cat /sys/fs/cgroup/devices/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm
docker@39f2b2db9bb6:/$
exit
[root@39f2b2db9bb6 oci-bundle]# crun run test
docker@39f2b2db9bb6:/$ cat /sys/fs/cgroup/devices/devices.list
a *:* rwm
docker@39f2b2db9bb6:/$
exit
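For reference, each devices.list entry has the form `<type> <major>:<minor> <access>`, where type `a` matches any device and `*` is a wildcard. A minimal Python sketch of that matching logic (an illustration only, not code from either runtime) shows why `a *:* rwm` means unrestricted access while runc's whitelist does not:

```python
# Illustrative checker for cgroup v1 devices.list semantics.
# Not code from runc or crun; a simplified model of the kernel's
# whitelist check: access is granted if one entry matches the device
# and covers every requested access character.

def parse_rule(line):
    dev_type, majmin, access = line.split()
    major, minor = majmin.split(":")
    return dev_type, major, minor, set(access)

def allowed(rules, dev_type, major, minor, access):
    """Return True if some rule grants all requested access chars."""
    for r_type, r_major, r_minor, r_access in map(parse_rule, rules):
        type_ok = r_type == "a" or r_type == dev_type
        major_ok = r_major == "*" or r_major == str(major)
        minor_ok = r_minor == "*" or r_minor == str(minor)
        if type_ok and major_ok and minor_ok and set(access) <= r_access:
            return True
    return False

crun_list = ["a *:* rwm"]             # what crun produced: everything allowed
runc_list = ["c *:* m", "c 1:3 rwm"]  # excerpt of runc's whitelist

print(allowed(crun_list, "b", 8, 0, "rw"))   # True: any block device readable/writable
print(allowed(runc_list, "b", 8, 0, "rw"))   # False: no rule grants it
print(allowed(runc_list, "c", 1, 3, "rwm"))  # True: /dev/null is whitelisted
```

So with crun's single `a *:* rwm` entry the container can read, write, and mknod any device node, which is what the deny-all config was supposed to prevent.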
The config.json is the following:
{
  "ociVersion": "1.0.0",
  "process": {
    "terminal": true,
    "user": {
      "uid": 1000,
      "gid": 1000,
      "additionalGids": [
        1000
      ]
    },
    "args": [
      "bash"
    ],
    "env": [
      "SHLVL=1",
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm",
      "HOME=/home/docker",
      "PWD=/home/docker"
    ],
    "cwd": "/",
    "capabilities": {},
    "noNewPrivileges": true
  },
  "root": {
    "path": "rootfs",
    "readonly": false
  },
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "proc"
    },
    {
      "destination": "/dev/pts",
      "type": "devpts",
      "source": "devpts",
      "options": [
        "nosuid",
        "noexec",
        "newinstance",
        "ptmxmode=0666",
        "mode=0620",
        "gid=5"
      ]
    },
    {
      "destination": "/dev/shm",
      "type": "bind",
      "source": "/dev/shm",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "rbind",
        "slave",
        "rw"
      ]
    },
    {
      "destination": "/dev/mqueue",
      "type": "mqueue",
      "source": "mqueue",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/sys",
      "type": "sysfs",
      "source": "sysfs",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "ro"
      ]
    },
    {
      "destination": "/sys/fs/cgroup",
      "type": "cgroup",
      "source": "cgroup",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "relatime",
        "ro"
      ]
    }
  ],
  "linux": {
    "resources": {
      "cpu": {
        "cpus": "0,1,2,3,4,5,6,7"
      },
      "devices": [
        {
          "allow": false,
          "access": "rwm"
        }
      ]
    },
    "namespaces": [
      {
        "type": "mount"
      }
    ],
    "rootfsPropagation": "slave",
    "maskedPaths": [
      "/proc/kcore",
      "/proc/latency_stats",
      "/proc/timer_list",
      "/proc/timer_stats",
      "/proc/sched_debug",
      "/sys/firmware",
      "/proc/scsi"
    ],
    "readonlyPaths": [
      "/proc/asound",
      "/proc/bus",
      "/proc/fs",
      "/proc/irq",
      "/proc/sys",
      "/proc/sysrq-trigger"
    ]
  }
}
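On cgroups v1, a rule like the deny-all entry above is expected to become a write to the devices controller's devices.deny file, with per-device devices.allow writes for any exceptions. A rough dry-run sketch of that translation (illustrative only, not crun's or runc's actual code, and printing instead of writing under /sys/fs/cgroup/devices/):

```python
# Dry-run sketch: translate OCI linux.resources.devices rules into the
# cgroup v1 devices.deny / devices.allow entries a runtime would write.
# Hypothetical helper for illustration; real runtimes write these entries
# into the container's cgroup directory.

def device_writes(rules):
    writes = []
    for rule in rules:
        dev_type = rule.get("type", "a")   # omitted type matches all devices
        major = rule.get("major", "*")     # omitted major/minor are wildcards
        minor = rule.get("minor", "*")
        entry = f"{dev_type} {major}:{minor} {rule['access']}"
        target = "devices.allow" if rule["allow"] else "devices.deny"
        writes.append((target, entry))
    return writes

# The single deny-all rule from the config.json above:
rules = [{"allow": False, "access": "rwm"}]
for target, entry in device_writes(rules):
    print(f"write '{entry}' to {target}")  # write 'a *:* rwm' to devices.deny
```

With only this rule in place, the resulting devices.list should be empty (or contain only the runtime's own default device exceptions), which is what runc produces; crun's `a *:* rwm` suggests the deny write is being skipped in this nested setup.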
The configuration of a privileged container (no user namespace) is intentional in this case.
I can reproduce the behavior described above only when calling crun within Docker containers, not when using it on native hosts. What am I missing?
Thanks in advance for any help provided!
I tried reproducing it using a Podman container created with podman run --privileged -v /root:/root --rm -ti fedora:39 bash but in both cases the inner container does not create a cgroup.
How have you created the outer Docker container?
Can you please verify the cgroup of the container process with cat /proc/$PID_CGROUP/cgroup from the host in both cases?
> I tried reproducing it using a Podman container created with podman run --privileged -v /root:/root --rm -ti fedora:39 bash but in both cases the inner container does not create a cgroup.
EDIT: I was looking at the wrong thing.
They both create a cgroup, but I see the same configuration:
# cat /sys/fs/cgroup/devices/machine.slice/libpod-bcf881874d62ce2cf2226eb8598e0a1dd2bc4d1ea96c9fa9e577872720aca34c.scope/container/runc-container/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm
# cat /sys/fs/cgroup/devices/machine.slice/libpod-bcf881874d62ce2cf2226eb8598e0a1dd2bc4d1ea96c9fa9e577872720aca34c.scope/container/crun-container/devices.list
c *:* m
b *:* m
c 1:3 rwm
c 1:8 rwm
c 1:7 rwm
c 5:0 rwm
c 1:5 rwm
c 1:9 rwm
c 5:1 rwm
c 136:* rwm
c 5:2 rwm
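The two lists above are indeed near-identical: both are restricted whitelists rather than `a *:* rwm`. A quick set difference over the entries quoted above makes the only divergence visible, c 10:200 (which is /dev/net/tun) on the runc side versus c 5:1 (/dev/console) on the crun side:

```python
# Set diff of the two devices.list outputs quoted above (entries copied
# verbatim from the Podman reproduction attempt).
runc = {"b *:* m", "c *:* m", "c 1:3 rwm", "c 1:5 rwm", "c 1:7 rwm",
        "c 1:8 rwm", "c 1:9 rwm", "c 5:0 rwm", "c 5:2 rwm",
        "c 10:200 rwm", "c 136:* rwm"}
crun = {"c *:* m", "b *:* m", "c 1:3 rwm", "c 1:8 rwm", "c 1:7 rwm",
        "c 5:0 rwm", "c 1:5 rwm", "c 1:9 rwm", "c 5:1 rwm",
        "c 136:* rwm", "c 5:2 rwm"}
print(sorted(runc - crun))  # ['c 10:200 rwm']  (/dev/net/tun)
print(sorted(crun - runc))  # ['c 5:1 rwm']     (/dev/console)
```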
Hi @giuseppe, thanks for your reply. The outer container is created with a command like
> docker run --rm -it -v $(pwd):/oci-bundle --privileged fedora:39 bash
The Docker config I'm running on my laptop is
> docker info
Client:
 Version: 24.0.7-ce
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
   Version: 0.11.2
   Path: /usr/lib/docker/cli-plugins/docker-buildx

Server:
 Containers: 6
  Running: 1
  Paused: 0
  Stopped: 5
 Images: 274
 Server Version: 24.0.7-ce
 Storage Driver: btrfs
  Btrfs:
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 oci runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8e4b0bde866788eec76735cc77c4720144248fb7
 runc version: v1.1.10-0-g18a0cb0f32bc
 init version:
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 5.14.21-150400.24.81-default
 Operating System: openSUSE Leap 15.4
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.06GiB
 Name: carbon
 ID: 7IY2:7RUT:5VJZ:QQKN:S75T:CZBI:VM4J:UHRR:JT5K:75EH:FCU5:IKA7
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: madeeks
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
I thought the behavior could be related to the cgroup driver, but I obtained the same results (i.e., different device allow lists) when using runc/crun on an Ubuntu 20.04 VM where Docker uses the cgroupfs driver.
I can also reproduce the results with Podman using both systemd and cgroupfs arguments to --cgroup-manager (in this case I ran rootful Podman since the option is not supported with rootless Podman and cgroups v1).
I'll keep digging into how my container engines set up cgroups in the outer containers.
> I thought the behavior could be related to the cgroup driver, but I obtained the same results (i.e. different device allowed lists) when using runc/crun on a Ubuntu 20.04 VM using Docker with a cgroupfs driver.
Do you get the same results with runc and crun?
Apologies for the ambiguous wording.
I get the same results on my openSUSE laptop and on the Ubuntu 20.04 VM. That is, on both systems the device cgroups produced by runc and crun differ (when the runtimes are started inside a Docker container).
The Docker cgroup driver differs between the two platforms: systemd on the openSUSE laptop, cgroupfs on the Ubuntu VM.
If I create the outer container with Podman, I still observe the crun/runc device cgroup differences.
This happens even when using different values for Podman's --cgroup-manager option.