crun
Discrepancy between crun and runc when disallowing access by default to devices with cgroups v1
Hello, thank you for developing crun!
I use Docker containers as CI environments for developing container tools, so I often use OCI runtimes within privileged Docker containers.
I noticed that on systems with cgroups v1, when the bundle's config.json is set to disallow access to all devices by default, crun apparently allows access to all devices inside the container, while runc abides by the config (aside from the essential special devices it sets up on its own).
For example, within a Fedora 39 Docker container:
[root@39f2b2db9bb6 /]# runc --version
runc version 1.1.12
spec: 1.0.2-dev
go: go1.21.6
libseccomp: 2.5.3
[root@39f2b2db9bb6 /]# crun --version
crun version 1.14.4
commit: a220ca661ce078f2c37b38c92e66cf66c012d9c1
rundir: /run/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
[root@39f2b2db9bb6 /]# cat /sys/fs/cgroup/devices/devices.list
a *:* rwm
# cd to an OCI bundle with a Ubuntu rootfs
[root@39f2b2db9bb6 /]# cd oci-bundle/
[root@39f2b2db9bb6 oci-bundle]# ls -l
total 4
-rw-r--r-- 1 1000 users 2700 Mar 13 18:54 config.json
drwxr-xr-x 1 1000 users 154 Mar 13 16:24 rootfs
[root@39f2b2db9bb6 oci-bundle]# runc run test
docker@39f2b2db9bb6:/$ cat /sys/fs/cgroup/devices/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm
docker@39f2b2db9bb6:/$
exit
[root@39f2b2db9bb6 oci-bundle]# crun run test
docker@39f2b2db9bb6:/$ cat /sys/fs/cgroup/devices/devices.list
a *:* rwm
docker@39f2b2db9bb6:/$
exit
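For reference, each devices.list entry has the form `<type> <major>:<minor> <access>`, where type `a` matches any device and `*` is a wildcard. A minimal Python sketch of that matching logic (an illustration only, not code from either runtime) shows why `a *:* rwm` means unrestricted access while runc's whitelist does not:

```python
# Illustrative checker for cgroup v1 devices.list semantics.
# Not code from runc or crun; a simplified model of the kernel's
# whitelist check: access is granted if one entry matches the device
# and covers every requested access character.

def parse_rule(line):
    dev_type, majmin, access = line.split()
    major, minor = majmin.split(":")
    return dev_type, major, minor, set(access)

def allowed(rules, dev_type, major, minor, access):
    """Return True if some rule grants all requested access chars."""
    for r_type, r_major, r_minor, r_access in map(parse_rule, rules):
        type_ok = r_type == "a" or r_type == dev_type
        major_ok = r_major == "*" or r_major == str(major)
        minor_ok = r_minor == "*" or r_minor == str(minor)
        if type_ok and major_ok and minor_ok and set(access) <= r_access:
            return True
    return False

crun_list = ["a *:* rwm"]             # what crun produced: everything allowed
runc_list = ["c *:* m", "c 1:3 rwm"]  # excerpt of runc's whitelist

print(allowed(crun_list, "b", 8, 0, "rw"))   # True: any block device readable/writable
print(allowed(runc_list, "b", 8, 0, "rw"))   # False: no rule grants it
print(allowed(runc_list, "c", 1, 3, "rwm"))  # True: /dev/null is whitelisted
```

So with crun's single `a *:* rwm` entry the container can read, write, and mknod any device node, which is what the deny-all config was supposed to prevent.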
The config.json is the following:
{
  "ociVersion": "1.0.0",
  "process": {
    "terminal": true,
    "user": {
      "uid": 1000,
      "gid": 1000,
      "additionalGids": [
        1000
      ]
    },
    "args": [
      "bash"
    ],
    "env": [
      "SHLVL=1",
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm",
      "HOME=/home/docker",
      "PWD=/home/docker"
    ],
    "cwd": "/",
    "capabilities": {},
    "noNewPrivileges": true
  },
  "root": {
    "path": "rootfs",
    "readonly": false
  },
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "proc"
    },
    {
      "destination": "/dev/pts",
      "type": "devpts",
      "source": "devpts",
      "options": [
        "nosuid",
        "noexec",
        "newinstance",
        "ptmxmode=0666",
        "mode=0620",
        "gid=5"
      ]
    },
    {
      "destination": "/dev/shm",
      "type": "bind",
      "source": "/dev/shm",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "rbind",
        "slave",
        "rw"
      ]
    },
    {
      "destination": "/dev/mqueue",
      "type": "mqueue",
      "source": "mqueue",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/sys",
      "type": "sysfs",
      "source": "sysfs",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "ro"
      ]
    },
    {
      "destination": "/sys/fs/cgroup",
      "type": "cgroup",
      "source": "cgroup",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "relatime",
        "ro"
      ]
    }
  ],
  "linux": {
    "resources": {
      "cpu": {
        "cpus": "0,1,2,3,4,5,6,7"
      },
      "devices": [
        {
          "allow": false,
          "access": "rwm"
        }
      ]
    },
    "namespaces": [
      {
        "type": "mount"
      }
    ],
    "rootfsPropagation": "slave",
    "maskedPaths": [
      "/proc/kcore",
      "/proc/latency_stats",
      "/proc/timer_list",
      "/proc/timer_stats",
      "/proc/sched_debug",
      "/sys/firmware",
      "/proc/scsi"
    ],
    "readonlyPaths": [
      "/proc/asound",
      "/proc/bus",
      "/proc/fs",
      "/proc/irq",
      "/proc/sys",
      "/proc/sysrq-trigger"
    ]
  }
}
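On cgroups v1, a rule like the deny-all entry above is expected to become a write to the devices controller's devices.deny file, with per-device devices.allow writes for any exceptions. A rough dry-run sketch of that translation (illustrative only, not crun's or runc's actual code, and printing instead of writing under /sys/fs/cgroup/devices/):

```python
# Dry-run sketch: translate OCI linux.resources.devices rules into the
# cgroup v1 devices.deny / devices.allow entries a runtime would write.
# Hypothetical helper for illustration; real runtimes write these entries
# into the container's cgroup directory.

def device_writes(rules):
    writes = []
    for rule in rules:
        dev_type = rule.get("type", "a")   # omitted type matches all devices
        major = rule.get("major", "*")     # omitted major/minor are wildcards
        minor = rule.get("minor", "*")
        entry = f"{dev_type} {major}:{minor} {rule['access']}"
        target = "devices.allow" if rule["allow"] else "devices.deny"
        writes.append((target, entry))
    return writes

# The single deny-all rule from the config.json above:
rules = [{"allow": False, "access": "rwm"}]
for target, entry in device_writes(rules):
    print(f"write '{entry}' to {target}")  # write 'a *:* rwm' to devices.deny
```

With only this rule in place, the resulting devices.list should be empty (or contain only the runtime's own default device exceptions), which is what runc produces; crun's `a *:* rwm` suggests the deny write is being skipped in this nested setup.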
The configuration of a privileged container (no user namespace) is intentional in this case.
I can reproduce the behavior described above only when calling crun within Docker containers, not when using it on native hosts. What am I missing?
Thanks in advance for any help provided!
I tried reproducing it using a Podman container created with podman run --privileged -v /root:/root --rm -ti fedora:39 bash but in both cases the inner container does not create a cgroup.
How have you created the outer Docker container?
Can you please verify the cgroup of the container process with cat /proc/$PID_CGROUP/cgroup from the host in both cases?
> I tried reproducing it using a Podman container created with podman run --privileged -v /root:/root --rm -ti fedora:39 bash but in both cases the inner container does not create a cgroup.
EDIT: I was looking at the wrong thing.
They both create a cgroup, but I see the same configuration:
# cat /sys/fs/cgroup/devices/machine.slice/libpod-bcf881874d62ce2cf2226eb8598e0a1dd2bc4d1ea96c9fa9e577872720aca34c.scope/container/runc-container/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm
# cat /sys/fs/cgroup/devices/machine.slice/libpod-bcf881874d62ce2cf2226eb8598e0a1dd2bc4d1ea96c9fa9e577872720aca34c.scope/container/crun-container/devices.list
c *:* m
b *:* m
c 1:3 rwm
c 1:8 rwm
c 1:7 rwm
c 5:0 rwm
c 1:5 rwm
c 1:9 rwm
c 5:1 rwm
c 136:* rwm
c 5:2 rwm
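The two lists above are indeed near-identical: both are restricted whitelists rather than `a *:* rwm`. A quick set difference over the entries quoted above makes the only divergence visible, c 10:200 (which is /dev/net/tun) on the runc side versus c 5:1 (/dev/console) on the crun side:

```python
# Set diff of the two devices.list outputs quoted above (entries copied
# verbatim from the Podman reproduction attempt).
runc = {"b *:* m", "c *:* m", "c 1:3 rwm", "c 1:5 rwm", "c 1:7 rwm",
        "c 1:8 rwm", "c 1:9 rwm", "c 5:0 rwm", "c 5:2 rwm",
        "c 10:200 rwm", "c 136:* rwm"}
crun = {"c *:* m", "b *:* m", "c 1:3 rwm", "c 1:8 rwm", "c 1:7 rwm",
        "c 5:0 rwm", "c 1:5 rwm", "c 1:9 rwm", "c 5:1 rwm",
        "c 136:* rwm", "c 5:2 rwm"}
print(sorted(runc - crun))  # ['c 10:200 rwm']  (/dev/net/tun)
print(sorted(crun - runc))  # ['c 5:1 rwm']     (/dev/console)
```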
Hi @giuseppe, thanks for your reply. The outer container is created with a command like
> docker run --rm -it -v $(pwd):/oci-bundle --privileged fedora:39 bash
The Docker config I'm running on my laptop is
> docker info
Client:
 Version: 24.0.7-ce
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
   Version: 0.11.2
   Path: /usr/lib/docker/cli-plugins/docker-buildx

Server:
 Containers: 6
  Running: 1
  Paused: 0
  Stopped: 5
 Images: 274
 Server Version: 24.0.7-ce
 Storage Driver: btrfs
  Btrfs:
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 oci runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8e4b0bde866788eec76735cc77c4720144248fb7
 runc version: v1.1.10-0-g18a0cb0f32bc
 init version:
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 5.14.21-150400.24.81-default
 Operating System: openSUSE Leap 15.4
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.06GiB
 Name: carbon
 ID: 7IY2:7RUT:5VJZ:QQKN:S75T:CZBI:VM4J:UHRR:JT5K:75EH:FCU5:IKA7
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: madeeks
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
I thought the behavior could be related to the cgroup driver, but I obtained the same results (i.e., different device allow lists) when using runc/crun on an Ubuntu 20.04 VM where Docker uses the cgroupfs driver.
I can also reproduce the results with Podman using both systemd and cgroupfs arguments to --cgroup-manager (in this case I ran rootful Podman since the option is not supported with rootless Podman and cgroups v1).
I'll keep digging into how my container engines set up cgroups in the outer containers.
> I thought the behavior could be related to the cgroup driver, but I obtained the same results (i.e. different device allowed lists) when using runc/crun on a Ubuntu 20.04 VM using Docker with a cgroupfs driver.
Do you get the same results with runc and crun?
Apologies for the ambiguous wording.
I get the same results on my openSUSE laptop and on the Ubuntu 20.04 VM. That is, on both systems the device cgroups produced by runc and crun differ (when the runtimes are started inside a Docker container).
The Docker cgroup driver differs between the two platforms: systemd on the openSUSE laptop, cgroupfs on the Ubuntu VM.
If I create the outer container with Podman, I still observe the crun/runc device cgroup differences.
This happens even when using different values for Podman's --cgroup-manager option.