
cgroup/systemd: fix making CharDevice path in systemdProperties

Open · yangfeiyu20102011 opened this pull request 3 years ago · 10 comments

cgroup/systemd: func systemdProperties sets the CharDevice path as /dev/char/<major>:<minor> (for example /dev/char/0:0),

but NVIDIA devices, such as those with major/minor numbers 195:* and 507:*, cannot be found under /dev/char/<major>:<minor>; getNVIDIAEntryPath fixes this problem.

Signed-off-by: yangfeiyu20102011 [email protected]
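
For context, a minimal illustrative sketch in Go (not the actual runc code) of why the missing symlink matters: systemd's DeviceAllow= property takes a device node path (or a char-*/block-* group name), and for an individual character device the conventional path is /dev/char/<major>:<minor>, which udev does not create for these NVIDIA devices.

package devpath

import (
	"fmt"
	"os"
)

// charDevicePath builds the /dev/char/MM:mm path for a character-device
// rule and reports an error if no such node exists on the host. When the
// node is missing (as for the NVIDIA devices above), a DeviceAllow rule
// that references it is not applied, which is the failure shown later in
// this thread (devices.list ends up with only "c 195:* m").
func charDevicePath(major, minor int64) (string, error) {
	path := fmt.Sprintf("/dev/char/%d:%d", major, minor)
	if _, err := os.Stat(path); err != nil {
		return "", fmt.Errorf("no node for char device %d:%d: %w", major, minor, err)
	}
	return path, nil
}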

yangfeiyu20102011 avatar Aug 23 '22 07:08 yangfeiyu20102011

PTAL, thanks! cc @AkihiroSuda @thaJeztah https://github.com/opencontainers/runc/issues/3567

yangfeiyu20102011 avatar Aug 23 '22 11:08 yangfeiyu20102011

Left a comment about the implementation, but in general it feels rather odd to have this exception for these devices, and I wonder if it should be included in runc, being the reference implementation of the OCI runtime spec; does the spec describe anything about this special case?

Thanks, I have modified the code. When doing runc update, it will skip setting the cgroup device files. If the spec contains devices like /dev/nvidia*, it will generate DeviceAllow.conf as follows.

cat /run/systemd/system/cri-containerd-74e9a65ee73edfecdf345f477e4bfb44a39428243d4a64519cd860fcc0f6901b.scope.d/50-DeviceAllow.conf
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/195:0 rw
DeviceAllow=/dev/char/195:1 rw

The DeviceAllow=/dev/char/195:0 rw entries will not take effect, because those /dev/char/195:* nodes do not exist on the host.

And if DevicePolicy.conf sets DevicePolicy=strict, devices.list may end up with only c 195:* m after setUnitProperties(m.dbus, unitName, properties...):

cat /run/systemd/system/cri-containerd-74e9a65ee73edfecdf345f477e4bfb44a39428243d4a64519cd860fcc0f6901b.scope.d/50-DevicePolicy.conf
[Scope]
DevicePolicy=strict

cat /sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podeeccd2f8_7bef_4054_a659_6554b908432a.slice/cri-containerd-74e9a65ee73edfecdf345f477e4bfb44a39428243d4a64519cd860fcc0f6901b.scope/devices.list
c 136:* rwm
c 5:2 rwm
c 195:* m

yangfeiyu20102011 avatar Aug 23 '22 13:08 yangfeiyu20102011

Indeed, not all character devices have a /dev/char/MM:mm equivalent, for some reason. Here's what I found on my machine (a Fedora 36 laptop running kernel 5.17.14-300.fc36.x86_64):

[kir@kir-rhat linux]$ ls -lR /dev | grep ^c | awk '{print $10, $5, $6}' | sed -e 's|, |:|' -e 's| | /dev/char/|' | awk '{printf "/dev/" $1 "\t"; system("ls -l " $2);}' 2>&1 | grep cannot
/dev/cuse	ls: cannot access '/dev/char/10:203': No such file or directory
/dev/lp0	ls: cannot access '/dev/char/6:0': No such file or directory
/dev/lp1	ls: cannot access '/dev/char/6:1': No such file or directory
/dev/lp2	ls: cannot access '/dev/char/6:2': No such file or directory
/dev/lp3	ls: cannot access '/dev/char/6:3': No such file or directory
/dev/ppp	ls: cannot access '/dev/char/108:0': No such file or directory
/dev/uhid	ls: cannot access '/dev/char/10:239': No such file or directory
/dev/uinput	ls: cannot access '/dev/char/10:223': No such file or directory
/dev/vhci	ls: cannot access '/dev/char/10:137': No such file or directory
/dev/vhost-vsock	ls: cannot access '/dev/char/10:241': No such file or directory

(there might be some more char devices in subdirectories of /dev)

For block devices, I haven't found any that lack a symlink in /dev/block.

Perhaps what we should do is try using the device path set in the spec when /dev/char/MM:mm is not found. WDYT @cyphar
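
For illustration, a rough Go sketch of that fallback (hypothetical helper, not runc code); as noted further down in the thread, the OCI device cgroup rule itself carries no path, so a real implementation would have to correlate the rule with the devices listed in the spec, and the fallback path would still need validation:

package devpath

import (
	"fmt"
	"os"
)

// charDeviceAllowPath prefers the canonical /dev/char/MM:mm symlink and
// falls back to a path taken from the spec only when that symlink is
// missing. The fallback path must still be verified against the rule's
// major/minor before use (see the later comments in this thread).
func charDeviceAllowPath(major, minor int64, specPath string) string {
	canonical := fmt.Sprintf("/dev/char/%d:%d", major, minor)
	if _, err := os.Stat(canonical); err == nil {
		return canonical
	}
	if specPath != "" {
		return specPath
	}
	// Nothing better is known; keep the old (possibly non-working) path.
	return canonical
}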

kolyshkin avatar Aug 24 '22 01:08 kolyshkin

Right now the NVIDIA devices in DeviceAllow.conf are not written as expected. This PR can solve some NVIDIA GPU rw problems and is at least an improvement. We can solve the char device problem completely, in a better way, in the future. cc @thaJeztah @cyphar

yangfeiyu20102011 avatar Aug 26 '22 02:08 yangfeiyu20102011

@yangfeiyu20102011 can you please provide an OCI spec example with NVIDIA devices added?

kolyshkin avatar Aug 31 '22 02:08 kolyshkin

@yangfeiyu20102011 can you please provide an OCI spec example with NVIDIA devices added?

cc @kolyshkin OK, here are the spec and the DeviceAllow.conf.

OCI spec:

{
    "ociVersion": "1.0.2-dev",
    "process":
    {
        "user":
        {
            "uid": 0,
            "gid": 0
        },
        "args":
        [
            "sleep",
            "36000"
        ],
        "env":
        [
            "PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "HOSTNAME=gpu-operator-test",
            "NVARCH=x86_64",
            "NVIDIA_REQUIRE_CUDA=cuda>=11.0 brand=tesla,driver>=418,driver<419",
            "NV_CUDA_CUDART_VERSION=11.0.221-1",
            "NV_CUDA_COMPAT_PACKAGE=cuda-compat-11-0",
            "CUDA_VERSION=11.0.3",
            "LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
            "NVIDIA_VISIBLE_DEVICES=all",
            "NVIDIA_DRIVER_CAPABILITIES=compute,utility",
            "NVIDIA_VISIBLE_DEVICES=GPU-44f83262-58b8-2db0-7960-01d193fcf7b5",
            "NOT_HOST_NETWORK=true",
            "KUBERNETES_SERVICE_PORT=443",
            "KUBERNETES_SERVICE_PORT_HTTPS=443"
        ],
        "cwd": "/",
        "capabilities":
        {
            "bounding":
            [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "effective":
            [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "permitted":
            [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ]
        },
        "oomScoreAdj": 1000
    },
    "root":
    {
        "path": "rootfs"
    },
    "mounts":
    [
        {
            "destination": "/proc",
            "type": "proc",
            "source": "proc",
            "options":
            [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/dev",
            "type": "tmpfs",
            "source": "tmpfs",
            "options":
            [
                "nosuid",
                "strictatime",
                "mode=755",
                "size=65536k"
            ]
        },
        {
            "destination": "/dev/pts",
            "type": "devpts",
            "source": "devpts",
            "options":
            [
                "nosuid",
                "noexec",
                "newinstance",
                "ptmxmode=0666",
                "mode=0620",
                "gid=5"
            ]
        },
        {
            "destination": "/dev/mqueue",
            "type": "mqueue",
            "source": "mqueue",
            "options":
            [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/sys",
            "type": "sysfs",
            "source": "sysfs",
            "options":
            [
                "nosuid",
                "noexec",
                "nodev",
                "ro"
            ]
        },
        {
            "destination": "/sys/fs/cgroup",
            "type": "cgroup",
            "source": "cgroup",
            "options":
            [
                "nosuid",
                "noexec",
                "nodev",
                "relatime",
                "ro"
            ]
        },
        {
            "destination": "/dev/nvidiactl",
            "type": "bind",
            "source": "/dev/nvidiactl",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/dev/nvidia0",
            "type": "bind",
            "source": "/dev/nvidia0",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/dev/nvidia-uvm",
            "type": "bind",
            "source": "/dev/nvidia-uvm",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/etc/hosts",
            "type": "bind",
            "source": "/data/kubelet/pods/c043b554-0f9c-4db6-874b-6977ee24fa96/etc-hosts",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/dev/termination-log",
            "type": "bind",
            "source": "/data/kubelet/pods/c043b554-0f9c-4db6-874b-6977ee24fa96/containers/cuda-vector-add/e8ed181c",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/etc/hostname",
            "type": "bind",
            "source": "/media/disk1/containerd/io.containerd.grpc.v1.cri/sandboxes/c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd/hostname",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/etc/resolv.conf",
            "type": "bind",
            "source": "/media/disk1/containerd/io.containerd.grpc.v1.cri/sandboxes/c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd/resolv.conf",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/dev/shm",
            "type": "bind",
            "source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd/shm",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/var/run/secrets/kubernetes.io/serviceaccount",
            "type": "bind",
            "source": "/data/kubelet/pods/c043b554-0f9c-4db6-874b-6977ee24fa96/volumes/kubernetes.io~secret/default-token",
            "options":
            [
                "rbind",
                "rprivate",
                "ro"
            ]
        }
    ],
    "hooks":
    {
        "prestart":
        [
            {
                "path": "/usr/bin/nvidia-container-runtime-hook",
                "args":
                [
                    "/usr/bin/nvidia-container-runtime-hook",
                    "prestart"
                ]
            }
        ]
    },
    "annotations":
    {
        "io.kubernetes.cri.container-name": "cuda-vector-add",
        "io.kubernetes.cri.container-type": "container",
        "io.kubernetes.cri.image-name": "docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04",
        "io.kubernetes.cri.sandbox-id": "c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd",
        "io.kubernetes.cri.sandbox-name": "gpu-operator-test",
        "io.kubernetes.cri.sandbox-namespace": "default"
    },
    "linux":
    {
        "resources":
        {
            "devices":
            [
                {
                    "allow": false,
                    "access": "rwm"
                },
                {
                    "allow": true,
                    "type": "c",
                    "major": 195,
                    "minor": 255,
                    "access": "rw"
                },
                {
                    "allow": true,
                    "type": "c",
                    "major": 195,
                    "minor": 3,
                    "access": "rw"
                }
            ],
            "memory":
            {},
            "cpu":
            {
                "shares": 2,
                "period": 100000,
                "cpus": "0-79"
            }
        },
        "cgroupsPath": "kubepods-besteffort-podc043b554_0f9c_4db6_874b_6977ee24fa96.slice:cri-containerd:73376298ce204adb73424bc020366b89281562a5560cbfaeaee0af0f39071511",
        "namespaces":
        [
            {
                "type": "pid"
            },
            {
                "type": "ipc",
                "path": "/proc/340434/ns/ipc"
            },
            {
                "type": "uts",
                "path": "/proc/340434/ns/uts"
            },
            {
                "type": "mount"
            },
            {
                "type": "network",
                "path": "/proc/340434/ns/net"
            }
        ],
        "devices":
        [
            {
                "path": "/dev/nvidiactl",
                "type": "c",
                "major": 195,
                "minor": 255,
                "fileMode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "path": "/dev/nvidia3",
                "type": "c",
                "major": 195,
                "minor": 3,
                "fileMode": 438,
                "uid": 0,
                "gid": 0
            }
        ],
        "maskedPaths":
        [
            "/proc/acpi",
            "/proc/kcore",
            "/proc/keys",
            "/proc/latency_stats",
            "/proc/timer_list",
            "/proc/timer_stats",
            "/proc/sched_debug",
            "/proc/scsi",
            "/sys/firmware"
        ],
        "readonlyPaths":
        [
            "/proc/asound",
            "/proc/bus",
            "/proc/fs",
            "/proc/irq",
            "/proc/sys",
            "/proc/sysrq-trigger"
        ]
    }
}

systemd conf:

cat /run/systemd/system/cri-containerd-73376298ce204adb73424bc020366b89281562a5560cbfaeaee0af0f39071511.scope.d/50-DeviceAllow.conf
[Scope]
DeviceAllow=
DeviceAllow=/dev/char/195:255 rw
DeviceAllow=/dev/char/195:3 rw
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

yangfeiyu20102011 avatar Sep 01 '22 05:09 yangfeiyu20102011

@thaJeztah @kolyshkin Hi, is there a better solution to this problem?

yangfeiyu20102011 avatar Sep 19 '22 07:09 yangfeiyu20102011

Perhaps what we should do is try using the device path set in the spec when /dev/char/MM:mm is not found.

Hmmm. This would work for most device configurations (though not all of them), but we should absolutely double-check that the path on the host has the same major/minor numbers as the rule that references it (otherwise we may end up with a security issue).
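
A minimal Go sketch of that double-check (assumed helper name, not runc's code): stat the host path and confirm it is a character device with exactly the rule's major/minor before using it in a DeviceAllow entry.

package devpath

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// verifyCharDevice ensures path is a character device whose major:minor
// matches the cgroup rule, so a renamed or replaced node cannot silently
// widen the rule to a different device.
func verifyCharDevice(path string, major, minor uint32) error {
	var st unix.Stat_t
	if err := unix.Stat(path, &st); err != nil {
		return err
	}
	if st.Mode&unix.S_IFMT != unix.S_IFCHR {
		return fmt.Errorf("%s is not a character device", path)
	}
	rdev := uint64(st.Rdev)
	if unix.Major(rdev) != major || unix.Minor(rdev) != minor {
		return fmt.Errorf("%s has %d:%d, rule expects %d:%d",
			path, unix.Major(rdev), unix.Minor(rdev), major, minor)
	}
	return nil
}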

cyphar avatar Sep 19 '22 09:09 cyphar

Perhaps what we should do is try using the device path set in the spec when /dev/char/MM:mm is not found.

Hmmm. This would work for most device configurations (though not all of them), but we should absolutely double-check that the path on the host has the same major/minor numbers as the rule that references it (otherwise we may end up with a security issue).

@cyphar Thanks. Is there a plan for fixing this issue? I can use this patch in my personal project, but I still hope this problem can be fixed in upstream runc.

yangfeiyu20102011 avatar Sep 20 '22 03:09 yangfeiyu20102011

Perhaps what we should do is try using the device path set in the spec when /dev/char/MM:mm is not found.

Hmmm. This would work for most device configurations (though not all of them), but we should absolutely double-check that the path on the host has the same major/minor numbers as the rule that references it (otherwise we may end up with a security issue).

Checking is not an issue. The fact that LinuxDeviceCgroup in the OCI runtime spec doesn't have a Path field is.

Now I'm thinking about creating a device file and passing it to systemd; this might be easier and less error prone.
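
For illustration, a rough Go sketch of that idea (hypothetical paths and names; this is not what runc ended up doing): create a throwaway device node with the rule's major/minor, hand its path to systemd, and remove it afterwards. Note that mknod(2) requires CAP_MKNOD.

package devnode

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// makeTempCharNode creates <dir>/<major>:<minor> as a character device
// node so that a path can be passed to systemd even when /dev/char lacks
// the symlink. The caller removes the node once systemd has consumed it.
func makeTempCharNode(dir string, major, minor uint32) (string, error) {
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return "", err
	}
	path := filepath.Join(dir, fmt.Sprintf("%d:%d", major, minor))
	dev := unix.Mkdev(major, minor)
	if err := unix.Mknod(path, unix.S_IFCHR|0o600, int(dev)); err != nil && !errors.Is(err, os.ErrExist) {
		return "", err
	}
	return path, nil
}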

kolyshkin avatar Sep 20 '22 03:09 kolyshkin

Is there any better solution for this issue?

zvier avatar Dec 14 '22 04:12 zvier

The same problem is reported in https://github.com/NVIDIA/nvidia-docker/issues/1730, and a fix will be present in the next patch release of all supported NVIDIA GPU drivers.

zvier avatar Feb 27 '23 07:02 zvier

This is now being fixed by #3842.

kolyshkin avatar Apr 25 '23 00:04 kolyshkin