cgroup/systemd: fix making CharDevice path in systemdProperties
cgroup/systemd: func systemdProperties sets the CharDevice path to something like /dev/char/0:0,
but NVIDIA devices (such as those under majors 195 and 507) cannot be found under /dev/char/x:x; getNVIDIAEntryPath fixes this problem.
Signed-off-by: yangfeiyu20102011 [email protected]
PTAL, thanks! cc @AkihiroSuda @thaJeztah https://github.com/opencontainers/runc/issues/3567
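For reference, the path runc passes to systemd is derived from the device rule's type and major/minor numbers roughly like this (a simplified sketch, not the exact runc code):

```go
package main

import "fmt"

// deviceAllowPath builds the path systemd expects in a DeviceAllow rule from
// an OCI device rule's type and major/minor numbers (simplified sketch).
func deviceAllowPath(devType string, major, minor int64) string {
	switch devType {
	case "c":
		return fmt.Sprintf("/dev/char/%d:%d", major, minor)
	case "b":
		return fmt.Sprintf("/dev/block/%d:%d", major, minor)
	}
	return ""
}
```

The problem is that symlinks like /dev/char/195:0 for NVIDIA devices may not exist on the host.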
Left a comment about the implementation, but in general it feels rather odd to have this exception for these devices, and I wonder whether it should be included in runc, which is the reference implementation of the OCI runtime spec; does the spec describe anything about this special case?
Thanks, I have modified the code.
When doing runc update, it will skip setting the cgroup device files.
If the spec contains devices like /dev/nvidia*, it will generate a DeviceAllow.conf as follows.
cat /run/systemd/system/cri-containerd-74e9a65ee73edfecdf345f477e4bfb44a39428243d4a64519cd860fcc0f6901b.scope.d/50-DeviceAllow.conf
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/195:0 rw
DeviceAllow=/dev/char/195:1 rw
The DeviceAllow=/dev/char/195:0 rw rule will not work.
And if DevicePolicy.conf sets DevicePolicy=strict, the 195:* entry in devices.list may end up as
c 195:* m
after
setUnitProperties(m.dbus, unitName, properties...)
cat /run/systemd/system/cri-containerd-74e9a65ee73edfecdf345f477e4bfb44a39428243d4a64519cd860fcc0f6901b.scope.d/50-DevicePolicy.conf
[Scope]
DevicePolicy=strict
cat /sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podeeccd2f8_7bef_4054_a659_6554b908432a.slice/cri-containerd-74e9a65ee73edfecdf345f477e4bfb44a39428243d4a64519cd860fcc0f6901b.scope/devices.list
c 136:* rwm
c 5:2 rwm
c 195:* m
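For context, the DeviceAllow= lines above are sent to systemd as a D-Bus property of type a(ss), i.e. an array of (path, permissions) pairs. A minimal sketch of such a call using the go-systemd API (not the actual runc code):

```go
package main

import (
	"context"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

// deviceAllowEntry mirrors systemd's DeviceAllow property, an array of
// (path, permissions) string pairs, i.e. D-Bus type a(ss).
type deviceAllowEntry struct {
	Path  string
	Perms string
}

func setDeviceAllow(conn *systemdDbus.Conn, unit string) error {
	entries := []deviceAllowEntry{
		{"/dev/char/195:255", "rw"},
		{"/dev/char/195:3", "rw"},
	}
	props := []systemdDbus.Property{
		{Name: "DevicePolicy", Value: godbus.MakeVariant("strict")},
		{Name: "DeviceAllow", Value: godbus.MakeVariant(entries)},
	}
	// Apply the properties at runtime (the "true" argument).
	return conn.SetUnitPropertiesContext(context.Background(), unit, true, props...)
}
```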
Indeed, not all character devices have /dev/char/MM:mm equivalent for some reason. Here's what I found on my machine (Fedora 36 laptop running kernel 5.17.14-300.fc36.x86_64):
[kir@kir-rhat linux]$ ls -lR /dev | grep ^c | awk '{print $10, $5, $6}' | sed -e 's|, |:|' -e 's| | /dev/char/|' | awk '{printf "/dev/" $1 "\t"; system("ls -l " $2);}' 2>&1 | grep cannot
/dev/cuse ls: cannot access '/dev/char/10:203': No such file or directory
/dev/lp0 ls: cannot access '/dev/char/6:0': No such file or directory
/dev/lp1 ls: cannot access '/dev/char/6:1': No such file or directory
/dev/lp2 ls: cannot access '/dev/char/6:2': No such file or directory
/dev/lp3 ls: cannot access '/dev/char/6:3': No such file or directory
/dev/ppp ls: cannot access '/dev/char/108:0': No such file or directory
/dev/uhid ls: cannot access '/dev/char/10:239': No such file or directory
/dev/uinput ls: cannot access '/dev/char/10:223': No such file or directory
/dev/vhci ls: cannot access '/dev/char/10:137': No such file or directory
/dev/vhost-vsock ls: cannot access '/dev/char/10:241': No such file or directory
(there might be some more char devices in subdirectories of /dev)
For block devices, I haven't found any that does not have a symlink in /dev/block.
Perhaps what we should do is to try using device path set in spec, in case /dev/char/MM:mm is not found. WDYT @cyphar
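A programmatic equivalent of the one-liner above, as a rough sketch (it only scans /dev itself, not subdirectories):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// List char devices directly under /dev that have no /dev/char/<maj>:<min>
// symlink.
func main() {
	entries, err := os.ReadDir("/dev")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		path := filepath.Join("/dev", e.Name())
		var st unix.Stat_t
		if err := unix.Stat(path, &st); err != nil {
			continue
		}
		if st.Mode&unix.S_IFMT != unix.S_IFCHR {
			continue
		}
		charPath := fmt.Sprintf("/dev/char/%d:%d", unix.Major(st.Rdev), unix.Minor(st.Rdev))
		if _, err := os.Stat(charPath); err != nil {
			fmt.Printf("%s\t%s missing\n", path, charPath)
		}
	}
}
```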
Now the NVIDIA devices in DeviceAllow.conf are not as expected. This PR solves some NVIDIA GPU rw problems and is at least an improvement. We can solve the char-device problem completely in a better way in the future. cc @thaJeztah @cyphar
@yangfeiyu20102011 can you please provide OCI spec example with NVidia devices added?
cc @kolyshkin OK, here are the spec and DeviceAllow.conf.
OCI spec:
{
"ociVersion": "1.0.2-dev",
"process":
{
"user":
{
"uid": 0,
"gid": 0
},
"args":
[
"sleep",
"36000"
],
"env":
[
"PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HOSTNAME=gpu-operator-test",
"NVARCH=x86_64",
"NVIDIA_REQUIRE_CUDA=cuda>=11.0 brand=tesla,driver>=418,driver<419",
"NV_CUDA_CUDART_VERSION=11.0.221-1",
"NV_CUDA_COMPAT_PACKAGE=cuda-compat-11-0",
"CUDA_VERSION=11.0.3",
"LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
"NVIDIA_VISIBLE_DEVICES=all",
"NVIDIA_DRIVER_CAPABILITIES=compute,utility",
"NVIDIA_VISIBLE_DEVICES=GPU-44f83262-58b8-2db0-7960-01d193fcf7b5",
"NOT_HOST_NETWORK=true",
"KUBERNETES_SERVICE_PORT=443",
"KUBERNETES_SERVICE_PORT_HTTPS=443"
],
"cwd": "/",
"capabilities":
{
"bounding":
[
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"effective":
[
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"permitted":
[
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
]
},
"oomScoreAdj": 1000
},
"root":
{
"path": "rootfs"
},
"mounts":
[
{
"destination": "/proc",
"type": "proc",
"source": "proc",
"options":
[
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options":
[
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options":
[
"nosuid",
"noexec",
"newinstance",
"ptmxmode=0666",
"mode=0620",
"gid=5"
]
},
{
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options":
[
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options":
[
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/sys/fs/cgroup",
"type": "cgroup",
"source": "cgroup",
"options":
[
"nosuid",
"noexec",
"nodev",
"relatime",
"ro"
]
},
{
"destination": "/dev/nvidiactl",
"type": "bind",
"source": "/dev/nvidiactl",
"options":
[
"rbind",
"rprivate",
"rw"
]
},
{
"destination": "/dev/nvidia0",
"type": "bind",
"source": "/dev/nvidia0",
"options":
[
"rbind",
"rprivate",
"rw"
]
},
{
"destination": "/dev/nvidia-uvm",
"type": "bind",
"source": "/dev/nvidia-uvm",
"options":
[
"rbind",
"rprivate",
"rw"
]
},
{
"destination": "/etc/hosts",
"type": "bind",
"source": "/data/kubelet/pods/c043b554-0f9c-4db6-874b-6977ee24fa96/etc-hosts",
"options":
[
"rbind",
"rprivate",
"rw"
]
},
{
"destination": "/dev/termination-log",
"type": "bind",
"source": "/data/kubelet/pods/c043b554-0f9c-4db6-874b-6977ee24fa96/containers/cuda-vector-add/e8ed181c",
"options":
[
"rbind",
"rprivate",
"rw"
]
},
{
"destination": "/etc/hostname",
"type": "bind",
"source": "/media/disk1/containerd/io.containerd.grpc.v1.cri/sandboxes/c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd/hostname",
"options":
[
"rbind",
"rprivate",
"rw"
]
},
{
"destination": "/etc/resolv.conf",
"type": "bind",
"source": "/media/disk1/containerd/io.containerd.grpc.v1.cri/sandboxes/c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd/resolv.conf",
"options":
[
"rbind",
"rprivate",
"rw"
]
},
{
"destination": "/dev/shm",
"type": "bind",
"source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd/shm",
"options":
[
"rbind",
"rprivate",
"rw"
]
},
{
"destination": "/var/run/secrets/kubernetes.io/serviceaccount",
"type": "bind",
"source": "/data/kubelet/pods/c043b554-0f9c-4db6-874b-6977ee24fa96/volumes/kubernetes.io~secret/default-token",
"options":
[
"rbind",
"rprivate",
"ro"
]
}
],
"hooks":
{
"prestart":
[
{
"path": "/usr/bin/nvidia-container-runtime-hook",
"args":
[
"/usr/bin/nvidia-container-runtime-hook",
"prestart"
]
}
]
},
"annotations":
{
"io.kubernetes.cri.container-name": "cuda-vector-add",
"io.kubernetes.cri.container-type": "container",
"io.kubernetes.cri.image-name": "docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04",
"io.kubernetes.cri.sandbox-id": "c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd",
"io.kubernetes.cri.sandbox-name": "gpu-operator-test",
"io.kubernetes.cri.sandbox-namespace": "default"
},
"linux":
{
"resources":
{
"devices":
[
{
"allow": false,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 195,
"minor": 255,
"access": "rw"
},
{
"allow": true,
"type": "c",
"major": 195,
"minor": 3,
"access": "rw"
}
],
"memory":
{},
"cpu":
{
"shares": 2,
"period": 100000,
"cpus": "0-79"
}
},
"cgroupsPath": "kubepods-besteffort-podc043b554_0f9c_4db6_874b_6977ee24fa96.slice:cri-containerd:73376298ce204adb73424bc020366b89281562a5560cbfaeaee0af0f39071511",
"namespaces":
[
{
"type": "pid"
},
{
"type": "ipc",
"path": "/proc/340434/ns/ipc"
},
{
"type": "uts",
"path": "/proc/340434/ns/uts"
},
{
"type": "mount"
},
{
"type": "network",
"path": "/proc/340434/ns/net"
}
],
"devices":
[
{
"path": "/dev/nvidiactl",
"type": "c",
"major": 195,
"minor": 255,
"fileMode": 438,
"uid": 0,
"gid": 0
},
{
"path": "/dev/nvidia3",
"type": "c",
"major": 195,
"minor": 3,
"fileMode": 438,
"uid": 0,
"gid": 0
}
],
"maskedPaths":
[
"/proc/acpi",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/proc/scsi",
"/sys/firmware"
],
"readonlyPaths":
[
"/proc/asound",
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
}
}
systemd conf:
cat /run/systemd/system/cri-containerd-73376298ce204adb73424bc020366b89281562a5560cbfaeaee0af0f39071511.scope.d/50-DeviceAllow.conf
[Scope]
DeviceAllow=
DeviceAllow=/dev/char/195:255 rw
DeviceAllow=/dev/char/195:3 rw
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m
@thaJeztah @kolyshkin Hi, is there a better solution to this problem?
Perhaps what we should do is to try using device path set in spec, in case /dev/char/MM:mm is not found.
Hmmm. This would work for most device configurations (though not all of them), but we should absolutely double-check that the path on the host has the same major/minor numbers as the rule that references it (otherwise we may end up with a security issue).
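A sketch of what such a fallback with the major/minor check could look like (getCharDevicePath and its signature are placeholders, not the actual runc code):

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// getCharDevicePath returns the path systemd should see for a char device.
// It prefers /dev/char/<major>:<minor>, and falls back to the path from the
// OCI spec when that symlink does not exist, verifying that the fallback
// node on the host really has the expected major/minor numbers.
func getCharDevicePath(major, minor int64, specPath string) (string, error) {
	charPath := fmt.Sprintf("/dev/char/%d:%d", major, minor)
	if _, err := os.Stat(charPath); err == nil {
		return charPath, nil
	}
	var st unix.Stat_t
	if err := unix.Stat(specPath, &st); err != nil {
		return "", err
	}
	if st.Mode&unix.S_IFMT != unix.S_IFCHR ||
		unix.Major(st.Rdev) != uint32(major) ||
		unix.Minor(st.Rdev) != uint32(minor) {
		return "", fmt.Errorf("%s is not char device %d:%d", specPath, major, minor)
	}
	return specPath, nil
}
```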
@cyphar Thanks. Is there a plan for solving this issue? I can use this patch in my personal project, but I still hope this problem can be fixed in the latest runc.
Checking is not an issue. The fact that LinuxDeviceCgroup in the OCI runtime spec doesn't have a Path field is.
Now I'm thinking about creating a device file and passing it to systemd; this might be easier and less error-prone.
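A rough sketch of that idea (the directory, naming, and cleanup policy here are assumptions, not necessarily what runc would do):

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// makeTempCharNode creates a throwaway character device node that systemd can
// stat to learn the major:minor, so the DeviceAllow rule no longer depends on
// /dev/char/<maj>:<min> existing.
func makeTempCharNode(dir string, major, minor uint32) (string, error) {
	node := filepath.Join(dir, fmt.Sprintf("char-%d-%d", major, minor))
	dev := unix.Mkdev(major, minor)
	// Mode 0600 char device; requires CAP_MKNOD.
	if err := unix.Mknod(node, unix.S_IFCHR|0o600, int(dev)); err != nil && !errors.Is(err, os.ErrExist) {
		return "", err
	}
	return node, nil
}
```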
Is there any better solution to this issue?
The same problem is described in https://github.com/NVIDIA/nvidia-docker/issues/1730, and a fix will be present in the next patch release of all supported NVIDIA GPU drivers.
This is now being fixed by #3842.