GPU Permission Lost Inside Container
Description
"Even after applying patch https://github.com/opencontainers/runc/pull/3824, the issue still persists on systemd version 249." https://github.com/opencontainers/runc/issues/3708
Extreme solution
If I comment out the code responsible for generating device properties, preventing systemd from taking over device permissions, the issue is resolved.
Steps to reproduce the issue
systemctl --version
systemd 249 (v249-48)
+PAM +AUDIT +SELINUX -APPARMOR +IMA -SMACK +SECCOMP +GCRYPT +GNUTLS -OPENSSL +ACL +BLKID -CURL -ELFUTILS -FIDO2 +IDN2 -IDN -IPTC +KMOD -LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB -ZSTD +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=legacy
docker guest:
nvidia-smi
Wed Aug 20 06:34:52 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:3B:00.0 Off |                  N/A |
| 22%   30C    P8     5W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Host:# systemctl daemon-reload
docker guest:
nvidia-smi
Failed to initialize NVML: Unknown Error
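For reference, the failure can be reproduced with a sequence like the following; the docker flags and image tag are illustrative assumptions, not taken from this report:

# Reproduction sketch (assumes the NVIDIA Container Toolkit is configured for docker;
# the image tag is only an example)
docker run -d --name gpu-test --gpus all nvidia/cuda:11.4.3-base-ubuntu20.04 sleep infinity
docker exec gpu-test nvidia-smi     # works
systemctl daemon-reload             # on the host
docker exec gpu-test nvidia-smi     # Failed to initialize NVML: Unknown Error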
Describe the results you received and expected
Expected: GPU device permissions remain available within the container after systemctl daemon-reload on the host. Received: nvidia-smi inside the container fails with "Failed to initialize NVML: Unknown Error".
What version of runc are you using?
runc 1.1.4
Host OS information
No response
Host kernel information
No response
@zhoaxiaohu the PR you mentioned does not seem related to this issue. Could you please update it, and also provide information on how your containers are being started?
Note that even if this fix is applied there is still a known issue as described here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html#containers-losing-access-to-gpus-with-error-failed-to-initialize-nvml-unknown-error
Run container:
ctr -n k8s.io run --runc-binary /usr/bin/nvidia-container-runtime --rm --tty --env NVIDIA_VISIBLE_DEVICES=3 --env KGPU_MEM_DEV=23028 --env KGPU_SCHD_WEIGHT=0 --env KGPU_MEM_CONTAINER=10000 org.gpu.com/vgpu/video-worker-faas:1.0-release test_gpu_9 /bin/bash
runc: https://github.com/opencontainers/runc tag: v1.3.0
If I comment out the code responsible for generating device properties, preventing systemd from taking over device permissions, the issue is resolved.
~~The nvidia-container-runtime uses runc hooks to reconfigure the cgroup without giving runc any information about the changed configuration, meaning that systemd is not aware of the NVIDIA devices being added and when systemd decides to reload its cgroup configuration it will reset the configuration to what it believes it should be.~~
~~From where I'm standing, this is an issue with nvidia-container-runtime -- if they want to modify the devices configuration, they should reconfigure systemd. This is actually required for cgroupv2 systems because the devices cgroup is now controlled through eBPF and you cannot just add an allow rule like you could in cgroupv1 -- you need to tell systemd about the rule in order for it to be included in the systemd-managed eBPF program for the cgroup. They should just add the necessary DeviceAllow properties to the TransientUnit for the container.~~
EDIT: I forgot that nvidia-container-runtime was modified a few years ago to change config.json directly, and I had also forgotten the history behind #3842, which is the PR you probably meant to link.
It is not a given that systemd will not reconfigure the devices cgroup if you do not tell it the rules you want -- my experience is that systemd will happily reconfigure any cgroup knob it likes unless you tell it what you want.
(For cgroupv2, commenting out a similar section to the one you commented out would disable all device rules and cause systemd to apply its default allow-all device rules.)
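To make the struck-out suggestion concrete: telling systemd about the device rules roughly amounts to setting DeviceAllow properties on the container's transient unit, e.g. via systemctl set-property. A sketch; the unit name and device paths here are placeholders, not taken from this issue:

# Add device rules to the systemd-managed scope so a daemon-reload does not drop them
systemctl set-property --runtime cri-containerd-<container-id>.scope \
    DeviceAllow="/dev/nvidia0 rwm" \
    DeviceAllow="/dev/nvidiactl rwm"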
@zhoaxiaohu Did you mean to link #3842 in the description?
As discussed in that PR, systemd 240 switched to parsing /dev/char/A:B directly so that issue should be solved for newer systemd versions. There was also a bug introduced in systemd v230 that seemed to be related (https://github.com/systemd/systemd/issues/35710), which was worked around in #4612.
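(As a quick check of that mechanism, one can verify that the /dev/char/<major>:<minor> symlinks systemd resolves actually exist for the NVIDIA nodes; the majors/minors below are just the ones reported later in this thread:)

# Do the /dev/char symlinks exist for the NVIDIA devices?
ls -l /dev/char/195:0 /dev/char/195:255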
Can you provide the contents of the corresponding /run/systemd/transient/$ctr.scope.d/50-Device*.conf files?
# systemctl status cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope
● cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope - libcontainer container 4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd>
Loaded: loaded (/run/systemd/transient/cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope; transient)
Transient: yes
Drop-In: /run/systemd/transient/cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope.d
└─50-CPUShares.conf, 50-DeviceAllow.conf, 50-DevicePolicy.conf
Active: active (running) since Fri 2025-08-08 11:51:18 CST; 1 week 6 days ago
Tasks: 1 (limit: 1645101)
Memory: 1.0M
CPU: 11ms
CGroup: /kubepods.slice/kubepods-pod30da9001_f4db_446b_877d_8f4cca69e72c.slice/cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope
└─ 13170 /pause
Notice: journal has been rotated since unit was started, output may be incomplete.
#cat /run/systemd/transient/cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope.d/50-DeviceAllow.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m
# cat /run/systemd/transient/cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope.d/50-DevicePolicy.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DevicePolicy=strict
My understanding is that for the /dev/nvidia* devices, runc either does not append the corresponding NVIDIA entries to the device allow list after parsing /proc/devices, or the device rules it propagates are incorrect, so systemd does not include them in 50-DeviceAllow.conf (see the listings below, and the sketch that follows them).
#cat /proc/devices | grep nvidia
195 nvidia-frontend
236 nvidia-nvswitch
237 nvidia-nvlink
238 nvidia-caps
511 nvidia-uvm
#ls /dev/nvidia* -l
crw-rw-rw- 1 root root 195, 0 Aug 8 11:47 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Aug 8 11:47 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Aug 8 11:47 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Aug 8 11:47 /dev/nvidia3
crw-rw-rw- 1 root root 195, 255 Aug 8 11:47 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Aug 8 11:47 /dev/nvidia-modeset
crw-rw-rw- 1 root root 511, 0 Aug 8 11:50 /dev/nvidia-uvm
crw-rw-rw- 1 root root 511, 1 Aug 8 11:50 /dev/nvidia-uvm-tools
/dev/nvidia-caps:
total 0
cr-------- 1 root root 238, 1 Aug 8 11:47 nvidia-cap1
cr--r--r-- 1 root root 238, 2 Aug 8 11:47 nvidia-cap2
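Given those majors/minors, the drop-in for a container that keeps GPU access would be expected to also contain entries along these lines (a sketch, not output from this host; which 195:N minor appears depends on the GPU exposed to the container, e.g. 195:3 for NVIDIA_VISIBLE_DEVICES=3):

# Expected additional entries in 50-DeviceAllow.conf (sketch)
DeviceAllow=/dev/char/195:3 rwm
DeviceAllow=/dev/char/195:255 rwm
DeviceAllow=/dev/char/195:254 rwm
DeviceAllow=/dev/char/511:0 rwm
DeviceAllow=/dev/char/511:1 rwm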
So nvidia-container-runtime is supposed to be setting these devices in our config.json, in which case they should show up in DeviceAllow (those entries are set based on our configuration of the transient unit). I don't have an NVIDIA GPU to test with, unfortunately...
Hi @cyphar @zhoaxiaohu @elezar, I tested your suggestion of explicitly setting the devices in config.json, and it worked as expected. Below are the steps I followed. I set up config.json manually for now, but I'll be raising a PR upstream in nvidia-container-runtime to address this properly.
- Created a config.json with nvidia devices in linux.resources.devices:
"linux": {
"resources": {
"devices": [
{ "allow": true, "type": "c", "major": 195, "minor": 0, "access": "rwm" },
{ "allow": true, "type": "c", "major": 195, "minor": 255, "access": "rwm" },
{ "allow": true, "type": "c", "major": 236, "minor": 0, "access": "rwm" },
{ "allow": true, "type": "c", "major": 236, "minor": 1, "access": "rwm" }
]
},
"cgroupsPath": "system.slice:runc:gpu-hook-test"
}
- Verified that DeviceAllow now includes the NVIDIA devices (see the verification sketch below)
- Ran nvidia-smi before and after systemctl daemon-reload, and it works without issues
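For completeness, a sketch of that verification; the unit name runc-gpu-hook-test.scope is an assumption derived from the cgroupsPath above, not something reported in this thread:

# Verification sketch
systemctl show -p DeviceAllow runc-gpu-hook-test.scope
cat /run/systemd/transient/runc-gpu-hook-test.scope.d/50-DeviceAllow.conf
systemctl daemon-reload
# nvidia-smi inside the container still succeeds after the reload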