GPU Permission Lost Inside Container
Description
"Even after applying patch https://github.com/opencontainers/runc/pull/3824, the issue still persists on systemd version 249." https://github.com/opencontainers/runc/issues/3708
Extreme solution
If I comment out the code responsible for generating device properties, preventing systemd from taking over device permissions, the issue is resolved.
Steps to reproduce the issue
systemctl --version
systemd 249 (v249-48)
+PAM +AUDIT +SELINUX -APPARMOR +IMA -SMACK +SECCOMP +GCRYPT +GNUTLS -OPENSSL +ACL +BLKID -CURL -ELFUTILS -FIDO2 +IDN2 -IDN -IPTC +KMOD -LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB -ZSTD +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=legacy
docker guest:
nvidia-smi
Wed Aug 20 06:34:52 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:3B:00.0 Off |                  N/A |
| 22%   30C    P8     5W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Host:# systemctl daemon-reload
docker guest:
nvidia-smi
Failed to initialize NVML: Unknown Error
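For reference, the failure can be reproduced with a sequence like the following; the docker flags and image tag are illustrative assumptions, not taken from this report:

# Reproduction sketch (assumes the NVIDIA Container Toolkit is configured for docker;
# the image tag is only an example)
docker run -d --name gpu-test --gpus all nvidia/cuda:11.4.3-base-ubuntu20.04 sleep infinity
docker exec gpu-test nvidia-smi     # works
systemctl daemon-reload             # on the host
docker exec gpu-test nvidia-smi     # Failed to initialize NVML: Unknown Error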
Describe the results you received and expected
Expected: GPU device permissions remain available within the container after systemctl daemon-reload on the host. Received: nvidia-smi inside the container fails with "Failed to initialize NVML: Unknown Error".
What version of runc are you using?
runc 1.1.4
Host OS information
No response
Host kernel information
No response
@zhoaxiaohu the PR you mentioned does not seem related to this issue. Could you please update it, and also provide information on how your containers are being started?
Note that even if this fix is applied there is still a known issue as described here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html#containers-losing-access-to-gpus-with-error-failed-to-initialize-nvml-unknown-error
Run container:
ctr -n k8s.io run --runc-binary /usr/bin/nvidia-container-runtime --rm --tty --env NVIDIA_VISIBLE_DEVICES=3 --env KGPU_MEM_DEV=23028 --env KGPU_SCHD_WEIGHT=0 --env KGPU_MEM_CONTAINER=10000 org.gpu.com/vgpu/video-worker-faas:1.0-release test_gpu_9 /bin/bash
runc: https://github.com/opencontainers/runc tag: v1.3.0
If I comment out the code responsible for generating device properties, preventing systemd from taking over device permissions, the issue is resolved.
~~The nvidia-container-runtime uses runc hooks to reconfigure the cgroup without giving runc any information about the changed configuration, meaning that systemd is not aware of the NVIDIA devices being added and when systemd decides to reload its cgroup configuration it will reset the configuration to what it believes it should be.~~
~~From where I'm standing, this is an issue with nvidia-container-runtime -- if they want to modify the devices configuration, they should reconfigure systemd. This is actually required for cgroupv2 systems because the devices cgroup is now controlled through eBPF and you cannot just add an allow rule like you could in cgroupv1 -- you need to tell systemd about the rule in order for it to be included in the systemd-managed eBPF program for the cgroup. They should just add the necessary DeviceAllow properties to the TransientUnit for the container.~~
EDIT: I forgot that nvidia-container-runtime was modified a few years ago to change config.json directly, and I had also forgotten the history behind #3842, which is the PR you probably meant to link.
It is not a given that systemd will not reconfigure the devices cgroup if you do not tell it the rules you want -- my experience is that systemd will happily reconfigure any cgroup knob it likes unless you tell it what you want.
(For cgroupv2, commenting out a similar section to the one you commented out would disable all device rules and cause systemd to apply its default allow-all device rules.)
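To make the struck-out suggestion concrete: telling systemd about the device rules roughly amounts to setting DeviceAllow properties on the container's transient unit, e.g. via systemctl set-property. A sketch; the unit name and device paths here are placeholders, not taken from this issue:

# Add device rules to the systemd-managed scope so a daemon-reload does not drop them
systemctl set-property --runtime cri-containerd-<container-id>.scope \
    DeviceAllow="/dev/nvidia0 rwm" \
    DeviceAllow="/dev/nvidiactl rwm"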
@zhoaxiaohu Did you mean to link #3842 in the description?
As discussed in that PR, systemd 240 switched to parsing /dev/char/A:B directly so that issue should be solved for newer systemd versions. There was also a bug introduced in systemd v230 that seemed to be related (https://github.com/systemd/systemd/issues/35710), which was worked around in #4612.
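(As a quick check of that mechanism, one can verify that the /dev/char/<major>:<minor> symlinks systemd resolves actually exist for the NVIDIA nodes; the majors/minors below are just the ones reported later in this thread:)

# Do the /dev/char symlinks exist for the NVIDIA devices?
ls -l /dev/char/195:0 /dev/char/195:255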
Can you provide the contents of the corresponding /run/systemd/transient/$ctr.scope.d/50-Device*.conf files?
# systemctl status cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope
● cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope - libcontainer container 4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd>
Loaded: loaded (/run/systemd/transient/cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope; transient)
Transient: yes
Drop-In: /run/systemd/transient/cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope.d
└─50-CPUShares.conf, 50-DeviceAllow.conf, 50-DevicePolicy.conf
Active: active (running) since Fri 2025-08-08 11:51:18 CST; 1 week 6 days ago
Tasks: 1 (limit: 1645101)
Memory: 1.0M
CPU: 11ms
CGroup: /kubepods.slice/kubepods-pod30da9001_f4db_446b_877d_8f4cca69e72c.slice/cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope
└─ 13170 /pause
Notice: journal has been rotated since unit was started, output may be incomplete.
#cat /run/systemd/transient/cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope.d/50-DeviceAllow.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m
# cat /run/systemd/transient/cri-containerd-4bcc7895834609ab9bc2a04be1c3def1b7fde15fff8f1db7edabf09bd0e3121e.scope.d/50-DevicePolicy.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DevicePolicy=strict
My understanding is that for the /dev/nvidia* devices, runc either does not append the corresponding NVIDIA entries to the device allow list after parsing /proc/devices, or the device rules it propagates are incorrect, so systemd does not include them in 50-DeviceAllow.conf (see the listings below, and the sketch that follows them).
#cat /proc/devices | grep nvidia
195 nvidia-frontend
236 nvidia-nvswitch
237 nvidia-nvlink
238 nvidia-caps
511 nvidia-uvm
#ls /dev/nvidia* -l
crw-rw-rw- 1 root root 195, 0 Aug 8 11:47 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Aug 8 11:47 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Aug 8 11:47 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Aug 8 11:47 /dev/nvidia3
crw-rw-rw- 1 root root 195, 255 Aug 8 11:47 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Aug 8 11:47 /dev/nvidia-modeset
crw-rw-rw- 1 root root 511, 0 Aug 8 11:50 /dev/nvidia-uvm
crw-rw-rw- 1 root root 511, 1 Aug 8 11:50 /dev/nvidia-uvm-tools
/dev/nvidia-caps:
total 0
cr-------- 1 root root 238, 1 Aug 8 11:47 nvidia-cap1
cr--r--r-- 1 root root 238, 2 Aug 8 11:47 nvidia-cap2
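Given those majors/minors, the drop-in for a container that keeps GPU access would be expected to also contain entries along these lines (a sketch, not output from this host; which 195:N minor appears depends on the GPU exposed to the container, e.g. 195:3 for NVIDIA_VISIBLE_DEVICES=3):

# Expected additional entries in 50-DeviceAllow.conf (sketch)
DeviceAllow=/dev/char/195:3 rwm
DeviceAllow=/dev/char/195:255 rwm
DeviceAllow=/dev/char/195:254 rwm
DeviceAllow=/dev/char/511:0 rwm
DeviceAllow=/dev/char/511:1 rwm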
So nvidia-container-runtime is supposed to be setting these devices in our config.json, in which case they should show up in DeviceAllow (those entries are set based on our configuration of the transient unit). I don't have an NVIDIA GPU to test with, unfortunately...
Hi @cyphar @zhoaxiaohu @elezar, I tested your suggestion of explicitly setting the devices in config.json, and it worked as expected. Below are the steps I followed. I set up config.json manually for now, but I'll be raising a PR upstream in nvidia-container-runtime to address this properly.
- Created a config.json with nvidia devices in linux.resources.devices:
"linux": {
"resources": {
"devices": [
{ "allow": true, "type": "c", "major": 195, "minor": 0, "access": "rwm" },
{ "allow": true, "type": "c", "major": 195, "minor": 255, "access": "rwm" },
{ "allow": true, "type": "c", "major": 236, "minor": 0, "access": "rwm" },
{ "allow": true, "type": "c", "major": 236, "minor": 1, "access": "rwm" }
]
},
"cgroupsPath": "system.slice:runc:gpu-hook-test"
}
- Verified that DeviceAllow now includes the NVIDIA devices (see the verification sketch below)
- Ran nvidia-smi before and after systemctl daemon-reload, and it works without issues
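For completeness, a sketch of that verification; the unit name runc-gpu-hook-test.scope is an assumption derived from the cgroupsPath above, not something reported in this thread:

# Verification sketch
systemctl show -p DeviceAllow runc-gpu-hook-test.scope
cat /run/systemd/transient/runc-gpu-hook-test.scope.d/50-DeviceAllow.conf
systemctl daemon-reload
# nvidia-smi inside the container still succeeds after the reload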