RKE2: [pre-installed drivers+container-toolkit] error creating symlinks
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.0/getting-started.html
1. Quick Debug Information
- Bare metal
- OS/Version: Ubuntu 22.04.3 LTS
- Container Runtime Type/Version: containerd
- K8s Flavor/Version: Rancher RKE2 v1.25.12+rke2r1
- GPU Operator Version: nvidia gpu-operator-v23.6.0
2. Issue or feature description
When deploying the gpu-operator on RKE2 with pre-installed drivers and container toolkit, the nvidia-operator-validator container fails with "Error: error validating driver installation: error creating symlinks":
level=info msg="Error: error validating driver installation: error creating symlinks: failed to get device nodes: failed to get GPU information: error getting all NVIDIA devices: error constructing NVIDIA PCI device 0000:01:00.1: unable to get device name: failed to find device with id '10fa'\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n validator:\n driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\""
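For context, 0000:01:00.1 is typically the GPU's onboard audio function. To see which NVIDIA PCI functions (vendor 10de) and device IDs are present on the host, a quick check (assuming pciutils is installed):
$ lspci -nn -d 10de:
# Optionally refresh the host's local PCI ID database; this only helps when
# inspecting the host and is not a fix for the validator itself.
$ sudo update-pciids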
3. Steps to reproduce the issue
$ sudo apt-get install -y nvidia-driver-535-server nvidia-container-toolkit
$ sudo shutdown -r now
$ helm install gpu-operator -n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value=true \
--set psp.enabled=true
4. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status: kubectl get pods -n gpu-operator
- [ ] kubernetes daemonset status: kubectl get ds -n gpu-operator
- [ ] If a pod/ds is in an error state or pending state: kubectl describe pod -n gpu-operator POD_NAME
- [x] If a pod/ds is in an error state or pending state: kubectl logs -n gpu-operator POD_NAME --all-containers
- [ ] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n gpu-operator -c nvidia-driver-ctr -- nvidia-smi
- [ ] containerd logs: journalctl -u containerd > containerd.log
spectrum@spectrum:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
spectrum@spectrum:~$ apt list --installed | grep nvidia
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
libnvidia-cfg1-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-common-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 all [installed,automatic]
libnvidia-compute-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-container-tools/bionic,bionic,now 1.13.5-1 amd64 [installed,automatic]
libnvidia-container1/bionic,bionic,now 1.13.5-1 amd64 [installed,automatic]
libnvidia-decode-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-encode-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-extra-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-fbc1-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-gl-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-compute-utils-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-container-toolkit-base/bionic,bionic,now 1.13.5-1 amd64 [installed,automatic]
nvidia-container-toolkit/bionic,bionic,now 1.13.5-1 amd64 [installed]
nvidia-dkms-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-driver-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed]
nvidia-firmware-535-server-535.54.03/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-kernel-common-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-kernel-source-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-utils-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
xserver-xorg-video-nvidia-535-server/jammy-updates,jammy-security,now 535.54.03-0ubuntu0.22.04.1 amd64 [installed,automatic]
spectrum@spectrum:~$ cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
spectrum@spectrum:~$ nvidia-container-cli info
NVRM version: 535.54.03
CUDA version: 12.2
Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce GTX 1650
Brand: GeForce
GPU UUID: GPU-648ac414-633e-cf39-d315-eabd271dfad1
Bus Location: 00000000:01:00.0
Architecture: 7.5
spectrum@spectrum:~$ kubectl logs -n gpu-operator -p nvidia-operator-validator-j8kvt --all-containers=true
time="2023-08-17T05:24:02Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Wed Aug 16 23:24:03 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1650 Off | 00000000:01:00.0 Off | N/A |
| 30% 35C P0 11W / 75W | 0MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
time="2023-08-17T05:24:03Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidiactl already exists"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidia-modeset already exists"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidia-uvm already exists"
time="2023-08-17T05:24:03Z" level=info msg="Skipping: /dev/nvidia-uvm-tools already exists"
time="2023-08-17T05:24:03Z" level=info msg="Error: error validating driver installation: error creating symlinks: failed to get device nodes: failed to get GPU information: error getting all NVIDIA devices: error constructing NVIDIA PCI device 0000:01:00.1: unable to get device name: failed to find device with id '10fa'\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n validator:\n driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\""
Error from server (BadRequest): previous terminated container "toolkit-validation" in pod "nvidia-operator-validator-j8kvt" not found
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]
# tree /run/nvidia
/run/nvidia
├── driver
└── validations
2 directories, 0 files
spectrum@spectrum:/tmp/nvidia-gpu-operator_20230816_2329 $ cat gpu_operand_ds_nvidia-operator-validator.descr
Name: nvidia-operator-validator
Selector: app=nvidia-operator-validator,app.kubernetes.io/part-of=gpu-operator
Node-Selector: nvidia.com/gpu.deploy.operator-validator=true
Labels: app=nvidia-operator-validator
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=gpu-operator
helm.sh/chart=gpu-operator-v23.6.0
Annotations: deprecated.daemonset.template.generation: 1
nvidia.com/last-applied-hash: fa2bb82bef132a9a
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 1
Number of Nodes Scheduled with Up-to-date Pods: 1
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status: 0 Running / 1 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=nvidia-operator-validator
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=gpu-operator
helm.sh/chart=gpu-operator-v23.6.0
Service Account: nvidia-operator-validator
Init Containers:
driver-validation:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
Environment:
WITH_WAIT: true
COMPONENT: driver
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
toolkit-validation:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
Environment:
NVIDIA_VISIBLE_DEVICES: all
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
cuda-validation:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
Environment:
WITH_WAIT: false
COMPONENT: cuda
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
VALIDATOR_RUNTIME_CLASS: nvidia
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
plugin-validation:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
Environment:
COMPONENT: plugin
WITH_WAIT: false
WITH_WORKLOAD: false
MIG_STRATEGY: single
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
VALIDATOR_RUNTIME_CLASS: nvidia
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
Containers:
nvidia-operator-validator:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.0
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
echo all validations are successful; sleep infinity
Environment: <none>
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
Volumes:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
Priority Class Name: system-node-critical
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 11m daemonset-controller Created pod: nvidia-operator-validator-j8kvt
spectrum@spectrum:~$ sudo nvidia-ctk system create-dev-char-symlinks
INFO[0000] Creating link /dev/char/195:254 => /dev/nvidia-modeset
WARN[0000] Could not create symlink: symlink /dev/nvidia-modeset /dev/char/195:254: file exists
INFO[0000] Creating link /dev/char/507:0 => /dev/nvidia-uvm
WARN[0000] Could not create symlink: symlink /dev/nvidia-uvm /dev/char/507:0: file exists
INFO[0000] Creating link /dev/char/507:1 => /dev/nvidia-uvm-tools
WARN[0000] Could not create symlink: symlink /dev/nvidia-uvm-tools /dev/char/507:1: file exists
INFO[0000] Creating link /dev/char/195:0 => /dev/nvidia0
WARN[0000] Could not create symlink: symlink /dev/nvidia0 /dev/char/195:0: file exists
INFO[0000] Creating link /dev/char/195:255 => /dev/nvidiactl
WARN[0000] Could not create symlink: symlink /dev/nvidiactl /dev/char/195:255: file exists
INFO[0000] Creating link /dev/char/511:1 => /dev/nvidia-caps/nvidia-cap1
WARN[0000] Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap1 /dev/char/511:1: file exists
INFO[0000] Creating link /dev/char/511:2 => /dev/nvidia-caps/nvidia-cap2
WARN[0000] Could not create symlink: symlink /dev/nvidia-caps/nvidia-cap2 /dev/char/511:2: file exists
Hi @DevKyleS. We are aware of this issue. For the time being, please update the cluster policy and add:
  - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
    value: "true"
to validator.driver.env.
cc @cdesiniotis
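For an operator that is already deployed, a patch along these lines should add the same variable to the ClusterPolicy (a sketch, assuming the default resource name cluster-policy):
$ kubectl patch clusterpolicy/cluster-policy --type merge -p \
    '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'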
I finally got it working after removing /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl (by renaming it).
I'll dig more into the differences later.
$ mv config.toml.tmpl config.toml.tmpl-nvidia
$ sudo service containerd restart
$ sudo service rke2-server restart
$ helm uninstall gpu-operator -n gpu-operator
$ helm install gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value=true \
--set psp.enabled=true \
--set validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION \
--set-string validator.driver.env[0].value=true
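To confirm the validator now passes, a quick check along these lines (the app=nvidia-operator-validator label comes from the daemonset shown above):
$ kubectl get pods -n gpu-operator
$ kubectl logs -n gpu-operator -l app=nvidia-operator-validator --all-containers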
Looks like the original rke2 containerd config differs from what I had placed in the tmpl file per guidance on https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.0/getting-started.html#bare-metal-passthrough-with-pre-installed-drivers-and-nvidia-container-toolkit
But it still works...
$ sudo ctr run --rm -t --runc-binary=/usr/bin/nvidia-container-runtime --env NVIDIA_VISIBLE_DEVICES=all docker.io/nvidia/cuda:12.2.0-base-ubuntu22.04 cuda-22.2.0-base-ubuntu22.04 nvidia-smi
Thu Aug 17 17:36:54 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1650 Off | 00000000:01:00.0 Off | N/A |
| 34% 36C P8 7W / 75W | 1MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
root@spectrum:/var/lib/rancher/rke2/agent/etc/containerd# cat config.toml
# File generated by rke2. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2
[plugins."io.containerd.internal.v1.opt"]
path = "/var/lib/rancher/rke2/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
stream_server_address = "127.0.0.1"
stream_server_port = "10010"
enable_selinux = false
enable_unprivileged_ports = true
enable_unprivileged_icmp = true
sandbox_image = "index.docker.io/rancher/pause:3.6"
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
disable_snapshot_annotations = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
Still investigating...
Looks like I somehow have multiple versions of nvidia-container-runtime installed. Still investigating, as this doesn't appear to be fully working yet, but the node and containers can now start (they couldn't before)...
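A quick way to see which nvidia-container-runtime binaries are present on the host (a sketch; both paths appear earlier in this thread, /usr/bin/... in the tmpl and /usr/local/nvidia/toolkit/... in the generated config.toml):
$ which -a nvidia-container-runtime
$ ls -l /usr/bin/nvidia-container-runtime /usr/local/nvidia/toolkit/nvidia-container-runtime
$ /usr/bin/nvidia-container-runtime --version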
After upgrading to v23.6.1, I'm no longer able to reproduce this issue.
After reading https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.1/release-notes.html#fixed-issues I attempted this with the new version. My issue has been resolved.
@elezar I think this can be closed as it appears the v23.6.1 release has fixed this problem.
I still can reproduce this issue with version v23.9.0.
In our case the NVIDIA drivers come pre-installed, and I can see devices /dev/nvidia*
> Hi @DevKyleS. We are aware of this issue. For the time being, please update the cluster policy and add:
>   - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
>     value: "true"
> to validator.driver.env.
> cc @cdesiniotis
I tried this on my cluster policy and restarted the cluster but still get the same error. I am using version 23.9.0
> I still can reproduce this issue with version v23.9.0.
> In our case the NVIDIA drivers come pre-installed, and I can see devices /dev/nvidia*
Did you find a workaround?
> - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
>   value: "true"
This is what we have in our values.yaml:
validator:
  driver:
    env:
    - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
      value: "true"
I hit this today on v23.9.1
Adding DISABLE_DEV_CHAR_SYMLINK_CREATION resolved it in my case.
That said - the release notes say this should have been fixed in 23.6.1: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.9.1/release-notes.html#id8
And I am definitely still seeing it in 23.9.1
Mostly consumer GPUs (RTX2080s) on my nodes.
Just encountered this with a Tesla P4 on v23.9.1/rke2 v1.29
We also encountered the same problem in v23.9.0. I manually set the DISABLE_DEV_CHAR_SYMLINK_CREATION parameter as prompted, and the container-toolkit works normally. However, the toolkit-validation check of nvidia-operator-validator still fails, and the following error message is displayed:
[attached screenshots: gpu-operator version; libnvidia-ml.so host path; exec into the validator container showing /usr/lib64/libnvidia-ml.so]
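A quick way to check where the driver's libnvidia-ml.so is registered on the host (a sketch; the exact library directory depends on the distribution):
$ ldconfig -p | grep libnvidia-ml.so
$ ls -l /usr/lib64/libnvidia-ml.so*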