GPU driver validator errors: unable to load kernel module nvidia-modeset
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
- Kernel Version: 6.2
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K3s
- GPU Operator Version: v23.9.1
2. Issue or feature description
I installed gpu-operator with Helm, disabling the driver and container toolkit since both already exist on the nodes and are tested to be working.
The installation was mainly for monitoring metrics.
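For context, a minimal sketch of this kind of install; the exact flags and values below are illustrative rather than a copy of the command actually run:

    helm install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator --create-namespace \
      --version v23.9.1 \
      --set driver.enabled=false \
      --set toolkit.enabled=false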
The operator-validator and container-toolkit daemonsets are in an error state. The operator-validator fails with the following log:
Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia-modeset: exit status 1; output=modprobe: ERROR: could not insert 'nvidia_modeset': No such device
Failed to create symlinks under /dev/char that point to all possible NVIDIA character devices.
The existence of these symlinks is required to address the following bug:
This bug impacts container runtimes configured with systemd cgroup management enabled.
To disable the symlink creation, set the following envvar in ClusterPolicy:
validator:
  driver:
    env:
    - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
      value: "true"
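For reference, if that env var were applied to the existing install, I believe it would look roughly like this (a sketch only; it assumes the chart's default ClusterPolicy name cluster-policy, and I have not run it):

    kubectl patch clusterpolicy/cluster-policy --type merge -p \
      '{"spec": {"validator": {"driver": {"env": [{"name": "DISABLE_DEV_CHAR_SYMLINK_CREATION", "value": "true"}]}}}}'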
It seems odd that the symlinks are required to address a bug whose reference in the message above is empty, and nvidia-modeset should only be relevant for display drivers. This is an H100 cluster without display support.
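As a sanity check, the module state can be inspected directly on an affected node, for example:

    # list the NVIDIA kernel modules currently loaded on the node
    lsmod | grep -i nvidia
    # try loading the module the validator complains about
    sudo modprobe nvidia-modeset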
3. Steps to reproduce the issue
I am unsure. I cannot tear down and rebuild the whole cluster from scratch, as other workloads are running on it.
4. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status:
NAME READY STATUS RESTARTS AGE
gpu-operator-node-feature-discovery-worker-bw7gq 1/1 Running 0 17m
gpu-operator-node-feature-discovery-worker-cct7m 1/1 Running 0 17m
gpu-operator-node-feature-discovery-worker-qqfpk 1/1 Running 0 17m
gpu-operator-node-feature-discovery-worker-l8jjj 1/1 Running 0 17m
gpu-operator-node-feature-discovery-worker-wfdb4 1/1 Running 0 17m
gpu-operator-node-feature-discovery-worker-n7qfk 1/1 Running 0 17m
gpu-operator-node-feature-discovery-worker-2sqw8 1/1 Running 0 17m
gpu-operator-node-feature-discovery-worker-dp249 1/1 Running 0 17m
gpu-operator-node-feature-discovery-gc-7cc7ccfff8-9bmpk 1/1 Running 0 17m
gpu-operator-node-feature-discovery-worker-c4dcz 1/1 Running 0 17m
gpu-operator-node-feature-discovery-master-d8597d549-2k8gv 1/1 Running 0 17m
nvidia-dcgm-exporter-cbmqt 0/1 Init:0/1 0 17m
nvidia-device-plugin-daemonset-mczt4 0/1 Init:0/1 0 17m
gpu-feature-discovery-sthjb 0/1 Init:0/1 0 17m
nvidia-dcgm-exporter-tbwcq 0/1 Init:0/1 0 17m
nvidia-device-plugin-daemonset-7fw57 0/1 Init:0/1 0 17m
nvidia-device-plugin-daemonset-q7lj2 0/1 Init:0/1 0 17m
gpu-feature-discovery-qc7w4 0/1 Init:0/1 0 17m
gpu-feature-discovery-j276b 0/1 Init:0/1 0 17m
nvidia-dcgm-exporter-fl696 0/1 Init:0/1 0 17m
nvidia-device-plugin-daemonset-bx9wv 0/1 Init:0/1 0 17m
nvidia-device-plugin-daemonset-n6dc6 0/1 Init:0/1 0 17m
nvidia-dcgm-exporter-kgw6x 0/1 Init:0/1 0 17m
nvidia-dcgm-exporter-jwxb4 0/1 Init:0/1 0 17m
gpu-feature-discovery-hvww8 0/1 Init:0/1 0 17m
gpu-feature-discovery-5n894 0/1 Init:0/1 0 17m
gpu-feature-discovery-vp5pc 0/1 Init:0/1 0 17m
nvidia-device-plugin-daemonset-q2lj2 0/1 Init:0/1 0 17m
nvidia-dcgm-exporter-4sf24 0/1 Init:0/1 0 17m
nvidia-device-plugin-daemonset-96hv4 0/1 Init:0/1 0 17m
nvidia-dcgm-exporter-lhgsl 0/1 Init:0/1 0 17m
gpu-feature-discovery-wfzs9 0/1 Init:0/1 0 17m
gpu-feature-discovery-hz76m 0/1 Init:0/1 0 17m
nvidia-dcgm-exporter-xh9rp 0/1 Init:0/1 0 17m
nvidia-device-plugin-daemonset-l5rvn 0/1 Init:0/1 0 17m
gpu-operator-999cc8dcc-cj7nc 1/1 Running 0 17m
nvidia-operator-validator-ljsq2 0/1 Init:CrashLoopBackOff 8 (78s ago) 17m
nvidia-operator-validator-pd2jt 0/1 Init:CrashLoopBackOff 8 (70s ago) 17m
nvidia-operator-validator-248td 0/1 Init:CrashLoopBackOff 8 (74s ago) 17m
nvidia-container-toolkit-daemonset-qzcgk 0/1 Init:CrashLoopBackOff 8 (65s ago) 17m
nvidia-operator-validator-ghgrt 0/1 Init:CrashLoopBackOff 8 (59s ago) 17m
nvidia-container-toolkit-daemonset-lbqjm 0/1 Init:CrashLoopBackOff 8 (56s ago) 17m
nvidia-container-toolkit-daemonset-qmfpq 0/1 Init:CrashLoopBackOff 8 (56s ago) 17m
nvidia-container-toolkit-daemonset-f7t4f 0/1 Init:CrashLoopBackOff 8 (57s ago) 17m
nvidia-container-toolkit-daemonset-42kvm 0/1 Init:CrashLoopBackOff 8 (58s ago) 17m
nvidia-container-toolkit-daemonset-pr8fj 0/1 Init:CrashLoopBackOff 8 (57s ago) 17m
nvidia-operator-validator-c7n4l 0/1 Init:CrashLoopBackOff 8 (54s ago) 17m
nvidia-container-toolkit-daemonset-r5tqh 0/1 Init:CrashLoopBackOff 8 (51s ago) 17m
nvidia-operator-validator-hwhzs 0/1 Init:CrashLoopBackOff 8 (51s ago) 17m
nvidia-operator-validator-qwg6k 0/1 Init:CrashLoopBackOff 8 (45s ago) 17m
nvidia-operator-validator-b427r 0/1 Init:CrashLoopBackOff 8 (48s ago) 17m
nvidia-container-toolkit-daemonset-z4rzl 0/1 Init:CrashLoopBackOff 8 (44s ago) 17m
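For reference, the validator log quoted above was taken from the validator pod's init container, roughly like this (the init container name driver-validation is my assumption based on GPU Operator defaults):

    kubectl logs -n gpu-operator nvidia-operator-validator-ljsq2 -c driver-validation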
- [ ] kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-operator gpu-operator-node-feature-discovery-worker 9 9 9 9 9 <none> 18m
gpu-operator nvidia-container-toolkit-daemonset 8 8 0 8 0 nvidia.com/gpu.deploy.container-toolkit=true 18m
gpu-operator nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 17m
gpu-operator nvidia-operator-validator 8 8 0 8 0 nvidia.com/gpu.deploy.operator-validator=true 17m
gpu-operator nvidia-device-plugin-daemonset 8 8 0 8 0 nvidia.com/gpu.deploy.device-plugin=true 17m
gpu-operator nvidia-dcgm-exporter 8 8 0 8 0 nvidia.com/gpu.deploy.dcgm-exporter=true 17m
gpu-operator gpu-feature-discovery 8 8 0 8 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 17m
gpu-operator nvidia-driver-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.driver=true 18m
- [ ] nvidia-smi output from one of the GPU nodes:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:00:05.0 Off | 0 |
| N/A 25C P0 73W / 700W | 150MiB / 81559MiB | 1% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:00:06.0 Off | 0 |
| N/A 26C P0 72W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:00:07.0 Off | 0 |
| N/A 27C P0 72W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:00:08.0 Off | 0 |
| N/A 25C P0 71W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:00:09.0 Off | 0 |
| N/A 25C P0 72W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:00:0A.0 Off | 0 |
| N/A 28C P0 71W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:00:0B.0 Off | 0 |
| N/A 26C P0 75W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:00:0C.0 Off | 0 |
| N/A 24C P0 75W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3575 G /usr/lib/xorg/Xorg 60MiB |
| 0 N/A N/A 3866 G /usr/bin/gnome-shell 79MiB |
+---------------------------------------------------------------------------------------+
The containerd logs are too big to attach here.
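If a trimmed extract would help, something like this would pull only the NVIDIA-related lines from the K3s-embedded containerd log (the path is the default K3s location and an assumption about this particular setup):

    grep -i nvidia /var/lib/rancher/k3s/agent/containerd/containerd.log | tail -n 500 > containerd-nvidia.log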
cc @shivamerla @elezar