GPU Operator pods for vGPU are stuck in Init:CrashLoopBackOff (failed to load module nvidia-uvm)
Hi, I installed the vGPU host driver for Ubuntu 22.04 LTS on the physical host (nvidia-vgpu-ubuntu-580_580.105.06_amd64.deb) and deployed the GPU Operator in vGPU mode in my k8s cluster, following:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/25.3.4/install-gpu-operator-vgpu.html
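For context, this is roughly how I understand the host side can be checked after installing the vGPU Manager package (a sketch, not exact output from my machine; the PCI address below is only an example for an mdev-capable board like the V100):

# on the physical host (vGPU Manager / host driver side)
lsmod | grep -i nvidia                 # should show nvidia plus the vGPU vfio module (something like nvidia_vgpu_vfio); as far as I know the host driver does not ship nvidia-uvm
nvidia-smi                             # the host vGPU Manager reports "CUDA Version: N/A", as in the validator log below
ls /sys/class/mdev_bus/                # PCI addresses of GPUs that can expose mdev devices
ls /sys/class/mdev_bus/0000:8a:00.0/mdev_supported_types   # example address; lists the available vGPU types such as V100-2C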
After installation, the pods keep crashing:
root@kf1:~# kubectl get pod -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-nqfs4 0/1 Init:0/1 0 57m
gpu-operator-1763694419-node-feature-discovery-gc-ddfc6c8cpdjlq 1/1 Running 0 58m
gpu-operator-1763694419-node-feature-discovery-master-85b72ldhc 1/1 Running 0 58m
gpu-operator-1763694419-node-feature-discovery-worker-7v227 1/1 Running 0 58m
gpu-operator-1763694419-node-feature-discovery-worker-f2292 1/1 Running 0 58m
gpu-operator-1763694419-node-feature-discovery-worker-pxf8m 1/1 Running 0 58m
gpu-operator-1763694419-node-feature-discovery-worker-vk65s 1/1 Running 0 58m
gpu-operator-69f467c76b-sfmkn 1/1 Running 0 58m
nvidia-container-toolkit-daemonset-l8n2j 0/1 Init:CrashLoopBackOff 16 (5s ago) 57m
nvidia-dcgm-exporter-4bbpt 0/1 Init:0/1 0 57m
nvidia-device-plugin-daemonset-ct9pv 0/1 Init:0/1 0 57m
nvidia-operator-validator-85qxj 0/1 Init:Error 16 (4m57s ago) 57m
Logs:
root@kf1:~# kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-l8n2j
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
Error from server (BadRequest): container "nvidia-container-toolkit-ctr" in pod "nvidia-container-toolkit-daemonset-l8n2j" is waiting to start: PodInitializing
root@kf1:~# kubectl describe -n gpu-operator nvidia-container-toolkit-daemonset-l8n2j
error: the server doesn't have a resource type "nvidia-container-toolkit-daemonset-l8n2j"
root@kf1:~# kubectl describe pod -n gpu-operator nvidia-container-toolkit-daemonset-l8n2j
Name: nvidia-container-toolkit-daemonset-l8n2j
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-container-toolkit
Node: v100/10.133.72.200
Start Time: Fri, 21 Nov 2025 03:08:00 +0000
Labels: app=nvidia-container-toolkit-daemonset
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=987666d87
helm.sh/chart=gpu-operator-v25.10.0
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: aae750a252cf75105ac8bbdeb9071f27a8af994c734994a93799327b22a70933
cni.projectcalico.org/podIP: 10.42.3.250/32
cni.projectcalico.org/podIPs: 10.42.3.250/32
Status: Pending
IP: 10.42.3.250
IPs:
IP: 10.42.3.250
Controlled By: DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
driver-validation:
Container ID: containerd://fcf93324fe05c45eaa17445217a3d7fd737aab00c08d452f74be14860648ebd7
Image: nvcr.io/nvidia/gpu-operator:v25.10.0
Image ID: nvcr.io/nvidia/gpu-operator@sha256:d4841412c9b8d27c53b1588dcebffc5451e9ab0fd36b2c656c33a65356507cfd
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 21 Nov 2025 04:05:09 +0000
Finished: Fri, 21 Nov 2025 04:05:10 +0000
Ready: False
Restart Count: 16
Environment:
WITH_WAIT: true
COMPONENT: driver
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-dir (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqhgx (ro)
Containers:
nvidia-container-toolkit-ctr:
Container ID:
Image: nvcr.io/nvidia/k8s/container-toolkit:v1.18.0
Image ID:
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
Args:
/bin/entrypoint.sh
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
ROOT: /usr/local/nvidia
NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND: management.nvidia.com/gpu
NVIDIA_VISIBLE_DEVICES: void
TOOLKIT_PID_FILE: /run/nvidia/toolkit/toolkit.pid
CDI_ENABLED: true
NVIDIA_RUNTIME_SET_AS_DEFAULT: false
NVIDIA_CONTAINER_RUNTIME_MODE: cdi
CRIO_CONFIG_MODE: config
RUNTIME: containerd
CONTAINERD_RUNTIME_CLASS: nvidia
RUNTIME_CONFIG: /runtime/config-dir/config.toml
CONTAINERD_CONFIG: /runtime/config-dir/config.toml
RUNTIME_DROP_IN_CONFIG: /runtime/config-dir.d/99-nvidia.toml
RUNTIME_DROP_IN_CONFIG_HOST_PATH: /etc/containerd/conf.d/99-nvidia.toml
RUNTIME_SOCKET: /runtime/sock-dir/containerd.sock
CONTAINERD_SOCKET: /runtime/sock-dir/containerd.sock
Mounts:
/bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
/driver-root from driver-install-dir (rw)
/host from host-root (ro)
/run/nvidia/toolkit from toolkit-root (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/runtime/config-dir.d/ from containerd-drop-in-config (rw)
/runtime/config-dir/ from containerd-config (rw)
/runtime/sock-dir/ from containerd-socket (rw)
/usr/local/nvidia from toolkit-install-dir (rw)
/usr/share/containers/oci/hooks.d from crio-hooks (rw)
/var/run/cdi from cdi-root (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqhgx (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
nvidia-container-toolkit-entrypoint:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nvidia-container-toolkit-entrypoint
Optional: false
toolkit-root:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/toolkit
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-dir:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
toolkit-install-dir:
Type: HostPath (bare host directory volume)
Path: /usr/local/nvidia
HostPathType:
crio-hooks:
Type: HostPath (bare host directory volume)
Path: /run/containers/oci/hooks.d
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
cdi-root:
Type: HostPath (bare host directory volume)
Path: /var/run/cdi
HostPathType: DirectoryOrCreate
containerd-config:
Type: HostPath (bare host directory volume)
Path: /etc/containerd
HostPathType: DirectoryOrCreate
containerd-drop-in-config:
Type: HostPath (bare host directory volume)
Path: /etc/containerd/conf.d
HostPathType: DirectoryOrCreate
containerd-socket:
Type: HostPath (bare host directory volume)
Path: /run/containerd
HostPathType:
kube-api-access-vqhgx:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.container-toolkit=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 58m default-scheduler Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-l8n2j to v100
Normal Started 41m (x9 over 58m) kubelet Started container driver-validation
Normal Created 36m (x10 over 58m) kubelet Created container: driver-validation
Warning BackOff 2m47s (x254 over 57m) kubelet Back-off restarting failed container driver-validation in pod nvidia-container-toolkit-daemonset-l8n2j_gpu-operator(c5183556-4b31-4ec6-a671-ce2055574596)
Normal Pulled 54s (x17 over 58m) kubelet Container image "nvcr.io/nvidia/gpu-operator:v25.10.0" already present on machine
Logs from the driver-validation init container:
root@kf1:~# kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-l8n2j -c driver-validation --tail=50
time="2025-11-21T04:05:09Z" level=info msg="version: 6f3d599d-amd64, commit: 6f3d599"
time="2025-11-21T04:05:09Z" level=info msg="Attempting to validate a pre-installed driver on the host"
Fri Nov 21 12:05:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.06 Driver Version: 580.105.06 CUDA Version: N/A |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-SXM2-32GB On | 00000000:8A:00.0 Off | Off |
| N/A 39C P0 69W / 300W | 61MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla V100-SXM2-32GB On | 00000000:8B:00.0 Off | Off |
| N/A 40C P0 71W / 300W | 61MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 Tesla V100-SXM2-32GB On | 00000000:8C:00.0 Off | Off |
| N/A 53C P0 74W / 300W | 61MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 Tesla V100-SXM2-32GB On | 00000000:8D:00.0 Off | Off |
| N/A 45C P0 69W / 300W | 61MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 Tesla V100-SXM2-32GB On | 00000000:B3:00.0 Off | Off |
| N/A 36C P0 46W / 300W | 61MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 Tesla V100-SXM2-32GB On | 00000000:B4:00.0 Off | Off |
| N/A 43C P0 71W / 300W | 61MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 Tesla V100-SXM2-32GB On | 00000000:B5:00.0 Off | Off |
| N/A 39C P0 44W / 300W | 61MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
time="2025-11-21T04:05:10Z" level=info msg="Detected a pre-installed driver on the host"
time="2025-11-21T04:05:10Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2025-11-21T04:05:10Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia-uvm: exit status 1; output=modprobe: FATAL: Module nvidia-uvm not found in directory /lib/modules/6.8.0-40-generic\n\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n validator:\n driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\""
So the most important point is: "creating symlinks under /dev/char that correspond to NVIDIA character devices" fails with "Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia-uvm: exit status 1; output=modprobe: FATAL: Module nvidia-uvm not found".
The physical V100 GPU server does not have the nvidia-uvm module, and I also inspected the install package; there is no nvidia-uvm module in it.
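For reference, the kind of check I ran on the worker node to confirm which NVIDIA kernel modules are actually present looks like this (a sketch; the package name is taken from the .deb filename above):

# on the v100 worker node
find /lib/modules/$(uname -r) -name 'nvidia*.ko*'    # NVIDIA kernel modules installed for the running kernel
lsmod | grep -i nvidia                               # NVIDIA modules currently loaded
modinfo nvidia-uvm                                   # fails here, since the module is not installed
dpkg -L nvidia-vgpu-ubuntu-580 | grep -i uvm         # nothing matches: the vGPU host package ships no nvidia-uvm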
How can I resolve this problem?
My goal is to use Kubeflow to create notebook pods that use mdev vGPU instances (like V100-2C) in the k8s cluster.
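For completeness, the validator error itself suggests a workaround: setting DISABLE_DEV_CHAR_SYMLINK_CREATION in ClusterPolicy. If I read it correctly, it would be applied with something like the patch below (assuming the default ClusterPolicy instance name cluster-policy), though I am not sure whether skipping the symlink creation is the right fix when the vGPU host driver has no nvidia-uvm at all:

# assumes the default ClusterPolicy instance name "cluster-policy"
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'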