
gpu-operator pod for vgpu is Init:CrashLoopBackOff (failed to load module nvidia-uvm)

yiminghub2024 opened this issue 1 month ago · 1 comment

Hi, I installed the vGPU host driver for Ubuntu 22.04 LTS on the physical host (nvidia-vgpu-ubuntu-580_580.105.06_amd64.deb), and then installed the GPU Operator (vGPU mode) in my k8s cluster following https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/25.3.4/install-gpu-operator-vgpu.html
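
For reference, the operator was installed with Helm per that guide; a rough sketch of the install (the chart version matches the pod labels below, my exact flags may have differed, and driver.enabled=false is only my understanding of the setting to use when the driver is pre-installed on the host):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --version=v25.10.0 \
    --set driver.enabled=false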
After installation, the pods keep crashing:

root@kf1:~# kubectl get pod -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-nqfs4                                       0/1     Init:0/1                0                57m
gpu-operator-1763694419-node-feature-discovery-gc-ddfc6c8cpdjlq   1/1     Running                 0                58m
gpu-operator-1763694419-node-feature-discovery-master-85b72ldhc   1/1     Running                 0                58m
gpu-operator-1763694419-node-feature-discovery-worker-7v227       1/1     Running                 0                58m
gpu-operator-1763694419-node-feature-discovery-worker-f2292       1/1     Running                 0                58m
gpu-operator-1763694419-node-feature-discovery-worker-pxf8m       1/1     Running                 0                58m
gpu-operator-1763694419-node-feature-discovery-worker-vk65s       1/1     Running                 0                58m
gpu-operator-69f467c76b-sfmkn                                     1/1     Running                 0                58m
nvidia-container-toolkit-daemonset-l8n2j                          0/1     Init:CrashLoopBackOff   16 (5s ago)      57m
nvidia-dcgm-exporter-4bbpt                                        0/1     Init:0/1                0                57m
nvidia-device-plugin-daemonset-ct9pv                              0/1     Init:0/1                0                57m
nvidia-operator-validator-85qxj                                   0/1     Init:Error              16 (4m57s ago)   57m

Logs and pod description:

root@kf1:~# kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-l8n2j
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
Error from server (BadRequest): container "nvidia-container-toolkit-ctr" in pod "nvidia-container-toolkit-daemonset-l8n2j" is waiting to start: PodInitializing
root@kf1:~# kubectl describe -n gpu-operator nvidia-container-toolkit-daemonset-l8n2j
error: the server doesn't have a resource type "nvidia-container-toolkit-daemonset-l8n2j"
root@kf1:~# kubectl describe pod -n gpu-operator nvidia-container-toolkit-daemonset-l8n2j
Name:                 nvidia-container-toolkit-daemonset-l8n2j
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 v100/10.133.72.200
Start Time:           Fri, 21 Nov 2025 03:08:00 +0000
Labels:               app=nvidia-container-toolkit-daemonset
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=987666d87
                      helm.sh/chart=gpu-operator-v25.10.0
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: aae750a252cf75105ac8bbdeb9071f27a8af994c734994a93799327b22a70933
                      cni.projectcalico.org/podIP: 10.42.3.250/32
                      cni.projectcalico.org/podIPs: 10.42.3.250/32
Status:               Pending
IP:                   10.42.3.250
IPs:
  IP:           10.42.3.250
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://fcf93324fe05c45eaa17445217a3d7fd737aab00c08d452f74be14860648ebd7
    Image:         nvcr.io/nvidia/gpu-operator:v25.10.0
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:d4841412c9b8d27c53b1588dcebffc5451e9ab0fd36b2c656c33a65356507cfd
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 21 Nov 2025 04:05:09 +0000
      Finished:     Fri, 21 Nov 2025 04:05:10 +0000
    Ready:          False
    Restart Count:  16
    Environment:
      WITH_WAIT:           true
      COMPONENT:           driver
      OPERATOR_NAMESPACE:  gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-dir (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqhgx (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.18.0
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      /bin/entrypoint.sh
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      ROOT:                                             /usr/local/nvidia
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND:  management.nvidia.com/gpu
      NVIDIA_VISIBLE_DEVICES:                           void
      TOOLKIT_PID_FILE:                                 /run/nvidia/toolkit/toolkit.pid
      CDI_ENABLED:                                      true
      NVIDIA_RUNTIME_SET_AS_DEFAULT:                    false
      NVIDIA_CONTAINER_RUNTIME_MODE:                    cdi
      CRIO_CONFIG_MODE:                                 config
      RUNTIME:                                          containerd
      CONTAINERD_RUNTIME_CLASS:                         nvidia
      RUNTIME_CONFIG:                                   /runtime/config-dir/config.toml
      CONTAINERD_CONFIG:                                /runtime/config-dir/config.toml
      RUNTIME_DROP_IN_CONFIG:                           /runtime/config-dir.d/99-nvidia.toml
      RUNTIME_DROP_IN_CONFIG_HOST_PATH:                 /etc/containerd/conf.d/99-nvidia.toml
      RUNTIME_SOCKET:                                   /runtime/sock-dir/containerd.sock
      CONTAINERD_SOCKET:                                /runtime/sock-dir/containerd.sock
    Mounts:
      /bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
      /driver-root from driver-install-dir (rw)
      /host from host-root (ro)
      /run/nvidia/toolkit from toolkit-root (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /runtime/config-dir.d/ from containerd-drop-in-config (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/local/nvidia from toolkit-install-dir (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqhgx (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  nvidia-container-toolkit-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-container-toolkit-entrypoint
    Optional:  false
  toolkit-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/toolkit
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/containerd
    HostPathType:  DirectoryOrCreate
  containerd-drop-in-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/containerd/conf.d
    HostPathType:  DirectoryOrCreate
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containerd
    HostPathType:  
  kube-api-access-vqhgx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  58m                    default-scheduler  Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-l8n2j to v100
  Normal   Started    41m (x9 over 58m)      kubelet            Started container driver-validation
  Normal   Created    36m (x10 over 58m)     kubelet            Created container: driver-validation
  Warning  BackOff    2m47s (x254 over 57m)  kubelet            Back-off restarting failed container driver-validation in pod nvidia-container-toolkit-daemonset-l8n2j_gpu-operator(c5183556-4b31-4ec6-a671-ce2055574596)
  Normal   Pulled     54s (x17 over 58m)     kubelet            Container image "nvcr.io/nvidia/gpu-operator:v25.10.0" already present on machine

Logs from the driver-validation init container:

root@kf1:~# kubectl logs -n gpu-operator nvidia-container-toolkit-daemonset-l8n2j -c driver-validation --tail=50
time="2025-11-21T04:05:09Z" level=info msg="version: 6f3d599d-amd64, commit: 6f3d599"
time="2025-11-21T04:05:09Z" level=info msg="Attempting to validate a pre-installed driver on the host"
Fri Nov 21 12:05:10 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.06             Driver Version: 580.105.06     CUDA Version: N/A      |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-32GB           On  |   00000000:8A:00.0 Off |                  Off |
| N/A   39C    P0             69W /  300W |      61MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  |   00000000:8B:00.0 Off |                  Off |
| N/A   40C    P0             71W /  300W |      61MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  |   00000000:8C:00.0 Off |                  Off |
| N/A   53C    P0             74W /  300W |      61MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  |   00000000:8D:00.0 Off |                  Off |
| N/A   45C    P0             69W /  300W |      61MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           On  |   00000000:B3:00.0 Off |                  Off |
| N/A   36C    P0             46W /  300W |      61MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           On  |   00000000:B4:00.0 Off |                  Off |
| N/A   43C    P0             71W /  300W |      61MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           On  |   00000000:B5:00.0 Off |                  Off |
| N/A   39C    P0             44W /  300W |      61MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
time="2025-11-21T04:05:10Z" level=info msg="Detected a pre-installed driver on the host"
time="2025-11-21T04:05:10Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2025-11-21T04:05:10Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia-uvm: exit status 1; output=modprobe: FATAL: Module nvidia-uvm not found in directory /lib/modules/6.8.0-40-generic\n\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""

So the most important point is the step "creating symlinks under /dev/char that correspond to NVIDIA character devices" failing with: "Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia-uvm: exit status 1; output=modprobe: FATAL: Module nvidia-uvm not found in directory /lib/modules/6.8.0-40-generic"
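
The validator's error message itself suggests a workaround: disable the /dev/char symlink creation through ClusterPolicy. If I understand it correctly, that would be something like the following patch (assuming the default ClusterPolicy resource name cluster-policy created by the Helm chart):

kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge -p \
  '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'

But I am not sure whether suppressing the symlink creation is the right fix, or whether the validator should not be trying to load nvidia-uvm at all on a vGPU host.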

The physical V100 GPU server does not have the nvidia-uvm module, and when I inspected the installed driver package there is no nvidia-uvm module in it either.
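
For anyone reproducing this, the modules present on the host can be checked with standard tooling, for example:

# NVIDIA kernel modules built for the running kernel
find /lib/modules/$(uname -r) -name 'nvidia*.ko*'
# NVIDIA modules currently loaded
lsmod | grep '^nvidia'
# DKMS build status of the installed driver package
dkms status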

How can I resolve this problem?

My goal is to use Kubeflow to create notebook pods that use mdev vGPU devices (like V100-2C) in the k8s cluster.

yiminghub2024, Nov 21 '25 05:11