gpu-operator

GPU driver validator fails: unable to load kernel module nvidia-modeset

Open • eliphatfs opened this issue 1 year ago • 1 comment

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Kernel Version: 6.2
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K3s
  • GPU Operator Version: v23.9.1

2. Issue or feature description

I installed gpu-operator with Helm, disabling the driver and toolkit components because they are already installed and tested to be working. The installation was mainly for monitoring metrics.
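
The exact Helm command is not given in this report. As a rough sketch only, a values file matching the description (driver and toolkit not managed by the operator) would look something like the following. `driver.enabled` and `toolkit.enabled` are options of the gpu-operator chart; the file name `values.yaml` and any other settings actually used are assumptions.

```yaml
# values.yaml (hypothetical file name; not taken from the report)
driver:
  enabled: false   # driver is assumed to be pre-installed
toolkit:
  enabled: false   # NVIDIA container toolkit is assumed to be pre-installed
```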

The operator-validator and container-toolkit-daemonset pods are in an error state (Init:CrashLoopBackOff). The operator-validator fails with the following log:

Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia-modeset: exit status 1; output=modprobe: ERROR: could not insert 'nvidia_modeset': No such device


Failed to create symlinks under /dev/char that point to all possible NVIDIA character devices.
The existence of these symlinks is required to address the following bug:


This bug impacts container runtimes configured with systemd cgroup management enabled.
To disable the symlink creation, set the following envvar in ClusterPolicy:

    validator:
      driver:
        env:
        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
          value: "true"
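
The fragment in the message above belongs under `spec` of the ClusterPolicy resource. A minimal sketch of the placement, assuming the resource has the chart's default name `cluster-policy` (the actual name can be checked with `kubectl get clusterpolicy`):

```yaml
# Fragment to merge into the existing ClusterPolicy, not a complete manifest.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy   # assumed default name; verify on the cluster
spec:
  validator:
    driver:
      env:
      # Skips creation of the /dev/char symlinks, as the log message suggests.
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"
```

This only disables the symlink creation step named in the message. Whether it also avoids the nvidia-modeset load failure is not established in this thread.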

It seems odd that the message says the symlinks are required to address a bug but then leaves the bug reference empty. Also, nvidia-modeset should only be relevant for display drivers, and this is an H100 cluster without display support.

3. Steps to reproduce the issue

I am unsure. I cannot tear down and rebuild the whole cluster from scratch because other workloads are running on it.

4. Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes pods status:
NAME                                                         READY   STATUS                  RESTARTS      AGE
gpu-operator-node-feature-discovery-worker-bw7gq             1/1     Running                 0             17m
gpu-operator-node-feature-discovery-worker-cct7m             1/1     Running                 0             17m
gpu-operator-node-feature-discovery-worker-qqfpk             1/1     Running                 0             17m
gpu-operator-node-feature-discovery-worker-l8jjj             1/1     Running                 0             17m
gpu-operator-node-feature-discovery-worker-wfdb4             1/1     Running                 0             17m
gpu-operator-node-feature-discovery-worker-n7qfk             1/1     Running                 0             17m
gpu-operator-node-feature-discovery-worker-2sqw8             1/1     Running                 0             17m
gpu-operator-node-feature-discovery-worker-dp249             1/1     Running                 0             17m
gpu-operator-node-feature-discovery-gc-7cc7ccfff8-9bmpk      1/1     Running                 0             17m
gpu-operator-node-feature-discovery-worker-c4dcz             1/1     Running                 0             17m
gpu-operator-node-feature-discovery-master-d8597d549-2k8gv   1/1     Running                 0             17m
nvidia-dcgm-exporter-cbmqt                                   0/1     Init:0/1                0             17m
nvidia-device-plugin-daemonset-mczt4                         0/1     Init:0/1                0             17m
gpu-feature-discovery-sthjb                                  0/1     Init:0/1                0             17m
nvidia-dcgm-exporter-tbwcq                                   0/1     Init:0/1                0             17m
nvidia-device-plugin-daemonset-7fw57                         0/1     Init:0/1                0             17m
nvidia-device-plugin-daemonset-q7lj2                         0/1     Init:0/1                0             17m
gpu-feature-discovery-qc7w4                                  0/1     Init:0/1                0             17m
gpu-feature-discovery-j276b                                  0/1     Init:0/1                0             17m
nvidia-dcgm-exporter-fl696                                   0/1     Init:0/1                0             17m
nvidia-device-plugin-daemonset-bx9wv                         0/1     Init:0/1                0             17m
nvidia-device-plugin-daemonset-n6dc6                         0/1     Init:0/1                0             17m
nvidia-dcgm-exporter-kgw6x                                   0/1     Init:0/1                0             17m
nvidia-dcgm-exporter-jwxb4                                   0/1     Init:0/1                0             17m
gpu-feature-discovery-hvww8                                  0/1     Init:0/1                0             17m
gpu-feature-discovery-5n894                                  0/1     Init:0/1                0             17m
gpu-feature-discovery-vp5pc                                  0/1     Init:0/1                0             17m
nvidia-device-plugin-daemonset-q2lj2                         0/1     Init:0/1                0             17m
nvidia-dcgm-exporter-4sf24                                   0/1     Init:0/1                0             17m
nvidia-device-plugin-daemonset-96hv4                         0/1     Init:0/1                0             17m
nvidia-dcgm-exporter-lhgsl                                   0/1     Init:0/1                0             17m
gpu-feature-discovery-wfzs9                                  0/1     Init:0/1                0             17m
gpu-feature-discovery-hz76m                                  0/1     Init:0/1                0             17m
nvidia-dcgm-exporter-xh9rp                                   0/1     Init:0/1                0             17m
nvidia-device-plugin-daemonset-l5rvn                         0/1     Init:0/1                0             17m
gpu-operator-999cc8dcc-cj7nc                                 1/1     Running                 0             17m
nvidia-operator-validator-ljsq2                              0/1     Init:CrashLoopBackOff   8 (78s ago)   17m
nvidia-operator-validator-pd2jt                              0/1     Init:CrashLoopBackOff   8 (70s ago)   17m
nvidia-operator-validator-248td                              0/1     Init:CrashLoopBackOff   8 (74s ago)   17m
nvidia-container-toolkit-daemonset-qzcgk                     0/1     Init:CrashLoopBackOff   8 (65s ago)   17m
nvidia-operator-validator-ghgrt                              0/1     Init:CrashLoopBackOff   8 (59s ago)   17m
nvidia-container-toolkit-daemonset-lbqjm                     0/1     Init:CrashLoopBackOff   8 (56s ago)   17m
nvidia-container-toolkit-daemonset-qmfpq                     0/1     Init:CrashLoopBackOff   8 (56s ago)   17m
nvidia-container-toolkit-daemonset-f7t4f                     0/1     Init:CrashLoopBackOff   8 (57s ago)   17m
nvidia-container-toolkit-daemonset-42kvm                     0/1     Init:CrashLoopBackOff   8 (58s ago)   17m
nvidia-container-toolkit-daemonset-pr8fj                     0/1     Init:CrashLoopBackOff   8 (57s ago)   17m
nvidia-operator-validator-c7n4l                              0/1     Init:CrashLoopBackOff   8 (54s ago)   17m
nvidia-container-toolkit-daemonset-r5tqh                     0/1     Init:CrashLoopBackOff   8 (51s ago)   17m
nvidia-operator-validator-hwhzs                              0/1     Init:CrashLoopBackOff   8 (51s ago)   17m
nvidia-operator-validator-qwg6k                              0/1     Init:CrashLoopBackOff   8 (45s ago)   17m
nvidia-operator-validator-b427r                              0/1     Init:CrashLoopBackOff   8 (48s ago)   17m
nvidia-container-toolkit-daemonset-z4rzl                     0/1     Init:CrashLoopBackOff   8 (44s ago)   17m
  • [ ] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
NAMESPACE               NAME                                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-operator            gpu-operator-node-feature-discovery-worker       9         9         9       9            9           <none>                                             18m
gpu-operator            nvidia-container-toolkit-daemonset               8         8         0       8            0           nvidia.com/gpu.deploy.container-toolkit=true       18m
gpu-operator            nvidia-mig-manager                               0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             17m
gpu-operator            nvidia-operator-validator                        8         8         0       8            0           nvidia.com/gpu.deploy.operator-validator=true      17m
gpu-operator            nvidia-device-plugin-daemonset                   8         8         0       8            0           nvidia.com/gpu.deploy.device-plugin=true           17m
gpu-operator            nvidia-dcgm-exporter                             8         8         0       8            0           nvidia.com/gpu.deploy.dcgm-exporter=true           17m
gpu-operator            gpu-feature-discovery                            8         8         0       8            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   17m
gpu-operator            nvidia-driver-daemonset                          0         0         0       0            0           nvidia.com/gpu.deploy.driver=true                  18m
  • [ ] nvidia-smi output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:00:05.0 Off |                    0 |
| N/A   25C    P0              73W / 700W |    150MiB / 81559MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:00:06.0 Off |                    0 |
| N/A   26C    P0              72W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:00:07.0 Off |                    0 |
| N/A   27C    P0              72W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:00:08.0 Off |                    0 |
| N/A   25C    P0              71W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:00:09.0 Off |                    0 |
| N/A   25C    P0              72W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:00:0A.0 Off |                    0 |
| N/A   28C    P0              71W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:00:0B.0 Off |                    0 |
| N/A   26C    P0              75W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:00:0C.0 Off |                    0 |
| N/A   24C    P0              75W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3575      G   /usr/lib/xorg/Xorg                           60MiB |
|    0   N/A  N/A      3866      G   /usr/bin/gnome-shell                         79MiB |
+---------------------------------------------------------------------------------------+

The containerd logs are too large to attach here.

eliphatfs, Feb 18 '24 09:02

cc @shivamerla @elezar

cdesiniotis, Feb 28 '24 18:02