
nvidia-cuda-validator pods crashlooping in OpenShift 4.6

koflerm opened this issue 4 years ago • 8 comments

Today we updated the GPU operator in one of our OpenShift clusters (version 4.6.35) to version 1.7.1. The upgrade involved uninstalling the old GPU operator version and installing everything from scratch. Since the update, however, the nvidia-cuda-validator pods have been crashing in the cuda-validation init container with the error "Failed to allocate device vector A (error code no CUDA-capable device is detected)!". All other components are running fine. Our ClusterPolicy looks like this:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  dcgmExporter:
    repository: dockerregistry.site.example.com/nvidia
    version: "2.1.8-2.4.0-rc.2-ubi8"
    image: dcgm-exporter
    nodeSelector:
      nvidia.com/gpu.deploy.dcgm-exporter: 'true'
  devicePlugin:
    repository: dockerregistry.site.example.com/nvidia
    version: v0.9.0-ubi8
    image: k8s-device-plugin
    nodeSelector:
      nvidia.com/gpu.deploy.device-plugin: 'true'
    env:
      - name: PASS_DEVICE_SPECS
        value: "true"
      - name: FAIL_ON_INIT_ERROR
        value: "true"
      - name: DEVICE_LIST_STRATEGY
        value: "volume-mounts"
      - name: DEVICE_ID_STRATEGY
        value: "uuid"
      - name: NVIDIA_VISIBLE_DEVICES
        value: "all"
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: "all"
  driver:
    repository: dockerregistry.site.example.com/nvidia
    version: 460.73.01
    image: driver
    nodeSelector:
      nvidia.com/gpu.deploy.driver: 'true'
    # attention - the version field for the driver must be defined without the -rhcos4.6 suffix
  gfd:
    repository: dockerregistry.site.example.com/nvidia
    version: v0.4.1
    image: gpu-feature-discovery
    nodeSelector:
      nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    env:
      - name: GFD_SLEEP_INTERVAL
        value: "60s" 
      - name: FAIL_ON_INIT_ERROR
        value: "true"       
  operator:
    defaultRuntime: crio
    deployGFD: true
    initContainer:
      image: cuda
      repository: dockerregistry.site.example.com/nvidia
      version: 11.4.0-base-ubi8
  validator:
    image: gpu-operator-validator
    repository: dockerregistry.site.example.com/nvidia
    version: v1.7.1
    nodeSelector:
      nvidia.com/gpu.deploy.operator-validator: 'true'
    env:
      - name: WITH_WORKLOAD
        value: "true"
  mig:
    strategy: mixed
  migManager:
    repository: dockerregistry.site.example.com/nvidia
    version: v0.1.0-ubi8
    image: k8s-mig-manager
    nodeSelector:
      nvidia.com/gpu.deploy.mig-manager: 'true'
    env:
      - name: WITH_REBOOT
        value: "false"
  toolkit:
    repository: dockerregistry.site.example.com/nvidia
    version: 1.5.0-ubi8
    image: container-toolkit
    nodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: 'true'
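
For reference, this is roughly how we pulled the error out of the crashlooping validator pods (the pod name below is just a placeholder from our cluster):

oc get pods -n gpu-operator-resources | grep nvidia-cuda-validator
oc logs -n gpu-operator-resources nvidia-cuda-validator-xxxxx -c cuda-validation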

Can you please help us with this issue?

koflerm avatar Sep 02 '21 14:09 koflerm

@koflerm Does the issue still persist? Both the plugin validation and the cuda validation use the same vectorAdd sample, so I'm wondering why one succeeded and the other failed.

shivamerla avatar Sep 07 '21 19:09 shivamerla

@shivamerla Yes, the problem still exists. Actually, I think the plugin validation has not even started yet: because the cuda-validator pods keep crashing, the nvidia-operator-validator pod (which runs the plugin-validation as its 4th init container) never reaches that init container, since it is still waiting for its 2nd init container to finish, and that one waits for the cuda-validator pods to complete successfully.
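
A quick way to see which init container the nvidia-operator-validator pod is currently stuck on (the pod name is a placeholder; the jsonpath simply dumps the raw per-container state):

oc get pod nvidia-operator-validator-xxxxx -n gpu-operator-resources -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'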

koflerm avatar Sep 08 '21 12:09 koflerm

@koflerm Can you copy the output of nvidia-smi run from any of the pods (plugin, gfd, etc.)? Also, could you run the test pod below and verify whether the necessary files are injected by the toolkit? It will keep looping in the init container, so we can run the commands below against it.

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: nvidia-cuda-validator
  generateName: nvidia-cuda-validator-
  namespace: gpu-operator-resources
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  restartPolicy: Never
  serviceAccount: nvidia-operator-validator
  initContainers:
  - name: cuda-validation
    image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.7.1
    imagePullPolicy: IfNotPresent
    command: ['sh', '-c']
    args: ["vectorAdd || sleep inf"]
    securityContext:
      allowPrivilegeEscalation: false
  containers:
    - name: nvidia-cuda-validator
      image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.7.1
      imagePullPolicy: IfNotPresent
      # override command and args as validation is already done by initContainer
      command: ['sh', '-c']
      args: ["echo cuda workload validation is successful && sleep inf"]
      securityContext:
        allowPrivilegeEscalation: false
EOF
oc exec <pod-name> -n gpu-operator-resources -c cuda-validation -- ls -la /dev/nvidia*
oc exec <pod-name> -n gpu-operator-resources -c cuda-validation -- ls -l /usr/lib64 | egrep -i "cuda|libnvidia"

shivamerla avatar Sep 08 '21 20:09 shivamerla

@shivamerla Here is the output I get when I run nvidia-smi on one of the device-plugin pods:

sh-4.4# nvidia-smi
Thu Sep  9 19:44:09 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:41:00.0 Off |                    0 |
| N/A   44C    P8    14W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:C4:00.0 Off |                    0 |
| N/A   55C    P8    16W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I also created the pod you mentioned above and it keeps crashing in the initContainer with the following error message:

Failed to allocate device vector A (error code no CUDA-capable device is detected)!
[Vector addition of 50000 elements]

koflerm avatar Sep 09 '21 19:09 koflerm

@koflerm Sorry, I had a typo in the debug pod spec; I have updated it. Can you try again and collect the output of the commands below?

oc exec <pod-name> -n gpu-operator-resources -c cuda-validation -- ls -la /dev/nvidia*
oc exec <pod-name> -n gpu-operator-resources -c cuda-validation -- ls -l /usr/lib64 | egrep -i "cuda|libnvidia"

Also, after collecting this, can you update the toolkit image to 1.6.0-ubi8 and confirm whether the issue still happens?
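
If it is easier, a merge patch like the following should bump just the toolkit version without re-applying the whole ClusterPolicy (assuming the resource is named gpu-cluster-policy as in your spec above):

oc patch clusterpolicy gpu-cluster-policy --type merge -p '{"spec":{"toolkit":{"version":"1.6.0-ubi8"}}}'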

shivamerla avatar Sep 10 '21 20:09 shivamerla

@shivamerla Same behaviour: it keeps crashing in the cuda-validation init container with the following error message:

Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]

Also, it seems like the toolkit is the problem, as the necessary files cannot be found:

> oc exec nvidia-cuda-validator-test-z4wss -n gpu-operator-resources -c cuda-validation -- ls -la /dev/nvidia*
ls: cannot access '/dev/nvidia*': No such file or directory
command terminated with exit code 2

> oc exec nvidia-cuda-validator-test-z4wss -n gpu-operator-resources -c cuda-validation -- ls -l /usr/lib64 | egrep -i "cuda|libnvidia"
>
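
We still need to verify that the toolkit daemonset itself is running and actually installed something on the node; the label selector and host path below are assumptions based on our environment and may differ:

oc get pods -n gpu-operator-resources -l app=nvidia-container-toolkit-daemonset
oc debug node/<gpu-node-name> -- chroot /host ls /usr/local/nvidia/toolkit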

koflerm avatar Sep 16 '21 08:09 koflerm

@koflerm Yes, this looks like a toolkit error. Can you edit the ClusterPolicy to use the latest toolkit, 1.7.1-ubi8? Also, can you run the same commands on the GFD pods to see whether they can see all GPU devices?

oc exec <pod-name> -n gpu-operator-resources -c gpu-feature-discovery -- ls -la /dev/nvidia*
oc exec <pod-name> -n gpu-operator-resources -c gpu-feature-discovery -- ls -l /usr/lib64 | egrep -i "cuda|libnvidia"

shivamerla avatar Sep 28 '21 01:09 shivamerla

@shivamerla I followed your instructions and ran the two exec commands above; here is the output I get. I think the problem is /dev/nvidia*:

[root@bastion william]# oc exec gpu-feature-discovery-pfj7l -n gpu-operator-resources -c gpu-feature-discovery -- ls -la /dev/nvidia*
ls: cannot access '/dev/nvidia*': No such file or directory
command terminated with exit code 2

[root@bastion william]# oc exec gpu-feature-discovery-pfj7l -n gpu-operator-resources -c gpu-feature-discovery -- ls -l /usr/lib64 | egrep -i "cuda|libnvidia"
lrwxrwxrwx. 1 root root 12 Oct 3 04:04 libcuda.so -> libcuda.so.1
lrwxrwxrwx. 1 root root 20 Oct 3 04:04 libcuda.so.1 -> libcuda.so.470.57.02
-rwxr-xr-x. 1 root root 19408552 Sep 23 2020 libcuda.so.450.80.02
-rwxr-xr-x. 1 root root 22267208 Sep 14 09:08 libcuda.so.470.57.02
lrwxrwxrwx. 1 root root 32 Oct 3 04:04 libnvidia-allocator.so.1 -> libnvidia-allocator.so.470.57.02
-rwxr-xr-x. 1 root root 98688 Sep 14 09:08 libnvidia-allocator.so.470.57.02
lrwxrwxrwx. 1 root root 26 Oct 3 04:04 libnvidia-cfg.so.1 -> libnvidia-cfg.so.470.57.02
-rwxr-xr-x. 1 root root 221552 Sep 14 09:08 libnvidia-cfg.so.470.57.02
-rwxr-xr-x. 1 root root 55972120 Sep 14 09:08 libnvidia-compiler.so.470.57.02
lrwxrwxrwx. 1 root root 25 Oct 3 04:04 libnvidia-ml.so.1 -> libnvidia-ml.so.470.57.02
-rwxr-xr-x. 1 root root 1828056 Sep 14 09:08 libnvidia-ml.so.470.57.02
lrwxrwxrwx. 1 root root 29 Oct 3 04:04 libnvidia-opencl.so.1 -> libnvidia-opencl.so.470.57.02
-rwxr-xr-x. 1 root root 18224920 Sep 14 09:08 libnvidia-opencl.so.470.57.02
lrwxrwxrwx. 1 root root 37 Oct 3 04:04 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.470.57.02
-rwxr-xr-x. 1 root root 9947144 Sep 23 2020 libnvidia-ptxjitcompiler.so.450.80.02
-rwxr-xr-x. 1 root root 11144376 Sep 14 09:08 libnvidia-ptxjitcompiler.so.470.57.02

william0212 avatar Oct 03 '21 06:10 william0212