
Question: is it possible to allow deployment when nvidia-smi returns a code different from 0?

Open ymazzer opened this issue 4 years ago • 14 comments

Hello,

I'm facing some issues trying to make a GPU available in a Kubernetes cluster.

Based on my investigation, the deployment process stalls at the nvidia-driver daemonset: the driver validator runs nvidia-smi and blocks the deployment whenever the command's return code is different from 0. However, in some cases nvidia-smi can return a non-zero code that is only a non-blocking warning (for instance, return code 14 should not be blocking).

Is there a way to allow warnings in the nvidia-smi output?

ymazzer avatar Nov 25 '21 12:11 ymazzer

could you show the output of nvidia-smi; echo $?, out of curiosity?

kpouget avatar Nov 25 '21 12:11 kpouget

Yes, here it is:

| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0
14

ymazzer avatar Nov 25 '21 12:11 ymazzer

RETURN VALUE

       Return code reflects whether the operation succeeded or failed and what
       was the reason of failure.

       ·      Return code 0 - Success
       ·      Return code 2 - A supplied argument or flag is invalid
       ·      Return code 3 - The requested operation is not available on target device
       ·      Return code 4 - The current user does not have permission to access this device or perform this operation
       ·      Return code 6 - A query to find an object was unsuccessful
       ·      Return code 8 - A device's external power cables are not properly attached
       ·      Return code 9 - NVIDIA driver is not loaded
       ·      Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
       ·      Return code 12 - NVML Shared Library couldn't be found or loaded
       ·      Return code 13 - Local version of NVML doesn't implement this function
       ·      Return code 14 - infoROM is corrupted
       ·      Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
       ·      Return code 255 - Other error or internal driver error occurred

from https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf
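
For illustration, here is a minimal shell sketch of a validation step that treats only the warning-class code 14 from the list above as non-fatal. This is not the operator's actual check, just the shape of the idea:

#!/usr/bin/env bash
# Run nvidia-smi and capture its exit code.
nvidia-smi
rc=$?

# Exit code 14 ("infoROM is corrupted") is only a warning and does not block
# workloads, so treat it as success; any other non-zero code remains fatal.
if [ "$rc" -eq 14 ]; then
    echo "WARNING: nvidia-smi exited with code 14 (infoROM corrupted), continuing" >&2
    exit 0
fi
exit "$rc"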

kpouget avatar Nov 25 '21 12:11 kpouget

@kpouget yes, I have read this doc, but return code 14 does not block workloads since it is just a warning; that is why I'm asking :)

ymazzer avatar Nov 25 '21 13:11 ymazzer

@ymazzer sure, I'm just gathering information around what is happening :)

kpouget avatar Nov 25 '21 13:11 kpouget

@ymazzer Yes, this is an interesting issue. We will look into handling these conditions. For now, you will need to build (make ubi8) a private image of gpu-operator-validator from here: https://github.com/NVIDIA/gpu-operator/tree/master/validator, after changing the nvidia-smi references in main.go to a more specific command (nvidia-smi -q or nvidia-smi -L) that does not emit this warning. If nothing works, you can use nvidia-smi -h as a workaround until we provide a way to handle this.
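
For reference, a rough sketch of that workaround; the make target comes from the comment above, while the registry, tag, and local image name below are placeholders and may differ from what the build actually produces:

# Clone the operator sources and switch to the validator component.
git clone https://github.com/NVIDIA/gpu-operator.git
cd gpu-operator/validator

# Edit main.go: replace the plain "nvidia-smi" invocation with
# "nvidia-smi -L" (or "nvidia-smi -q"), which does not emit the infoROM
# warning, then build the ubi8-based validator image.
make ubi8

# Retag the locally built image (placeholder name) and push it to a
# registry you control, then point the operator at that validator image.
docker tag <locally-built-validator-image> <your-registry>/gpu-operator-validator:fix-rc14
docker push <your-registry>/gpu-operator-validator:fix-rc14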

shivamerla avatar Nov 29 '21 00:11 shivamerla

@ymazzer If you have a running driver container, also look for errors/warnings in the device plugin; it might fail as well if nvidia-smi returns a non-zero exit code.

zvonkok avatar Nov 29 '21 06:11 zvonkok

Thanks for your support, I managed to make it work with a private build.

Do you plan to make it work natively? Should I open a pull request?

@zvonkok I didn't use the device plugin since the driver and the toolkit are installed on the host.

ymazzer avatar Nov 30 '21 12:11 ymazzer

@shivamerla, we got this issue in AWS as well, as part of a CI run:

[pod/nvidia-operator-validator-k2kq9/driver-validation] Wed Dec  1 13:54:20 2021       
[pod/nvidia-operator-validator-k2kq9/driver-validation] +-----------------------------------------------------------------------------+
[pod/nvidia-operator-validator-k2kq9/driver-validation] | NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
[pod/nvidia-operator-validator-k2kq9/driver-validation] |-------------------------------+----------------------+----------------------+
[pod/nvidia-operator-validator-k2kq9/driver-validation] | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
[pod/nvidia-operator-validator-k2kq9/driver-validation] | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
[pod/nvidia-operator-validator-k2kq9/driver-validation] |                               |                      |               MIG M. |
[pod/nvidia-operator-validator-k2kq9/driver-validation] |===============================+======================+======================|
[pod/nvidia-operator-validator-k2kq9/driver-validation] |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
[pod/nvidia-operator-validator-k2kq9/driver-validation] | N/A   33C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
[pod/nvidia-operator-validator-k2kq9/driver-validation] |                               |                      |                  N/A |
[pod/nvidia-operator-validator-k2kq9/driver-validation] +-------------------------------+----------------------+----------------------+
[pod/nvidia-operator-validator-k2kq9/driver-validation]                                                                                
[pod/nvidia-operator-validator-k2kq9/driver-validation] +-----------------------------------------------------------------------------+
[pod/nvidia-operator-validator-k2kq9/driver-validation] | Processes:                                                                  |
[pod/nvidia-operator-validator-k2kq9/driver-validation] |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
[pod/nvidia-operator-validator-k2kq9/driver-validation] |        ID   ID                                                   Usage      |
[pod/nvidia-operator-validator-k2kq9/driver-validation] |=============================================================================|
[pod/nvidia-operator-validator-k2kq9/driver-validation] |  No running processes found                                                 |
[pod/nvidia-operator-validator-k2kq9/driver-validation] +-----------------------------------------------------------------------------+
[pod/nvidia-operator-validator-k2kq9/driver-validation] WARNING: infoROM is corrupted at gpu 0000:00:1E.0
[pod/nvidia-operator-validator-k2kq9/driver-validation] command failed, retrying after 5 seconds

with all the Pods (except the driver) waiting for the driver gate to open:

NAME                                       READY   STATUS     RESTARTS   AGE   IP            NODE                           NOMINATED NODE   READINESS GATES
gpu-feature-discovery-8jj6g                0/1     Init:0/1   0          37m   10.130.2.2    ip-10-0-250-81.ec2.internal    <none>           <none>
gpu-operator-865c9fd8b9-qrxb4              1/1     Running    0          14m   10.128.2.15   ip-10-0-183-168.ec2.internal   <none>           <none>
nvidia-container-toolkit-daemonset-n5w7r   0/1     Init:0/1   0          37m   10.130.2.5    ip-10-0-250-81.ec2.internal    <none>           <none>
nvidia-dcgm-exporter-h65sx                 0/1     Init:0/2   0          37m   10.130.2.3    ip-10-0-250-81.ec2.internal    <none>           <none>
nvidia-dcgm-nvqn8                          0/1     Init:0/1   0          37m   10.0.250.81   ip-10-0-250-81.ec2.internal    <none>           <none>
nvidia-device-plugin-daemonset-f2k62       0/1     Init:0/1   0          37m   10.130.2.4    ip-10-0-250-81.ec2.internal    <none>           <none>
nvidia-driver-daemonset-pp4mv              1/1     Running    0          37m   10.130.2.7    ip-10-0-250-81.ec2.internal    <none>           <none>
nvidia-node-status-exporter-vpfbn          1/1     Running    0          37m   10.130.2.8    ip-10-0-250-81.ec2.internal    <none>           <none>
nvidia-operator-validator-k2kq9            0/1     Init:0/4   0          37m   10.130.2.6    ip-10-0-250-81.ec2.internal    <none>           <none>

kpouget avatar Dec 01 '21 14:12 kpouget

@kpouget created PR for this: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/354

It is not likely to get into the v1.9.0 release; I will update here.

shivamerla avatar Dec 01 '21 18:12 shivamerla

@kpouget can you try out this validator image registry.gitlab.com/shivamerla/gpu-operator/gpu-operator-validator:14ab83e6c2901d94736a52c12f82c9ce0b21a569-ubi8, passing the ENV as done in the PR? We need to make sure the plugin/GFD can handle these errors.

shivamerla avatar Dec 01 '21 20:12 shivamerla

@shivamerla, the image you linked is the build of the PR, right? I'll give it a try tomorrow in the CI (we build the validator image there)

kpouget avatar Dec 01 '21 20:12 kpouget

Yes, that's correct, it's from the CI.

shivamerla avatar Dec 01 '21 20:12 shivamerla

@shivamerla, couldn't the driver Pod check that nvidia-smi returns 0 before going into its infinite sleep?

Currently, on a broken GPU, the driver Pod is happy and does not mention any issue:

[pod/nvidia-driver-daemonset-8l7bp/nvidia-driver-ctr] Done, now waiting for signal
[pod/nvidia-driver-daemonset-8l7bp/nvidia-driver-ctr] + trap 'echo '\''Caught signal'\''; _shutdown && { kill 17979; exit 0; }' HUP INT QUIT PIPE TERM
[pod/nvidia-driver-daemonset-8l7bp/nvidia-driver-ctr] + sleep infinity
[pod/nvidia-driver-daemonset-8l7bp/nvidia-driver-ctr] + trap - EXIT
[pod/nvidia-driver-daemonset-8l7bp/nvidia-driver-ctr] + true
[pod/nvidia-driver-daemonset-8l7bp/nvidia-driver-ctr] + wait 17979

but the validation test fails:

WARNING: infoROM is corrupted at gpu 0000:00:1E.0
command failed, retrying after 5 seconds

and this log isn't even captured by the must-gather script; I'm not exactly sure why, but the validation pod log ends up empty (probably because the validation is done in an init container, though I'm not sure why it isn't captured by the --all-containers --prefix flags)

I think it would be simple and effective to block the driver Pod from reporting ready, and to surface the WARNING: infoROM is corrupted at gpu 0000:00:1E.0 message in its logs.
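
For illustration, a minimal sketch of what such a check could look like right before the infinite sleep shown in the trace above (this is not the actual driver-container entrypoint, just the shape of the idea):

# Surface nvidia-smi failures (including the infoROM warning) in the driver
# container log and refuse to enter the wait loop when the GPU is broken.
if ! output=$(nvidia-smi 2>&1); then
    echo "$output"          # keeps the WARNING line visible in the pod log
    echo "nvidia-smi failed, not marking the driver container as ready" >&2
    exit 1
fi

echo "Done, now waiting for signal"
sleep infinity &
wait $!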

kpouget avatar Feb 21 '22 20:02 kpouget