Question: is it possible to allow deployment when nvidia-smi returns a code different from 0?
Hello,
I'm facing some issues trying to make a GPU available in a Kubernetes cluster.
Based on my investigation, the deployment process stops with the nvidia-driver-daemonset blocked: the driver-validator runs nvidia-smi and halts the deployment whenever the command's return code is different from 0. However, in some cases nvidia-smi can return a non-zero code that is only a non-blocking warning (for instance, return code 14 should not be blocking).
Is there a way to allow warnings in the nvidia-smi output?
Could you show the output of nvidia-smi; echo $?, out of curiosity?
Yes, here it is:
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:0B:00.0 Off | 0 |
| N/A 42C P0 29W / 250W | 0MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0
14
RETURN VALUE
Return code reflects whether the operation succeeded or failed and what
was the reason of failure.
· Return code 0 - Success
· Return code 2 - A supplied argument or flag is invalid
· Return code 3 - The requested operation is not available on target device
· Return code 4 - The current user does not have permission to access this device or perform this operation
· Return code 6 - A query to find an object was unsuccessful
· Return code 8 - A device's external power cables are not properly attached
· Return code 9 - NVIDIA driver is not loaded
· Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
· Return code 12 - NVML Shared Library couldn't be found or loaded
· Return code 13 - Local version of NVML doesn't implement this function
· Return code 14 - infoROM is corrupted
· Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
· Return code 255 - Other error or internal driver error occurred
from https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf
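For illustration only (this is not the gpu-operator's actual validation code): the distinction the table above draws between warning-style codes such as 14 and genuinely fatal ones can be made explicit when invoking nvidia-smi. A minimal Go sketch, assuming nvidia-smi is on the PATH and that the set of tolerated codes is our own policy choice:

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// warningCodes lists nvidia-smi return codes (from the table above) that we
// choose to treat as non-fatal. Which codes to tolerate is a policy decision,
// not something nvidia-smi defines; 14 (corrupted infoROM) is the case hit here.
var warningCodes = map[int]string{
	14: "infoROM is corrupted",
}

func main() {
	cmd := exec.Command("nvidia-smi")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	err := cmd.Run()
	if err == nil {
		fmt.Println("validation succeeded (return code 0)")
		return
	}

	// The command ran but exited non-zero: classify the return code.
	if exitErr, ok := err.(*exec.ExitError); ok {
		code := exitErr.ExitCode()
		if reason, tolerated := warningCodes[code]; tolerated {
			fmt.Printf("validation warning: return code %d (%s), not treated as fatal\n", code, reason)
			return
		}
		fmt.Printf("validation failed: return code %d\n", code)
		os.Exit(code)
	}

	// The command could not be started at all (e.g. binary not found).
	fmt.Fprintln(os.Stderr, "failed to run nvidia-smi:", err)
	os.Exit(1)
}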
@kpouget Yes, I have read this doc, but return code 14 does not block workloads since it is just a warning; that is why I'm asking :)
@ymazzer Sure, I'm just gathering information about what is happening :)
@ymazzer Yes, this is an interesting issue. We will look into handling these conditions. For now you will need to build (make ubi8) a private image of gpu-operator-validator from here: https://github.com/NVIDIA/gpu-operator/tree/master/validator, after changing the nvidia-smi references in main.go to a more specific command (nvidia-smi -q or nvidia-smi -L) that does not throw this warning. If nothing works, you can use nvidia-smi -h as a workaround until we provide a way to handle this.
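For anyone applying the same workaround, the change amounts to something like the sketch below in the validator's driver check (a rough illustration, not the actual main.go): nvidia-smi -L only enumerates GPUs and, per the comment above, does not trip the warning exit code.

package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Use a narrower probe than a bare `nvidia-smi`: `-L` only lists the
	// installed GPUs, so a corrupted infoROM does not turn into exit code 14.
	// (Observed behaviour reported in this thread, not guaranteed by NVIDIA docs.)
	cmd := exec.Command("nvidia-smi", "-L")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "driver validation failed:", err)
		os.Exit(1)
	}
	fmt.Println("driver validation succeeded")
}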
@ymazzer If you have a running driver container, look for errors/warnings in the device-plugin; it might fail as well if nvidia-smi returns a non-zero exit code.
Thanks for your support, I managed to make it work with a private build.
Do you plan to make this work natively? Should I make a pull request?
@zvonkok I didn't use the device plugin since the driver and the toolkit are installed on the host.
@shivamerla, we got this issue on AWS as well, as part of a CI run:
[pod/nvidia-operator-validator-k2kq9/driver-validation] Wed Dec 1 13:54:20 2021
[pod/nvidia-operator-validator-k2kq9/driver-validation] +-----------------------------------------------------------------------------+
[pod/nvidia-operator-validator-k2kq9/driver-validation] | NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
[pod/nvidia-operator-validator-k2kq9/driver-validation] |-------------------------------+----------------------+----------------------+
[pod/nvidia-operator-validator-k2kq9/driver-validation] | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
[pod/nvidia-operator-validator-k2kq9/driver-validation] | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
[pod/nvidia-operator-validator-k2kq9/driver-validation] | | | MIG M. |
[pod/nvidia-operator-validator-k2kq9/driver-validation] |===============================+======================+======================|
[pod/nvidia-operator-validator-k2kq9/driver-validation] | 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
[pod/nvidia-operator-validator-k2kq9/driver-validation] | N/A 33C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
[pod/nvidia-operator-validator-k2kq9/driver-validation] | | | N/A |
[pod/nvidia-operator-validator-k2kq9/driver-validation] +-------------------------------+----------------------+----------------------+
[pod/nvidia-operator-validator-k2kq9/driver-validation]
[pod/nvidia-operator-validator-k2kq9/driver-validation] +-----------------------------------------------------------------------------+
[pod/nvidia-operator-validator-k2kq9/driver-validation] | Processes: |
[pod/nvidia-operator-validator-k2kq9/driver-validation] | GPU GI CI PID Type Process name GPU Memory |
[pod/nvidia-operator-validator-k2kq9/driver-validation] | ID ID Usage |
[pod/nvidia-operator-validator-k2kq9/driver-validation] |=============================================================================|
[pod/nvidia-operator-validator-k2kq9/driver-validation] | No running processes found |
[pod/nvidia-operator-validator-k2kq9/driver-validation] +-----------------------------------------------------------------------------+
[pod/nvidia-operator-validator-k2kq9/driver-validation] WARNING: infoROM is corrupted at gpu 0000:00:1E.0
[pod/nvidia-operator-validator-k2kq9/driver-validation] command failed, retrying after 5 seconds
with all the Pods (except the driver) waiting for the driver gate to open:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-feature-discovery-8jj6g 0/1 Init:0/1 0 37m 10.130.2.2 ip-10-0-250-81.ec2.internal <none> <none>
gpu-operator-865c9fd8b9-qrxb4 1/1 Running 0 14m 10.128.2.15 ip-10-0-183-168.ec2.internal <none> <none>
nvidia-container-toolkit-daemonset-n5w7r 0/1 Init:0/1 0 37m 10.130.2.5 ip-10-0-250-81.ec2.internal <none> <none>
nvidia-dcgm-exporter-h65sx 0/1 Init:0/2 0 37m 10.130.2.3 ip-10-0-250-81.ec2.internal <none> <none>
nvidia-dcgm-nvqn8 0/1 Init:0/1 0 37m 10.0.250.81 ip-10-0-250-81.ec2.internal <none> <none>
nvidia-device-plugin-daemonset-f2k62 0/1 Init:0/1 0 37m 10.130.2.4 ip-10-0-250-81.ec2.internal <none> <none>
nvidia-driver-daemonset-pp4mv 1/1 Running 0 37m 10.130.2.7 ip-10-0-250-81.ec2.internal <none> <none>
nvidia-node-status-exporter-vpfbn 1/1 Running 0 37m 10.130.2.8 ip-10-0-250-81.ec2.internal <none> <none>
nvidia-operator-validator-k2kq9 0/1 Init:0/4 0 37m 10.130.2.6 ip-10-0-250-81.ec2.internal <none> <none>
@kpouget I created a PR for this: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/354
It is not likely to make it into the v1.9.0 release; I will keep you updated.
@kpouget Can you try out this validator image, registry.gitlab.com/shivamerla/gpu-operator/gpu-operator-validator:14ab83e6c2901d94736a52c12f82c9ce0b21a569-ubi8, passing the ENV as done in the PR? We need to make sure the plugin/GFD can handle these errors.
@shivamerla, the image you linked is the build from the PR, right? I'll give it a try tomorrow in the CI (we build the validator image there).
Yes, that's correct, it's from the CI.
@shivamerla, couldn't the driver Pod check that nvidia-smi returns 0 before going into its infinite sleep?
Currently, on a broken GPU, the driver Pod is happy and does not mention any issue:
[pod/nvidia-driver-daemonset-8l7bp/nvidia-driver-ctr] Done, now waiting for signal
[pod/nvidia-driver-daemonset-8l7bp/nvidia-driver-ctr] + trap 'echo '\''Caught signal'\''; _shutdown && { kill 17979; exit 0; }' HUP INT QUIT PIPE TERM
[pod/nvidia-driver-daemonset-8l7bp/nvidia-driver-ctr] + sleep infinity
[pod/nvidia-driver-daemonset-8l7bp/nvidia-driver-ctr] + trap - EXIT
[pod/nvidia-driver-daemonset-8l7bp/nvidia-driver-ctr] + true
[pod/nvidia-driver-daemonset-8l7bp/nvidia-driver-ctr] + wait 17979
But the validation test fails:
WARNING: infoROM is corrupted at gpu 0000:00:1E.0
command failed, retrying after 5 seconds
And this log isn't even captured by the must-gather script; I'm not exactly sure why, but the validation pod log ends up empty (probably because the validation is done in an init container, though I'm not sure why it isn't captured by the --all-containers --prefix flags).
I think it would be simple and efficient to block the driver Pod from reporting as ready, and to have the WARNING: infoROM is corrupted at gpu 0000:00:1E.0 message in the logs.
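To sketch that idea (the real entrypoint is a shell script, as the trace above shows, so this Go snippet is only an illustration of the check, not a drop-in change): a one-shot check run before the infinite sleep could surface the WARNING line in the driver container's log and propagate nvidia-smi's return code, so the container never settles into its ready state on a broken GPU.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// One-shot check the driver entrypoint could run before `sleep infinity`.
	out, err := exec.Command("nvidia-smi").CombinedOutput()

	// Echo any WARNING lines so they show up in the driver container's log,
	// not only in the validator's init container.
	for _, line := range strings.Split(string(out), "\n") {
		if strings.HasPrefix(line, "WARNING:") {
			fmt.Println(line)
		}
	}

	if err != nil {
		code := 1
		if exitErr, ok := err.(*exec.ExitError); ok {
			code = exitErr.ExitCode()
		}
		// Exiting non-zero here keeps the container from reaching
		// "Done, now waiting for signal", so the Pod never reports Ready.
		fmt.Fprintf(os.Stderr, "nvidia-smi failed with return code %d\n", code)
		os.Exit(code)
	}
}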