nvidia-exporter
Failed to collect metrics: nvml: Not Supported
Hi @BugRoger
When starting the exporter in k8s, the log always says:
Failed to collect metrics: nvml: Not Supported
Below is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.59                 Driver Version: 390.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000460E:00:00.0 Off |                    0 |
| N/A   37C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00006180:00:00.0 Off |                    0 |
| N/A   33C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
This error does not occur on another GPU machine that uses a GTX 1080.
Any clues or suggestions?
Additionally, a similar gpu-exporter runs into the same issue on Tesla cards, yet it still manages to work.
It also seems NVIDIA has acknowledged this behavior officially:
https://github.com/NVIDIA/nvidia-docker/issues/40 https://github.com/ComputationalRadiationPhysics/cuda_memtest/issues/16
So can we unblock it?
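For illustration, here is a minimal sketch of how "Not Supported" could be treated as a non-fatal condition, assuming bindings (such as gonvml) that surface NVML return codes as plain error strings; the isNotSupported helper is hypothetical, not existing exporter code:

package main

import (
	"fmt"
	"strings"
)

// isNotSupported reports whether an NVML error only means the metric is not
// available on this GPU (e.g. fan speed on a passively cooled Tesla K80).
// The bindings expose NVML return codes as plain error strings such as
// "nvml: Not Supported", so a substring match is the practical check here.
func isNotSupported(err error) bool {
	return err != nil && strings.Contains(err.Error(), "Not Supported")
}

func main() {
	fmt.Println(isNotSupported(fmt.Errorf("nvml: Not Supported"))) // true
	fmt.Println(isNotSupported(fmt.Errorf("nvml: Unknown Error"))) // false
}

With a check like this, the exporter could distinguish "this GPU simply does not expose the metric" from genuine collection failures.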
+1
nvml: Not Supported
+1 Tesla: 2019/10/25 11:24:19 Failed to collect metrics: nvml: Not Supported
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
On GTX 1060 and 1070 it works fine.
Will be submitting a PR shortly; however, to quickly explain the issue: it looks like not all metrics are supported on Tesla
cards via NVML (and there are likely other GPUs affected as well), yet the exporter handles this by returning early instead of continuing to collect the other metrics.
Example:
fanSpeed, err := device.FanSpeed()
if err != nil {
	return nil, err
}
Instead of return nil, err
we should just log (catch) the event and do something that does not interrupt the routine.
So currently, anything that shows N/A in your nvidia-smi output
would cause the above error and interrupt collection.
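For illustration, here is a minimal sketch of that log-and-continue approach. It assumes the mindprince/gonvml bindings (which match the device.FanSpeed() call and the "nvml: Not Supported" error string above); the DeviceMetrics type and collectDevice helper are hypothetical stand-ins, not the exporter's actual code:

package main

import (
	"log"

	"github.com/mindprince/gonvml"
)

// DeviceMetrics holds whatever metrics were successfully read (hypothetical
// type, standing in for the exporter's own metrics struct).
type DeviceMetrics struct {
	FanSpeed    float64
	Temperature float64
}

// collectDevice reads metrics from one GPU. Metrics the card reports as
// unsupported are logged and skipped instead of aborting the whole collection.
func collectDevice(device gonvml.Device) DeviceMetrics {
	var m DeviceMetrics

	if fanSpeed, err := device.FanSpeed(); err != nil {
		// Passively cooled cards like the Tesla K80 return
		// "nvml: Not Supported" here; log it and keep collecting.
		log.Printf("skipping fan speed: %v", err)
	} else {
		m.FanSpeed = float64(fanSpeed)
	}

	if temp, err := device.Temperature(); err != nil {
		log.Printf("skipping temperature: %v", err)
	} else {
		m.Temperature = float64(temp)
	}

	return m
}

func main() {
	if err := gonvml.Initialize(); err != nil {
		log.Fatalf("could not initialize NVML: %v", err)
	}
	defer gonvml.Shutdown()

	count, err := gonvml.DeviceCount()
	if err != nil {
		log.Fatalf("could not get device count: %v", err)
	}
	for i := uint(0); i < count; i++ {
		device, err := gonvml.DeviceHandleByIndex(i)
		if err != nil {
			log.Printf("skipping device %d: %v", i, err)
			continue
		}
		log.Printf("device %d: %+v", i, collectDevice(device))
	}
}

With this pattern the unsupported metrics are simply absent from the scrape, while collection of everything the K80 does report (temperature, memory, utilization, and so on) keeps working.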