nvidia_gpu_prometheus_exporter
dutyCycle loses data
dutyCycle is the GPU utilization during the last "sample period" of the driver, according to NVIDIA docs:
Percent of time over the past sample period during which one or more kernels was executing on the GPU.
Utilization information for a device. Each sample period may be between 1 second and 1/6 second, depending on the product being queried.
https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t
So this can be a very short period. Prometheus scrape intervals are usually 10 seconds or longer, so data is almost certainly lost. Say a workload uses 100% of the GPU for one second, then sleeps for one second: the GPU is 50% busy on average. We don't know exactly when Prometheus will scrape, but there's a good chance it only ever sees 100% or 0%, so the recorded utilization will probably be wrong.
Instead it would be better to have a ..._seconds_total counter, as is done for CPU utilization: https://www.robustperception.io/understanding-machine-cpu-usage
This way we wouldn't lose data due to long Prometheus scrape intervals, but it would probably require some more work in the exporter (polling NVML at a higher frequency).
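A rough sketch of what that could look like, assuming the exporter keeps using gonvml. The metric name nvidia_gpu_duty_cycle_seconds_total, the 1-second poll interval, and the function names are illustrative assumptions, not the exporter's actual code:

```go
package main

import (
	"log"
	"time"

	"github.com/mindprince/gonvml"
	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical counter: seconds during which one or more kernels were
// executing on the GPU, accumulated by the exporter itself.
var gpuSecondsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "nvidia_gpu_duty_cycle_seconds_total",
		Help: "Seconds during which one or more kernels were executing on the GPU.",
	},
	[]string{"minor_number"},
)

// pollUtilization samples NVML at a short interval (1s here, well below a
// typical scrape interval) and converts the instantaneous duty cycle into
// accumulated busy seconds.
func pollUtilization(dev gonvml.Device, minor string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		// Percent of time busy over the driver's last sample period.
		gpuUtil, _, err := dev.UtilizationRates()
		if err != nil {
			log.Printf("UtilizationRates() error: %v", err)
			continue
		}
		// 100% utilization over a 1s poll interval adds 1 busy second.
		gpuSecondsTotal.WithLabelValues(minor).Add(float64(gpuUtil) / 100 * interval.Seconds())
	}
}
```

Prometheus can then take rate(nvidia_gpu_duty_cycle_seconds_total[1m]) to get average utilization without caring when the scrape happens.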
Yes, that's why it's marked as a Gauge and not as a Counter. Unfortunately, the NVML API doesn't provide counter-like values.
You could potentially use https://github.com/mindprince/gonvml/blob/b364b296c7320f5d3dc084aa536a3dba33b68f90/bindings.go#L250-L266 but that would make the exporter more complicated.
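For illustration, a minimal sketch of how that could be wired in, assuming the bindings linked above are surfaced as gonvml's Device.AverageGPUUtilization(since time.Duration) and that the exporter has a gauge like the hypothetical dutyCycleGauge below; the 10-second window is only meant to match a typical scrape interval:

```go
package main

import (
	"log"
	"time"

	"github.com/mindprince/gonvml"
	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical gauge standing in for the exporter's existing dutyCycle metric.
var dutyCycleGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "nvidia_gpu_duty_cycle",
		Help: "Average percent of time one or more kernels were executing on the GPU.",
	},
	[]string{"minor_number"},
)

// recordAverageDutyCycle reports the utilization averaged over `window`
// (backed by NVML's internal sample buffer) instead of the driver's most
// recent, possibly sub-second, sample period.
func recordAverageDutyCycle(dev gonvml.Device, minor string, window time.Duration) {
	avg, err := dev.AverageGPUUtilization(window)
	if err != nil {
		log.Printf("AverageGPUUtilization() error: %v", err)
		return
	}
	dutyCycleGauge.WithLabelValues(minor).Set(float64(avg))
}
```

This keeps the metric a Gauge but makes its value meaningful over the whole scrape interval rather than a single driver sample period.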