nvidia_gpu_prometheus_exporter
dutyCycle loses data
dutyCycle is the GPU utilization during the last "sample period" of the driver, according to NVIDIA docs:
Percent of time over the past sample period during which one or more kernels was executing on the GPU.
Utilization information for a device. Each sample period may be between 1 second and 1/6 second, depending on the product being queried.
https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t
So this can be a very short period. Prometheus scrape intervals are usually 10 seconds or longer, so data is almost certainly lost. Say a workload uses 100% of the GPU for one second, then sleeps for one second: the GPU is 50% busy on average. We don't know exactly when Prometheus will scrape, but there's a good chance it only ever sees 100% or 0%, so the recorded utilization will probably be wrong.
Instead it would be better to have a ..._seconds_total counter, as is done for CPU utilization: https://www.robustperception.io/understanding-machine-cpu-usage
This way we wouldn't lose data due to long Prometheus scrape intervals, but it would probably require some more work in the exporter (polling NVML at a higher frequency).
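A rough sketch of what that could look like, assuming the exporter keeps using gonvml. The metric name nvidia_gpu_duty_cycle_seconds_total, the 1-second poll interval, and the function names are illustrative assumptions, not the exporter's actual code:

```go
package main

import (
	"log"
	"time"

	"github.com/mindprince/gonvml"
	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical counter: seconds during which one or more kernels were
// executing on the GPU, accumulated by the exporter itself.
var gpuSecondsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "nvidia_gpu_duty_cycle_seconds_total",
		Help: "Seconds during which one or more kernels were executing on the GPU.",
	},
	[]string{"minor_number"},
)

// pollUtilization samples NVML at a short interval (1s here, well below a
// typical scrape interval) and converts the instantaneous duty cycle into
// accumulated busy seconds.
func pollUtilization(dev gonvml.Device, minor string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		// Percent of time busy over the driver's last sample period.
		gpuUtil, _, err := dev.UtilizationRates()
		if err != nil {
			log.Printf("UtilizationRates() error: %v", err)
			continue
		}
		// 100% utilization over a 1s poll interval adds 1 busy second.
		gpuSecondsTotal.WithLabelValues(minor).Add(float64(gpuUtil) / 100 * interval.Seconds())
	}
}
```

Prometheus can then take rate(nvidia_gpu_duty_cycle_seconds_total[1m]) to get average utilization without caring when the scrape happens.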
Yes, that's why it's marked as a Gauge and not as a Counter. Unfortunately, the NVML API doesn't provide counter-like values.
You could potentially use https://github.com/mindprince/gonvml/blob/b364b296c7320f5d3dc084aa536a3dba33b68f90/bindings.go#L250-L266 but that would make the exporter more complicated.
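For illustration, a minimal sketch of how that could be wired in, assuming the bindings linked above are surfaced as gonvml's Device.AverageGPUUtilization(since time.Duration) and that the exporter has a gauge like the hypothetical dutyCycleGauge below; the 10-second window is only meant to match a typical scrape interval:

```go
package main

import (
	"log"
	"time"

	"github.com/mindprince/gonvml"
	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical gauge standing in for the exporter's existing dutyCycle metric.
var dutyCycleGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "nvidia_gpu_duty_cycle",
		Help: "Average percent of time one or more kernels were executing on the GPU.",
	},
	[]string{"minor_number"},
)

// recordAverageDutyCycle reports the utilization averaged over `window`
// (backed by NVML's internal sample buffer) instead of the driver's most
// recent, possibly sub-second, sample period.
func recordAverageDutyCycle(dev gonvml.Device, minor string, window time.Duration) {
	avg, err := dev.AverageGPUUtilization(window)
	if err != nil {
		log.Printf("AverageGPUUtilization() error: %v", err)
		return
	}
	dutyCycleGauge.WithLabelValues(minor).Set(float64(avg))
}
```

This keeps the metric a Gauge but makes its value meaningful over the whole scrape interval rather than a single driver sample period.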