Enhancement Request: Additional GPU Information in Prometheus Metrics

Open levipereira opened this issue 9 months ago • 5 comments

Is your feature request related to a problem? Please describe.

No.

Currently, the triton-server provides GPU utilization metrics in Prometheus format, like so:

# HELP nv_gpu_utilization GPU utilization rate [0.0 - 1.0)
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-3fed825f-252b-32ea-e3d7-266c45b62ce7"} 0

I would like to request the inclusion of additional information, specifically the GPU number and GPU name, similar to what can be obtained using nvidia-smi -L. This information would greatly aid in creating dynamic Grafana dashboards without the need to consult additional identification information on the physical host.

Example output of nvidia-smi -L:

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-c8a1aa60-c24c-5ce2-fc43-068d14542d00)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-04727ce0-d35e-c535-9a43-b989af8d016f)

Including the GPU number and GPU name in the Prometheus metrics would improve the user experience and ease the dynamic creation of monitoring dashboards.
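Until such labels exist, the mapping can be approximated outside Triton. The following is a minimal Python sketch, not part of Triton: it parses `nvidia-smi -L` output into a UUID lookup table and appends two extra labels to a metric line. The label names `gpu_index` and `gpu_name` and both helper functions are hypothetical, chosen only for illustration.

```python
import re

# One line of `nvidia-smi -L` output, e.g.
# "GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-c8a1aa60-...)"
GPU_LINE = re.compile(
    r"^GPU (?P<index>\d+): (?P<name>.+?) \(UUID: (?P<uuid>GPU-[0-9a-fA-F-]+)\)$"
)


def parse_nvidia_smi_list(text):
    """Return {uuid: (index, name)} parsed from `nvidia-smi -L` output."""
    gpus = {}
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus[match["uuid"]] = (int(match["index"]), match["name"])
    return gpus


def add_gpu_labels(metric_line, gpus):
    """Append hypothetical gpu_index/gpu_name labels after the gpu_uuid label.

    Lines without a known gpu_uuid are returned unchanged. Label values are
    not escaped here; a real implementation would need to escape quotes and
    backslashes in the GPU name.
    """
    match = re.search(r'gpu_uuid="([^"]+)"', metric_line)
    if not match or match.group(1) not in gpus:
        return metric_line
    index, name = gpus[match.group(1)]
    extra = f',gpu_index="{index}",gpu_name="{name}"'
    return metric_line[: match.end()] + extra + metric_line[match.end():]


if __name__ == "__main__":
    listing = (
        "GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-c8a1aa60-c24c-5ce2-fc43-068d14542d00)\n"
        "GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-04727ce0-d35e-c535-9a43-b989af8d016f)\n"
    )
    gpus = parse_nvidia_smi_list(listing)
    line = 'nv_gpu_utilization{gpu_uuid="GPU-04727ce0-d35e-c535-9a43-b989af8d016f"} 0'
    print(add_gpu_labels(line, gpus))
```

A workaround like this has to run on the GPU host and be kept in sync with the scraped metrics, which is the extra step this request aims to remove.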

Thank you for considering this enhancement request.

Best regards, Levi Pereira

Oct 04 '23 19:10 levipereira

I'm going to take a crack at this.

Dec 20 '23 15:12 ClifHouck

@rmccorm4, what are your thoughts on this feature request? Let me know if you would like me to open a ticket.

Feb 20 '24 17:02 dyastremsky

@ClifHouck, did you have success with this enhancement? Thanks for working on this!

Feb 20 '24 17:02 dyastremsky

@dyastremsky Yes, but I ran into this bug: https://github.com/triton-inference-server/server/issues/6815

I've opened a PR to address it: https://github.com/triton-inference-server/core/pull/321

I was waiting for that to be resolved before opening another PR to address this issue.

Feb 20 '24 17:02 ClifHouck

Thanks for letting me know, Clif. I'll take a look.

Feb 20 '24 19:02 dyastremsky
