
nv_inference_count no longer includes gpu_uuid?

Open chriscarollo opened this issue 1 year ago • 3 comments

I have some Grafana graphs built on Triton's Prometheus metrics, and it appears that after a semi-recent update nv_inference_count no longer includes a gpu_uuid label (I see only "model" and "version"). My graph showing the number of inferences per GPU no longer works.

chriscarollo avatar Jul 26 '24 18:07 chriscarollo

Hi @chriscarollo, have you used the tritonserver --model-control-mode EXPLICIT ... (or POLL) feature to dynamically load/unload models before? I believe there may be a known inconsistency: models loaded at startup have no GPU_ID label on non-GPU metrics, while models loaded dynamically after the server has started do have GPU_ID labels applied to those non-GPU metrics.

Please let me know if you can consistently identify or reproduce this behavior one way or the other.
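One way to check this is to scan the server's metrics output for nv_inference_count samples that lack the gpu_uuid label. The snippet below is only a minimal sketch: the helper names (`parse_samples`, `samples_missing_label`) and the sample text, including its model names, GPU UUID, and counts, are made up for illustration and are not real Triton output.

```python
import re

# Matches one sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# Matches a single key="value" pair inside the braces
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text, metric):
    """Yield (labels, value) for every sample of `metric` in `text`."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        yield labels, float(m.group("value"))


def samples_missing_label(text, metric, label):
    """Return the label sets of `metric` samples that do not carry `label`."""
    return [labels for labels, _ in parse_samples(text, metric) if label not in labels]


# Hypothetical metrics text; every value here is invented for the example.
SAMPLE = """\
# TYPE nv_inference_count counter
nv_inference_count{model="model_a",version="1"} 12
nv_inference_count{gpu_uuid="GPU-hypothetical-0000",model="model_b",version="1"} 7
nv_inference_request_success{model="model_a",version="1"} 12
"""

missing = samples_missing_label(SAMPLE, "nv_inference_count", "gpu_uuid")
print(missing)
```

To run this against a live server, replace `SAMPLE` with the body fetched from the metrics endpoint (by default Triton serves it on port 8002 at /metrics; adjust if your deployment differs). Running it once right after startup and again after a model reload would show whether the label only appears for dynamically loaded models.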

rmccorm4 avatar Jul 31 '24 22:07 rmccorm4

I'm actually using model-control-mode POLL and it does appear that my gpu_id labels did come back after it detected new versions. So it does look like maybe only an issue on initial startup?

chriscarollo avatar Jul 31 '24 22:07 chriscarollo

Hi @chriscarollo, this is a known issue and has a proposed resolution in this PR: https://github.com/triton-inference-server/core/pull/321. Please chime in on the discussion with your use case, impact, etc.

rmccorm4 avatar Aug 01 '24 19:08 rmccorm4

Hi, this bug is affecting us. We recently switched from model control mode explicit to model control mode none, and unfortunately this change broke our Grafana dashboards 😞 Is there an estimated timeline for a fix?

itaispiegel avatar Feb 16 '25 08:02 itaispiegel