
Should DCGM_FI_DEV_COUNT metric be a counter or a gauge?

Open iliakur opened this issue 2 years ago • 3 comments

The DCGM_FI_DEV_COUNT metric is exposed as a counter. Here's an example response:

# HELP DCGM_FI_DEV_COUNT Number of Devices on the node.
# TYPE DCGM_FI_DEV_COUNT counter
DCGM_FI_DEV_COUNT{gpu="0",UUID="GPU-8afe0f31-4207-33ec-7e08-af8774375fee",device="nvidia0",modelName="NVIDIA H100 PCIe",Hostname="iscxh001.mskcc.org",DCGM_FI_CUDA_DRIVER_VERSION="12020",DCGM_FI_DEV_BRAND="NVIDIA",DCGM_FI_DEV_MINOR_NUMBER="0",DCGM_FI_DEV_NAME="NVIDIA H100 PCIe",DCGM_FI_DEV_SERIAL="1650723017032",DCGM_FI_DRIVER_VERSION="535.104.12",DCGM_FI_PROCESS_NAME="/usr/local/sbin/dcgm-exporter"} 4

According to the docs for counter metrics, the value is supposed to increase monotonically.

What if one of the devices goes offline for some reason? Won't this decrease the value? Conceptually this looks more like a gauge to me.
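
To illustrate why the type matters, here is a minimal sketch in plain Python. It assumes a simplified model of how counter-aware query functions handle a drop: any decrease is read as a restart from zero. The function name and the sample values are hypothetical, and the logic is not taken from dcgm-exporter or Prometheus source code.

```python
def increase_with_reset_handling(samples):
    """Sum the growth of a counter series, treating any drop as a reset.

    Simplified assumption: when a sample is lower than the previous one,
    the counter is taken to have restarted at 0 and climbed to the new value.
    """
    total = 0.0
    prev = samples[0]
    for cur in samples[1:]:
        if cur < prev:
            total += cur          # drop interpreted as reset from zero
        else:
            total += cur - prev   # normal monotonic growth
        prev = cur
    return total


# Hypothetical scrape history: one of four devices goes offline.
device_count = [4, 4, 3, 3]

counter_view = increase_with_reset_handling(device_count)
gauge_view = device_count[-1] - device_count[0]

print(counter_view)  # counter semantics: looks like the value grew
print(gauge_view)    # gauge semantics: the value dropped by one
```

Under counter semantics the drop from 4 to 3 is read as a reset followed by growth, whereas under gauge semantics it is simply a decrease of one device, which is what actually happened.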

iliakur avatar Dec 29 '23 15:12 iliakur

@nvvfedorov could you share some details on what "closed as completed" means in this case?

iliakur avatar Mar 18 '24 12:03 iliakur

@iliakur, thank you for confirming that the issue is active.

nvvfedorov avatar Mar 18 '24 14:03 nvvfedorov

@iliakur, I went through the dcgm-exporter code base, and I don't see that we expose the "DCGM_FI_DEV_COUNT" metric in the files we ship as part of the release. Please let me know where you found the "DCGM_FI_DEV_COUNT" metric. Which dcgm-exporter version are you running, and how was it deployed?

nvvfedorov avatar Mar 18 '24 16:03 nvvfedorov