Missing NVSwitch Bandwidth (RX / TX) in dcgm-exporter

Open suranchoi opened this issue 11 months ago • 4 comments

Ask your question

Version

  • GPU : H200
  • Server : XD 670
  • GPU Driver : 565.57
  • dcgm-exporter : 3.3.8 (build binary file)

Hello, I’m running a custom server (not a DGX) that has 8x NVIDIA H100 GPUs connected via NVSwitch. I’m using dcgm-exporter to monitor GPU metrics. To verify the NVSwitch bandwidth, I ran NCCL tests and also performed model training using DDP. While I can see NVLink traffic clearly, the NVSwitch traffic metrics remain at 0.

I also have dcgmi installed, which runs but doesn’t appear to expose any NVSwitch-specific data.

Could you clarify whether dcgm-exporter (or DCGM in general) supports NVSwitch metrics on non-DGX servers? If so, are there any extra steps or configurations needed to enable this? If not, is there another recommended approach or tool to measure NVSwitch traffic on a non-DGX system?

suranchoi avatar Jan 24 '25 08:01 suranchoi

Which metrics are you monitoring? Can you attach the output of the exporter? Is the libnvidia-nscq library installed?
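If it helps to answer the first two questions without sharing the full output, here is a minimal Python sketch that scans the exporter's Prometheus text output for NVSwitch series and reports whether they are absent, all zero, or carrying traffic. The URL `http://localhost:9400/metrics`, the `DCGM_FI_DEV_NVSWITCH` prefix, and the helper names are assumptions for illustration, so adjust them to your deployment.

```python
from urllib.request import urlopen

# Assumption: the exporter listens on localhost:9400; change this if your setup differs.
DEFAULT_URL = "http://localhost:9400/metrics"


def fetch_metrics(url=DEFAULT_URL, timeout=5):
    """Download the exporter's text output (hypothetical helper, not part of DCGM)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_sample(line):
    """Split one Prometheus text-format sample into (series, value)."""
    if "}" in line:
        # Label values may contain spaces, so split after the closing brace.
        head, _, tail = line.rpartition("}")
        series = head + "}"
    else:
        series, _, tail = line.partition(" ")
    # The first token after the series is the value; a timestamp may follow.
    return series, float(tail.split()[0])


def nvswitch_samples(metrics_text, prefix="DCGM_FI_DEV_NVSWITCH"):
    """Return (series, value) pairs whose metric name starts with the prefix."""
    samples = []
    for raw in metrics_text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if line.startswith(prefix):
            samples.append(parse_sample(line))
    return samples


def classify(metrics_text):
    """Return 'missing', 'all_zero', or 'nonzero' for the NVSwitch series."""
    samples = nvswitch_samples(metrics_text)
    if not samples:
        return "missing"
    if all(value == 0 for _, value in samples):
        return "all_zero"
    return "nonzero"
```

Running `classify(fetch_metrics())` on the exporter host distinguishes "the fields are not being collected at all" from "the fields are collected but report 0", which are likely different problems.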

glowkey avatar Jan 24 '25 20:01 glowkey

The metrics are "DCGM_FI_DEV_NVSWITCH_THROUGHPUT_TX/RX" or "DCGM_FI_DEV_NVSWITCH_LINK_THROUGHPUT_TX/RX" (I don't remember exactly which).

This is a customer environment, so I cannot attach the output.

And the libnvidia-nscq library (matching the GPU driver version) is installed.

suranchoi avatar Jan 26 '25 12:01 suranchoi

I also encountered this issue. Is there a solution?

Leefs avatar Jun 18 '25 02:06 Leefs

@Leefs I haven't tested it yet, but I heard that upgrading the GPU driver to version 580 fixes the bug.

suranchoi avatar Jun 23 '25 05:06 suranchoi