Missing NVSwitch Bandwidth (RX / TX) in dcgm-exporter
Ask your question
Version
- GPU : H200
- Server : XD 670
- GPU Driver : 565.57
- dcgm-exporter : 3.3.8 (build binary file)
Hello, I’m running a custom server (not a DGX) that has 8x NVIDIA H100 GPUs connected via NVSwitch. I’m using dcgm-exporter to monitor GPU metrics. To verify the NVSwitch bandwidth, I ran NCCL tests and also performed model training using DDP. While I can see NVLink traffic clearly, the NVSwitch traffic metrics remain at 0.
I also have dcgmi installed, which runs but doesn’t appear to expose any NVSwitch-specific data.
Could you clarify whether dcgm-exporter (or DCGM in general) supports NVSwitch metrics on non-DGX servers? If so, are there any extra steps or configurations needed to enable this? If not, is there another recommended approach or tool to measure NVSwitch traffic on a non-DGX system?
Which metrics are you monitoring? Can you attach the output of the exporter? Is the libnvidia-nscq library installed?
The metrics are "DCGM_FI_DEV_NVSWITCH_THROUGHPUT_TX/RX" or "DCGM_FI_DEV_NVSWITCH_LINK_THROUGHPUT_TX/RX" (I don't remember exactly which).
This is a customer environment, so I cannot attach the output.
And yes, the libnvidia-nscq library (matching the GPU driver version) is installed.
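For anyone who cannot share raw exporter output, a small script can summarize the NVSwitch series without exposing anything else. The following is only a sketch: it assumes the exporter serves Prometheus text format, that the metric names start with `DCGM_FI_DEV_NVSWITCH` as named above, and that lines carry no trailing timestamp. The helper names and the default URL (`http://localhost:9400/metrics`) are illustrative assumptions, so adjust them to your deployment.

```python
import urllib.request

# Assumed common prefix of the NVSwitch metrics discussed in this thread.
NVSWITCH_PREFIX = "DCGM_FI_DEV_NVSWITCH"


def parse_nvswitch_samples(metrics_text):
    """Return (series, value) pairs for NVSwitch metrics in Prometheus text output."""
    samples = []
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines, HELP/TYPE comments, and unrelated metrics.
        if not line or line.startswith("#") or not line.startswith(NVSWITCH_PREFIX):
            continue
        # The value is the last space-separated token (assumes no timestamp);
        # splitting from the right keeps label values with spaces intact.
        series, _, value = line.rpartition(" ")
        try:
            samples.append((series, float(value)))
        except ValueError:
            continue  # Ignore lines whose value is not numeric.
    return samples


def summarize(samples):
    """Classify the NVSwitch series as 'absent', 'all_zero', or 'nonzero'."""
    if not samples:
        return "absent"
    if all(value == 0.0 for _, value in samples):
        return "all_zero"
    return "nonzero"


def fetch_metrics(url="http://localhost:9400/metrics", timeout=5):
    """Fetch the exporter's metrics page (the URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Running `summarize(parse_nvswitch_samples(fetch_metrics()))` while an NCCL test is active distinguishes the two failure modes: "absent" suggests the NVSwitch fields are not being exported at all (for example, not listed in the exporter's counters file), while "all_zero" matches the behavior reported in this issue.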
I also encountered this issue. Is there a solution?
@Leefs I haven't tested it yet, but I've heard that upgrading the GPU driver to version 580 fixes this bug.