
H100 system reports zero values for NVSwitch metrics DCGM_FI_DEV_NVSWITCH_LINK_THROUGHPUT_{TX,RX}

Open Leefs opened this issue 7 months ago • 0 comments

We are using DCGM to monitor NVSwitch performance on a system equipped with NVIDIA H100 GPUs (HGX platform). While NVLink-related metrics such as DCGM_FI_PROF_NVLINK_TX_BYTES and DCGM_FI_PROF_NVLINK_RX_BYTES report valid values, the NVSwitch-specific metrics:

DCGM_FI_DEV_NVSWITCH_LINK_THROUGHPUT_TX

DCGM_FI_DEV_NVSWITCH_LINK_THROUGHPUT_RX

consistently report zero (0) values across all switches and links.
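
For reference, here is a minimal sketch of how the all-zero condition could be checked. It assumes the fields are scraped as Prometheus-format text (for example from an exporter). The sample text, the label names (`gpu`, `nvswitch`, `nvlink`), the values, and the helper `all_zero_metrics` are hypothetical and only illustrate the check; they are not output captured from our system.

```python
import re
from collections import defaultdict

# Hypothetical sample in Prometheus text format. Label names and values are
# invented for illustration and were not captured from a real system.
SAMPLE = """\
# TYPE DCGM_FI_PROF_NVLINK_TX_BYTES gauge
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="0"} 1.25e+09
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="0"} 1.19e+09
DCGM_FI_DEV_NVSWITCH_LINK_THROUGHPUT_TX{nvswitch="0",nvlink="0"} 0
DCGM_FI_DEV_NVSWITCH_LINK_THROUGHPUT_TX{nvswitch="0",nvlink="1"} 0
DCGM_FI_DEV_NVSWITCH_LINK_THROUGHPUT_RX{nvswitch="0",nvlink="0"} 0
"""

# Metric name, optional {labels}, then the sample value.
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def all_zero_metrics(text):
    """Return the sorted names of metrics whose every sample equals zero."""
    values = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            values[match.group(1)].append(float(match.group(2)))
    return sorted(
        name for name, samples in values.items()
        if all(sample == 0.0 for sample in samples)
    )


if __name__ == "__main__":
    for name in all_zero_metrics(SAMPLE):
        print(f"{name}: all samples are zero")
```

Grouping by metric name rather than by label set makes it easy to see that the zero readings affect every switch and link, not just one.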

We have confirmed that the system includes NVSwitch interconnects via nvidia-smi topo --matrix, and the workload in question involves intensive GPU-to-GPU communication (via NCCL AllReduce), which should clearly traverse the NVSwitch fabric.

This suggests that either:

the above metrics are not currently populated on H100-based systems; or

DCGM does not yet support NVSwitch link throughput counters for this architecture.

Can you please confirm whether these fields are expected to work on H100 systems with NVSwitch? If support is still in progress, is there a timeline or recommended workaround?

Thank you!

Leefs, Jun 17 '25 09:06