NVLink metrics are not available on a GH200 GPU node
Ask your question
I am running dcgm-exporter in a Docker container on a GH200 GPU node. However, dcgm-exporter is not able to discover any NvSwitch or NvLink devices and, as a result, doesn't export any NvLink metrics. I'm using the nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 image, which is the latest. Does this version of dcgm-exporter support NvLink metrics on a GH200 GPU node? If yes, is any extra configuration required to get NvLink metrics?
Below are the dcgm-exporter container logs:
sudo docker logs dcgm-exporter
2024/05/31 03:42:29 maxprocs: Leaving GOMAXPROCS=72: CPU quota undefined
time="2024-05-31T03:42:29Z" level=info msg="Starting dcgm-exporter"
time="2024-05-31T03:42:29Z" level=info msg="DCGM successfully initialized!"
time="2024-05-31T03:42:29Z" level=info msg="Collecting DCP Metrics"
time="2024-05-31T03:42:29Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/default-counters.csv'"
time="2024-05-31T03:42:29Z" level=info msg="Initializing system entities of type: GPU"
time="2024-05-31T03:42:29Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2024-05-31T03:42:29Z" level=info msg="Not collecting NvSwitch metrics; no switches to monitor"
time="2024-05-31T03:42:29Z" level=info msg="Initializing system entities of type: NvLink"
time="2024-05-31T03:42:29Z" level=info msg="Not collecting NvLink metrics; no switches to monitor"
Note that nvswitches and nvlinks may not automatically be mounted inside the container. See https://github.com/NVIDIA/dcgm-exporter/issues/316#issuecomment-2087369233
Thank you for your reply. I tried mounting the NvSwitch and NvLink devices into the dcgm-exporter container by following https://github.com/NVIDIA/dcgm-exporter/issues/169#issuecomment-1604771610. However, I don't see any nvidia-nvswitch* or nvidia-nvlink device files under the /dev directory on the GH200 nodes. I also tried running the dcgm-exporter binary (built from source) directly, and it still couldn't discover any NvSwitches or NvLinks.
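For anyone who wants to repeat the device-file check on their own node, here is a minimal sketch of what I ran on the host. The function name list_nvlink_devices is made up for illustration; the file-name patterns are the ones mentioned in this thread. The nvidia-smi call at the end is optional and only runs if nvidia-smi is installed; I am assuming it is a reasonable way to ask the driver about link state, not that dcgm-exporter uses it.

```shell
#!/bin/sh
# List NvSwitch/NvLink device nodes under a directory (default: /dev).
# list_nvlink_devices is a hypothetical helper name used for this sketch.
list_nvlink_devices() {
    dev_dir="${1:-/dev}"
    found=0
    for node in "$dev_dir"/nvidia-nvswitch* "$dev_dir"/nvidia-nvlink*; do
        # An unmatched glob stays literal in sh, so check the path exists.
        if [ -e "$node" ]; then
            echo "$node"
            found=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no NvSwitch/NvLink device nodes under $dev_dir"
    fi
}

list_nvlink_devices /dev

# Optional: query per-link state from the driver when nvidia-smi is present.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi nvlink --status || true
fi
```

If the function prints no device nodes on the host, there is nothing to pass into the container with Docker's --device option, which matches what I see on the GH200 nodes.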