NC80adis_H100_v5 + ubuntu-hpc:2204: NV links inactive
Problem
When deploying NC80adis_H100_v5 in Sweden Central/East US with the HPC Ubuntu 22.04 marketplace image, I noticed that the two GPUs were not using NVLink; they were communicating over PCIe instead.
The NVIDIA topology shows no NVLinks:
nvidia-smi topo -m
      GPU0  GPU1  NIC0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    SYS   NODE  0-39          0              N/A
GPU1  SYS    X    SYS   40-79         1              N/A
NIC0  NODE  SYS    X
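For scripted detection, the GPU0 row of `nvidia-smi topo -m` can be parsed directly: the GPU1 column reads `NV<n>` when NVLink is up and `SYS` when traffic crosses the sockets over PCIe. A minimal sketch, using the affected node's matrix as sample input (on a live node you would pipe `nvidia-smi topo -m` in instead):

```shell
# Sketch: classify the GPU0<->GPU1 path from `nvidia-smi topo -m` output.
# The sample heredoc is the affected node's matrix; on a live node use:
#   link=$(nvidia-smi topo -m | awk '$1=="GPU0" && $2=="X" {print $3}')
link=$(awk '$1=="GPU0" && $2=="X" {print $3}' <<'EOF'
      GPU0  GPU1  NIC0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    SYS   NODE  0-39          0              N/A
GPU1  SYS    X    SYS   40-79         1              N/A
NIC0  NODE  SYS    X
EOF
)
case "$link" in
  NV*) echo "NVLink active ($link)" ;;
  *)   echo "NVLink inactive; GPU0<->GPU1 path is $link" ;;
esac
```

The `$2=="X"` guard skips the header row, which also starts with "GPU0".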
On that same physical machine, if we deploy a vanilla Ubuntu 22.04 image and manually install the 570.x driver, the NVLinks show up just fine.
To build confidence, we started/stopped a VM with the HPC Ubuntu 22.04 image 50 times; the VM moved across ~35 different physical machines. In 29/50 iterations, nvidia-smi showed no NVLinks.
We then repeated the same experiment with vanilla Ubuntu 22.04 plus a manually installed 570.x driver. The NVLinks were active in every iteration.
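The stop/start experiment can be driven from the Azure CLI. A sketch, with placeholder resource group and VM names (my-rg, my-nc80); DRYRUN=1 just prints the commands so the loop can be sanity-checked without a subscription. Note that deallocating, rather than rebooting, is what releases the physical host and lets the VM land on a different machine:

```shell
# Sketch of the 50x stop/start experiment (placeholder names: my-rg, my-nc80).
# With DRYRUN=1 the commands are printed instead of executed.
DRYRUN=${DRYRUN:-1}
RG=my-rg
VM=my-nc80
run() { if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

i=1
while [ "$i" -le 50 ]; do
  run az vm deallocate -g "$RG" -n "$VM"   # releases the physical host
  run az vm start -g "$RG" -n "$VM"        # may land on a different machine
  # After boot, record whether the GPU0<->GPU1 path is NVLink:
  run ssh "$VM" "nvidia-smi topo -m"
  i=$((i + 1))
done
```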
There is a small caveat about the driver versions used; see Additional details below.
Repro
The problem occurs roughly 50% of the time: if you create two NC80adis VMs with the image, you will likely see at least one VM with no NVLinks. I can also provide direct access to the problematic VMs internally.
Mitigation
After a sudo reboot, the NVLinks come back.
Additional details
I noticed that the HPC Ubuntu image ships the 570.172.08 driver. When manually installing the driver on a vanilla Ubuntu VM, the closest version I could get was 570.195.03, even though I explicitly specified 570.172.08 during installation:
wget https://developer.download.nvidia.com/compute/nvidia-driver/570.172.08/local_installers/nvidia-driver-local-repo-ubuntu2204-570.172.08_1.0-1_amd64.deb
sudo dpkg -i nvidia-driver-local-repo-ubuntu2204-570.172.08_1.0-1_amd64.deb
sudo cp /var/nvidia-driver-local-repo-ubuntu2204-570.172.08/nvidia-driver-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install -y nvidia-open-570=570.172.08-0ubuntu1
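To confirm what actually landed (the pin above may be superseded by a newer build in the repo), both the package database and the running driver can be checked. A small sketch, assuming a Debian/Ubuntu host:

```shell
# Sketch: report the packaged vs. loaded driver version (Ubuntu host assumed).
pkg=$(dpkg-query -W -f='${Version}' nvidia-open-570 2>/dev/null || echo "not installed")
echo "nvidia-open-570 package: $pkg"
# The loaded driver (needs a GPU node; falls back gracefully elsewhere):
nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null \
  || echo "nvidia-smi unavailable; driver version unknown"
```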
Hi James,
Thanks for reporting this.
What do you get on the nodes with missing NVLinks by running:
nvidia-smi nvlink -s
Also, to double-check, could you run the same experiment on ND_A100 and ND_H100 to confirm this affects only the NC series and not those SKUs?
nvidia-smi nvlink -s output:
GPU 0: NVIDIA H100 NVL (UUID: GPU-1b79b6b4-9d3d-609c-d31a-33ba36ce851d)
NVML: Unable to retrieve NVLink information as all links are inActive
GPU 1: NVIDIA H100 NVL (UUID: GPU-dacad1e0-7ae0-2558-7cf1-b8cb4eae699b)
NVML: Unable to retrieve NVLink information as all links are inActive
Tried the same experiment with Standard_ND96isr_H100_v5 and the HPC Ubuntu 22.04 image. Did not see any issues; the NVLinks were active in every iteration.