azhpc-images

NC80adis_H100_v5 + ubuntu-hpc:2204: NV links inactive

Open LiYueqian-James opened this issue 2 months ago • 4 comments

Problem

When deploying NC80adis_H100_v5 in Sweden Central or East US with the HPC Ubuntu 22.04 marketplace image, I noticed that the two GPUs were not using NVLink and were communicating over PCIe instead.

The NVIDIA topology output shows no NVLinks: the GPU0/GPU1 entries read SYS rather than NV#.

nvidia-smi topo -m

        GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     NODE    0-39    0               N/A
GPU1    SYS      X      SYS     40-79   1               N/A
NIC0    NODE    SYS      X
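A check like this can be automated when cycling through many VMs. The following is a minimal sketch, assuming the matrix layout shown above (device columns first, then affinity columns); the function names `gpu_links` and `has_nvlink` are hypothetical helpers, not part of any NVIDIA tooling. It treats a GPU pair as NVLink-connected only when the cell reads `NV` followed by a number.

```python
import re

# Topology text copied from the affected VM in this report.
SAMPLE = """
        GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     NODE    0-39    0               N/A
GPU1    SYS      X      SYS     40-79   1               N/A
NIC0    NODE    SYS      X
"""

def gpu_links(topo_text):
    """Return {(row_gpu, col_gpu): link_type} for every GPU-to-GPU cell."""
    lines = [ln for ln in topo_text.splitlines() if ln.strip()]
    # The leading header tokens that look like device names are the matrix columns.
    devices = []
    for tok in lines[0].split():
        if re.fullmatch(r"(GPU|NIC)\d+", tok):
            devices.append(tok)
        else:
            break
    links = {}
    for ln in lines[1:]:
        cells = ln.split()
        row = cells[0]
        if not re.fullmatch(r"GPU\d+", row):
            continue  # skip NIC rows and any legend text
        for col, cell in zip(devices, cells[1:]):
            if col.startswith("GPU") and col != row:
                links[(row, col)] = cell
    return links

def has_nvlink(topo_text):
    """True only if every GPU pair is connected through NVLink (NV#)."""
    links = gpu_links(topo_text)
    return bool(links) and all(re.fullmatch(r"NV\d+", v) for v in links.values())

print(gpu_links(SAMPLE))
print(has_nvlink(SAMPLE))
```

In practice the text would come from running `nvidia-smi topo -m` on the VM; the sample is embedded here so the sketch runs without a GPU.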

On that same machine, if we deploy a vanilla Ubuntu 22.04 image and manually install the 570.x driver, the NVLinks show up as expected.

To build confidence, we stopped and started a VM with the HPC Ubuntu 22.04 image 50 times. Across those iterations the VM landed on about 35 different machines. In 29 of the 50 iterations (58%), nvidia-smi showed no NVLinks.

We then repeated the same experiment with vanilla Ubuntu 22.04 and a manually installed 570.x driver. In every iteration, the NVLinks were active.

There is a small caveat about the driver version used; see Additional details below.

Repro

This problem occurs roughly half the time. If you create two NC80adis_H100_v5 VMs with the image, you will likely see at least one VM with no NVLinks. I can also provide direct access to the problematic VMs internally.
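To put a number on "likely", here is a small sketch of the arithmetic. It assumes each VM fails independently with the same probability, which the report does not establish (failures may correlate with the host), and it uses the 29/50 rate observed in the experiment above. The helper name `p_at_least_one_bad` is hypothetical.

```python
from fractions import Fraction

def p_at_least_one_bad(p_bad, n_vms):
    """Probability that at least one of n_vms shows no NVLinks,
    ASSUMING independent VMs with identical failure probability p_bad."""
    return 1 - (1 - p_bad) ** n_vms

observed = Fraction(29, 50)  # failure rate from the 50-iteration experiment
for n in (1, 2, 3):
    print(n, float(p_at_least_one_bad(observed, n)))
```

Under that assumption, two VMs give 1 - (21/50)^2 = 2059/2500, or about an 82% chance of seeing at least one affected VM; with a flat 50% rate the figure would be 75%.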

Mitigation

After running sudo reboot on an affected VM, the NVLinks are visible again.

Additional details

The HPC Ubuntu image uses driver 570.172.08. When I installed the driver manually on a vanilla Ubuntu VM, the closest version I could get was 570.195.03, even though I explicitly specified 570.172.08 during installation:

wget https://developer.download.nvidia.com/compute/nvidia-driver/570.172.08/local_installers/nvidia-driver-local-repo-ubuntu2204-570.172.08_1.0-1_amd64.deb
sudo dpkg -i nvidia-driver-local-repo-ubuntu2204-570.172.08_1.0-1_amd64.deb
sudo cp /var/nvidia-driver-local-repo-ubuntu2204-570.172.08/nvidia-driver-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install -y nvidia-open-570=570.172.08-0ubuntu1
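Because the installed version can differ from the requested one, it may help to confirm what the driver actually reports after installation. This is a minimal sketch that relies on the `nvidia-smi --query-gpu=driver_version --format=csv,noheader` query; the helper names and the `EXPECTED` constant are illustrative assumptions.

```python
import subprocess

EXPECTED = "570.172.08"  # version shipped in the HPC image, per this report

def driver_versions(query_output):
    """Distinct driver versions from the query output (one line per GPU)."""
    return sorted({ln.strip() for ln in query_output.splitlines() if ln.strip()})

def matches(query_output, expected=EXPECTED):
    """True if every GPU reports exactly the expected driver version."""
    return driver_versions(query_output) == [expected]

def read_driver_versions():
    """Run nvidia-smi on the local machine; requires an installed NVIDIA driver."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return driver_versions(out)

# Example with the version the manual install ended up with on two GPUs:
print(matches("570.195.03\n570.195.03\n"))
```

On a VM, `read_driver_versions()` would supply the real values; the parsing is kept separate so it can be exercised without a GPU.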

LiYueqian-James, Oct 17 '25 22:10

Hi James,

Thanks for reporting this.

What do you get on those nodes with NVLinks missing by running:

nvidia-smi nvlink -s

darkwhite29, Oct 18 '25 20:10

Also, to confirm, could you run the same experiment on ND_A100 and ND_H100 to check whether this happens only on the NC series and not on those SKUs?

darkwhite29, Oct 18 '25 20:10

nvidia-smi nvlink -s output

GPU 0: NVIDIA H100 NVL (UUID: GPU-1b79b6b4-9d3d-609c-d31a-33ba36ce851d)
    NVML: Unable to retrieve NVLink information as all links are inActive
GPU 1: NVIDIA H100 NVL (UUID: GPU-dacad1e0-7ae0-2558-7cf1-b8cb4eae699b)
    NVML: Unable to retrieve NVLink information as all links are inActive
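The message above is easy to detect programmatically, for example in a boot-time health check. The sketch below is an assumption-laden illustration: it only looks for the "all links are inActive" text seen in this report and the `GPU <n>:` prefixes, and the function name is hypothetical. It works whether each GPU is on its own line or the output has been flattened onto one line.

```python
import re

INACTIVE_MSG = "all links are inActive"  # wording taken from the output above

SAMPLE = (
    "GPU 0: NVIDIA H100 NVL (UUID: GPU-1b79b6b4-9d3d-609c-d31a-33ba36ce851d) "
    "NVML: Unable to retrieve NVLink information as all links are inActive "
    "GPU 1: NVIDIA H100 NVL (UUID: GPU-dacad1e0-7ae0-2558-7cf1-b8cb4eae699b) "
    "NVML: Unable to retrieve NVLink information as all links are inActive"
)

def gpus_with_inactive_links(nvlink_text):
    """Return the indices of GPUs whose section contains the inactive-links message."""
    # Split just before each "GPU <n>:" marker so every chunk covers one GPU.
    chunks = re.split(r"(?=GPU \d+:)", nvlink_text)
    bad = []
    for chunk in chunks:
        m = re.match(r"GPU (\d+):", chunk)
        if m and INACTIVE_MSG in chunk:
            bad.append(int(m.group(1)))
    return bad

print(gpus_with_inactive_links(SAMPLE))
```

A non-empty result would indicate an affected VM, at which point the reboot mitigation described earlier could be applied manually.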

LiYueqian-James, Oct 19 '25 05:10

I tried the same experiment with Standard_ND96isr_H100_v5 and the HPC Ubuntu 22.04 image and did not see any issues. The NVLinks were always active.

LiYueqian-James, Oct 19 '25 07:10