
NCCL graph and topology incompatible with A100

Open · r-b-g-b opened this issue on Mar 15, 2024 · 2 comments

I'm using the ubuntu-hpc 2204 x64 Gen 2 image on a Standard NC24ads A100 v4 VM.

I train a vLLM model that uses NCCL and observe the following error:

Error

::16674:16674 [0] NCCL INFO Bootstrap : Using eth0:10.1.0.4<0>
::16674:16674 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
::16674:16674 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
::16674:16674 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.6+cuda11.8
::16674:17023 [0] NCCL INFO NET/IB : Using [0]mlx5_an0:1/RoCE [RO]; OOB eth0:10.1.0.4<0>
::16674:17023 [0] NCCL INFO Using network IB
::16674:17023 [0] NCCL INFO comm 0x5640f24752d0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 commId 0x2064c00fc2f91516 - Init START
::16674:17023 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/microsoft/ncv4/topo.xml
::16674:17023 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
::16674:17023 [0] NCCL INFO NCCL_GRAPH_FILE set by environment to /opt/microsoft/ncv4/graph.xml

::16674:17023 [0] graph/search.cc:703 NCCL WARN XML Import Channel : dev 1 not found.
::16674:17023 [0] NCCL INFO graph/search.cc:733 -> 2
::16674:17023 [0] NCCL INFO graph/search.cc:740 -> 2
::16674:17023 [0] NCCL INFO graph/search.cc:840 -> 2
::16674:17023 [0] NCCL INFO init.cc:880 -> 2
::16674:17023 [0] NCCL INFO init.cc:1358 -> 2
::16674:17023 [0] NCCL INFO group.cc:65 -> 2 [Async thread]
::16674:16674 [0] NCCL INFO group.cc:406 -> 2
::16674:16674 [0] NCCL INFO group.cc:96 -> 2
Traceback (most recent call last):

...

self.llm_engine = LLMEngine.from_engine_args(engine_args)

File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 356, in from_engine_args engine = cls(*engine_configs, File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 111, in init self._init_workers() File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 151, in _init_workers self._run_workers("init_model") File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 983, in _run_workers driver_worker_output = getattr(self.driver_worker, File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/worker/worker.py", line 84, in init_model init_distributed_environment(self.parallel_config, self.rank, File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/worker/worker.py", line 253, in init_distributed_environment torch.distributed.all_reduce(torch.zeros(1).cuda()) File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper return func(*args, **kwargs) File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce work = group.allreduce([tensor], opts)

torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1702400366987/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
XML Import Channel : dev 1 not found.
::16674:16674 [0] NCCL INFO comm 0x5640f24752d0 rank 0 nranks 1 cudaDev 0 busId 100000 - Abort COMPLETE
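The failure does not look specific to vLLM; it surfaces in the single-rank all_reduce warm-up shown in the traceback. Here is a minimal sketch that should exercise the same NCCL init path, assuming NCCL_TOPO_FILE and NCCL_GRAPH_FILE are exported by the image as shown in the log above (the master address/port values are arbitrary placeholders):

# Minimal single-rank reproduction sketch; MASTER_ADDR/MASTER_PORT are placeholders.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ["NCCL_DEBUG"] = "INFO"

# One-rank NCCL communicator, mirroring vLLM's init_distributed_environment warm-up.
dist.init_process_group(backend="nccl", rank=0, world_size=1)
dist.all_reduce(torch.zeros(1).cuda())  # aborts with "XML Import Channel : dev 1 not found"
dist.destroy_process_group()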

This is a single-GPU machine, but /opt/microsoft/ncv4/graph.xml and /opt/microsoft/ncv4/topo.xml reference 4 GPUs. If I update them to refer to a single GPU, everything works; the updated files are shown below.

graph.xml
<graphs version="1">
  <graph id="0" pattern="4" crossnic="0" nchannels="2" speedintra="12" speedinter="12" latencyinter="0" typeintra="SYS" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
  </graph>
  <graph id="1" pattern="1" crossnic="0" nchannels="4" speedintra="12" speedinter="12" latencyinter="0" typeintra="SYS" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
  </graph>
  <graph id="2" pattern="3" crossnic="0" nchannels="4" speedintra="12" speedinter="12" latencyinter="0" typeintra="SYS" typeinter="PIX" samechannels="0">
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
    <channel>
      <gpu dev="0"/>
    </channel>
  </graph>
</graphs>
topo.xml
<system version="1">
  <cpu numaid="0" affinity="00000000,00000000,00ffffff" arch="x86_64" vendor="AuthenticAMD" familyid="175" modelid="1">
    <pci busid="0001:00:00.0" class="0x030200" vendor="0x10de" device="0x20b5" subsystem_vendor="0x10de" subsystem_device="0x1533" link_speed="" link_width="0">
      <gpu dev="0" sm="80" rank="0" gdr="1">
        <nvlink target="0002:00:00.0" count="12" tclass="0x030200"/>
      </gpu>
    </pci>
    <nic>
      <net name="eth0" dev="0" speed="100000" port="0" latency="0.000000" guid="0x0" maxconn="65536" gdr="0"/>
    </nic>
  </cpu>
</system>
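To confirm the mismatch on a given VM size, a quick sketch along these lines can be used (file paths taken from the NCCL log above; this is not part of the image tooling):

# Compare the number of visible GPUs with the GPU devices referenced by the
# graph/topology files that the image points NCCL at.
import re
import torch

print("Visible GPUs:", torch.cuda.device_count())  # 1 on Standard_NC24ads_A100_v4

for path in ("/opt/microsoft/ncv4/graph.xml", "/opt/microsoft/ncv4/topo.xml"):
    with open(path) as f:
        devs = sorted(set(re.findall(r'gpu dev="(\d+)"', f.read())))
    print(path, "references GPU devs:", devs)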

r-b-g-b · Mar 15, 2024