
Multi-node inference: invalid device ordinal

thies1006 opened this issue 9 months ago · 2 comments

System Info

NCCL version: 2.19.3+cuda12.0
TensorRT-LLM version: 0.11.0.dev2024052100
OS: Ubuntu 22.04

Who can help?

@byshiue

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

I am trying to run Llama3-70B on two nodes with 8 L4 GPUs each. For this purpose I converted the model with TP=8, PP=2, and I checked that the mapping in config.json is correct: "world_size": 16, "tp_size": 8, "pp_size": 2, "gpus_per_node": 8.
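As a sanity check on that mapping, here is a minimal pure-Python sketch of how a TP=8, PP=2 layout distributes 16 ranks across two 8-GPU nodes. This is only an illustration of the expected rank-to-device assignment, not TensorRT-LLM's actual implementation; all names are made up for this example.

```python
# Illustrative TP x PP rank layout (not TensorRT-LLM code).
tp_size, pp_size, gpus_per_node = 8, 2, 8
world_size = tp_size * pp_size  # must match "world_size": 16 in config.json

layout = []
for rank in range(world_size):
    pp_rank = rank // tp_size             # which pipeline stage
    tp_rank = rank % tp_size              # position inside the tensor-parallel group
    node = rank // gpus_per_node          # which host mpirun places this rank on
    local_device = rank % gpus_per_node   # CUDA device ordinal on that host
    assert 0 <= local_device < gpus_per_node
    layout.append((rank, node, local_device, pp_rank, tp_rank))

# e.g. global rank 9 should end up as device 1 on node 1, in pipeline stage 1.
print(layout[9])  # (9, 1, 1, 1, 1)
```

If the runtime ever uses the global rank (8-15) directly as a local device ordinal instead of `rank % gpus_per_node`, the second node only has ordinals 0-7 and CUDA calls will fail.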

Now I run the example script with mpirun like this:

mpirun -H $HOST1:8,$HOST2:8 -n 16 python3 examples/run.py ...

Expected behavior

The script runs without errors.

Actual behavior

I see errors like this:

[TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceCanAccessPeer(&canAccessPeer, firstDeviceId, secondDeviceId): invalid device ordinal (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:341)

Additional notes

I made sure that nccl-tests ran across both nodes without any problems.

thies1006 · May 24 '24 15:05