TensorRT-LLM
Multi-node inference: invalid device ordinal
System Info
- NCCL version: 2.19.3+cuda12.0
- TensorRT-LLM version: 0.11.0.dev2024052100
- Ubuntu 22.04
Who can help?
@byshiue
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I am trying to run Llama3-70B on two nodes, each with 8 L4 GPUs. For this purpose I converted the model using TP=8, PP=2 (see the sketch below). I checked that the mapping in config.json is correct: "world_size": 16, "tp_size": 8, "pp_size": 2, "gpus_per_node": 8
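For completeness, a rough sketch of the conversion and build, following the standard examples/llama workflow (paths, dtype, and any additional flags are placeholders; the relevant part is `--tp_size 8 --pp_size 2`):

```bash
# Convert the HF checkpoint with tensor parallelism 8 and pipeline parallelism 2
python3 examples/llama/convert_checkpoint.py \
    --model_dir /path/to/Meta-Llama-3-70B \
    --output_dir /path/to/ckpt_tp8_pp2 \
    --dtype float16 \
    --tp_size 8 \
    --pp_size 2

# Build the engine from the converted checkpoint
trtllm-build \
    --checkpoint_dir /path/to/ckpt_tp8_pp2 \
    --output_dir /path/to/engine_tp8_pp2
```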
Now when I run the example script with mpirun like this:
mpirun -H $HOST1:8,$HOST2:8 -n 16 python3 examples/run.py ...
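the full command looks roughly like this (the run.py arguments below are illustrative placeholders, not my exact invocation):

```bash
mpirun -H $HOST1:8,$HOST2:8 -n 16 \
    python3 examples/run.py \
        --engine_dir /path/to/engine_tp8_pp2 \
        --tokenizer_dir /path/to/Meta-Llama-3-70B \
        --max_output_len 64 \
        --input_text "Hello"
```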
Expected behavior
The script runs without errors.
actual behavior
I see errors like these:
[TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceCanAccessPeer(&canAccessPeer, firstDeviceId, secondDeviceId): invalid device ordinal (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:341)
additional notes
I made sure that nccl-tests ran successfully and there were no problems.