Incorrect GPU Assignment in Multi-Node MPI Runs with Single-GPU Nodes
System Info
- CPU architecture: x86_64
- GPU name: NVIDIA Tesla T4
- TensorRT-LLM version: v0.9.0
- CUDA version: 12.3
- OS: Ubuntu 22.04.3
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- cd examples/llama
- convert the checkpoint and build the TRT-LLM engines with tp_size = 2
- set up password-less SSH access between the two containers
- mpirun -n 2 --hostfile hostfile.txt --allow-run-as-root python3 ../run.py --max_output_len=160 --tokenizer_dir /host/huggingface/Llama-2-7b-chat-hf/ --engine_dir /tmp/kunlun/cache/models/trt_engines/llama/fp16/tp2-2gpu/ --input_text "In python, write a function for binary searching an element in an integer array."
Expected behavior
The GPU index should be assigned as rank % actual_number_of_GPUs_on_the_node, so on two single-GPU nodes both ranks map to local GPU 0.
Actual behavior
Rank 1 is incorrectly assigned to GPU 1, based on the hard-coded assumption of 8 GPUs per node.
Error message:
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaSetDevice(device): invalid device ordinal (/home/jenkins/agent/workspace/LLM/release-0.9/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/utils/sessionUtils.cpp:34)
Additional notes
Description of the issue: when running a multi-node MPI job with one T4 GPU per node (llama2-7b, two machines, tp=2, two total ranks), rank 1 attempts to access GPU 1 even though each node only exposes GPU 0. Upon code inspection, both the C++ and Python implementations default to assuming 8 GPUs per node (gpus_per_node set to 8). Instead, the GPU should be selected based on the number of GPUs actually available on the node, using rank % num_gpus; a minimal sketch of the intended mapping is shown below.
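For illustration only, here is a minimal sketch of that mapping. This is not the actual TensorRT-LLM code path; it assumes mpi4py and PyTorch are available in the environment.

```python
# Illustrative sketch only (not TensorRT-LLM code); assumes mpi4py and PyTorch.
from mpi4py import MPI
import torch

rank = MPI.COMM_WORLD.Get_rank()

# Current behavior: a hard-coded default of 8 GPUs per node.
GPUS_PER_NODE_DEFAULT = 8
buggy_device = rank % GPUS_PER_NODE_DEFAULT   # rank 1 -> GPU 1, which does not exist on a single-GPU node

# Proposed behavior: use the number of GPUs actually visible on this node.
num_gpus = torch.cuda.device_count()          # 1 on each T4 node in this setup
device = rank % num_gpus                      # rank 1 -> GPU 0 on its own node

torch.cuda.set_device(device)
print(f"rank {rank}: selecting local GPU {device} of {num_gpus}")
```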
I will submit a suggested fix soon. With the change, the Python implementation runs inference successfully. The C++ version gets past the initial error after the modification, but then triggers other errors, including an NCCL error (run with NCCL_DEBUG=INFO for more details):
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:86 'unhandled system error (run with NCCL_DEBUG=INFO for details)'
sinian-t4-devel:4530:4530 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer tensorrt_llm_1gpu-devel-ruiyanm.test_network<33936>
@Funatiq, could you please take a look at the PR?
gpus_per_node can be defined when building the engine. It is also stored in the config.json file. Can you please try to set the desired value in the config file, or build the engine with the corresponding parameter?
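For reference, a hedged sketch of the config-file route suggested above. The exact location of the gpus_per_node key inside config.json differs between TensorRT-LLM versions, so this hypothetical helper simply rewrites every gpus_per_node entry it finds; the engine path is the one from the reproduction command. Verify the resulting file before re-running.

```python
# Hypothetical helper: set gpus_per_node in the engine's config.json.
# The key's exact nesting varies between TensorRT-LLM versions, so this
# walks the whole config and updates the key wherever it appears.
import json

config_path = "/tmp/kunlun/cache/models/trt_engines/llama/fp16/tp2-2gpu/config.json"

def set_gpus_per_node(node, value):
    """Recursively set every 'gpus_per_node' entry found in nested dicts/lists."""
    if isinstance(node, dict):
        for key, child in node.items():
            if key == "gpus_per_node":
                node[key] = value
            else:
                set_gpus_per_node(child, value)
    elif isinstance(node, list):
        for child in node:
            set_gpus_per_node(child, value)

with open(config_path) as f:
    config = json.load(f)

set_gpus_per_node(config, 1)  # one GPU per node in this two-node setup

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)
```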