
Incorrect GPU Assignment in MPI Inter-Node Processing with Single GPU Nodes

Open • littlefatfat opened this issue 1 year ago • 3 comments

System Info

  • CPU architecture: x86_64
  • GPU name: NVIDIA Tesla T4
  • TensorRT-LLM version: v0.9.0
  • CUDA version: 12.3
  • OS: Ubuntu 22.04.3

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. cd examples/llama
  2. convert the checkpoint and build the TensorRT-LLM engines (tp=2)
  3. set up password-less SSH access between the containers (a per-rank GPU visibility check is sketched after this list)
  4. mpirun -n 2 --hostfile hostfile.txt --allow-run-as-root python3 ../run.py --max_output_len=160 --tokenizer_dir /host/huggingface/Llama-2-7b-chat-hf/ --engine_dir /tmp/kunlun/cache/models/trt_engines/llama/fp16/tp2-2gpu/ --input_text "In python, write a function for binary searching an element in an integer array."
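Before step 4, it can help to confirm what each rank actually sees on its own node. Below is a minimal sanity check (a hypothetical check_gpus.py, assuming mpi4py and torch are available inside the containers, launched with the same hostfile as step 4):

```python
# check_gpus.py -- hypothetical diagnostic, not part of the official example.
# Launch: mpirun -n 2 --hostfile hostfile.txt --allow-run-as-root python3 check_gpus.py
import socket

import torch
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
num_gpus = torch.cuda.device_count()  # GPUs visible on *this* node only

# With one T4 per node, both ranks should report visible_gpus=1,
# so the only valid ordinal for either rank is rank % num_gpus == 0.
print(f"rank={rank} host={socket.gethostname()} "
      f"visible_gpus={num_gpus} proposed_device={rank % num_gpus}")
```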

Expected behavior

The GPU index should be assigned based on rank % actual_number_of_GPUs.
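Concretely, for this two-node, one-GPU-per-node setup, the intended formula would map both ranks onto device 0 of their own node (a sketch of the expectation only, not of the current code):

```python
def expected_device(rank: int, gpus_on_this_node: int) -> int:
    """Device ordinal a rank should use, per the formula above."""
    return rank % gpus_on_this_node

# Two ranks (tp=2), one T4 visible per node:
assert expected_device(0, 1) == 0  # rank 0 -> GPU 0 on its node
assert expected_device(1, 1) == 0  # rank 1 -> GPU 0 on its node
```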

actual behavior

Rank 1 is incorrectly assigned to GPU 1, based on the assumption of 8 GPUs per node.

Error message:

RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaSetDevice(device): invalid device ordinal (/home/jenkins/agent/workspace/LLM/release-0.9/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/utils/sessionUtils.cpp:34)

additional notes

Description of the issue: when running MPI inter-node inference of Llama-2-7b with one T4 GPU per node (two machines, one GPU each, so two total ranks with tp=2), rank 1 attempts to access GPU 1, which does not exist on its node. Upon code inspection, both the C++ and Python implementations assume 8 GPUs per node (gpus_per_node defaults to 8). Instead, the GPU should be selected from the GPUs actually available on the node, using rank % num_gpus.
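For illustration only, the difference between the hard-coded default and the proposed dynamic lookup comes down to this (a sketch of the idea; the actual assignment happens inside the TensorRT-LLM runtime, not in user code):

```python
import torch

rank = 1                   # the failing rank in this report
gpus_per_node_default = 8  # value the v0.9 code paths assume

# What the current logic effectively computes: 1 % 8 == 1, an invalid
# ordinal on a node with a single T4, so cudaSetDevice(1) fails.
broken_device = rank % gpus_per_node_default

# What this issue proposes: query the node's real GPU count instead.
actual_gpus = torch.cuda.device_count()  # 1 on these nodes
proposed_device = rank % actual_gpus     # 1 % 1 == 0, the only valid ordinal

print(f"assumed 8 per node -> device {broken_device}, dynamic -> device {proposed_device}")
```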

littlefatfat • Apr 24 '24 09:04

I will submit a suggestion for a fix soon. After the change, the Python implementation works and inference succeeds. The C++ version gets past the initial error after the modification, but then hits other errors, including an NCCL error (running with NCCL_DEBUG=INFO gives more details):

Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:86 'unhandled system error (run with NCCL_DEBUG=INFO for details)'

sinian-t4-devel:4530:4530 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer tensorrt_llm_1gpu-devel-ruiyanm.test_network<33936>

littlefatfat • Apr 24 '24 09:04

@Funatiq, could you please take a look at the PR?

MartinMarciniszyn • May 13 '24 13:05

gpus_per_node can be defined when building the engine. It is also stored in the config.json file. Could you please try setting the desired value in the config file, or building the engine with the corresponding parameter?

Funatiq • May 17 '24 08:05
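For anyone hitting the same problem on an already-built engine, the config-file route Funatiq mentions might look roughly like this (a sketch under assumptions: the engine directory is the one from the reproduction above, and the gpus_per_node key sits under the pretrained config's mapping section; the exact location can differ between versions, so check the file first):

```python
import json

config_path = "/tmp/kunlun/cache/models/trt_engines/llama/fp16/tp2-2gpu/config.json"

with open(config_path) as f:
    config = json.load(f)

# Assumed location of the key; search the file for "gpus_per_node" if it
# lives elsewhere in your version. One GPU per node in this setup.
config["pretrained_config"]["mapping"]["gpus_per_node"] = 1

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)
```

The other option mentioned above is to pass the same value at engine-build time, which writes it into config.json in the first place.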