
RuntimeError: Can't enable access between nodes 1 and 0

EASTERNTIGER opened this issue 1 year ago · 1 comment

Hi, I tried to convert a T5 model to TensorRT. I have a machine with 4 GPUs. In the python convert_checkpoint.py step I set tp_size=4, pp_size=1, and the TensorRT engine was built successfully. However, when I run the command mpirun --allow-run-as-root -np 4 python3 run.py, I get this error:

(error screenshot: RuntimeError: Can't enable access between nodes 1 and 0)

When I set tp_size=1, pp_size=1 in the convert_checkpoint.py step, I can run python3 run.py successfully. So how can I fix this problem? It seems to be related to the GPU setup, but I don't know what to change. I also found a similar issue, but when I added --use_custom_all_reduce disable to trtllm-build, it reported unrecognized arguments.
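As a side note for anyone hitting the same error: "Can't enable access between nodes" usually points to CUDA peer-to-peer (P2P) access failing between GPU pairs. The sketch below is a generic diagnostic (not part of TensorRT-LLM) that prints which GPU pairs support P2P; using torch.cuda.can_device_access_peer is just one convenient way to query this, and the helper takes the query function as a parameter so it can be exercised without GPUs.

```python
# Diagnostic sketch: build a matrix showing which GPU pairs allow
# peer-to-peer (P2P) access. The query function is injected so the
# matrix-building logic works without any GPUs present.

def p2p_matrix(num_devices, can_access):
    """Return a nested list m where m[i][j] is True if device i can
    directly access device j's memory (the diagonal is trivially True)."""
    return [
        [True if i == j else bool(can_access(i, j)) for j in range(num_devices)]
        for i in range(num_devices)
    ]

if __name__ == "__main__":
    try:
        import torch
        n = torch.cuda.device_count()
        if n >= 2:
            matrix = p2p_matrix(n, torch.cuda.can_device_access_peer)
            for i, row in enumerate(matrix):
                print(f"GPU {i}: " + " ".join("OK" if ok else "NO" for ok in row))
        else:
            print(f"Only {n} GPU(s) visible; a P2P check needs at least 2.")
    except ImportError:
        print("PyTorch not installed; cannot query CUDA devices.")
```

If any pair prints NO, a custom all-reduce path that assumes full P2P connectivity can fail with exactly this kind of error; the NVLink/PCIe topology (see nvidia-smi topo -m) determines which pairs support P2P.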

EASTERNTIGER avatar Jul 31 '24 09:07 EASTERNTIGER

Same problem; it seems that this argument has been removed in #2008.

OptimusV5 avatar Aug 08 '24 01:08 OptimusV5

Hi, @OptimusV5 @EASTERNTIGER Could you try removing this line: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_ipc_utils.py#L42
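For anyone who wants to try this workaround on an installed package rather than a source checkout, a small helper like the one below could neutralize that line in the installed copy of _ipc_utils.py. This is only a sketch: locating the module via importlib and commenting the line out (instead of deleting it, to keep line numbers stable) are my own choices, not anything from the TensorRT-LLM maintainers. Back up the file first.

```python
# Sketch: comment out a given line (1-indexed) in an installed module file,
# e.g. line 42 of tensorrt_llm/_ipc_utils.py as suggested above.
from pathlib import Path

def comment_out_line(path, lineno):
    """Prefix line `lineno` (1-indexed) of `path` with '# ' unless it is
    already a comment; return the resulting line for inspection."""
    lines = Path(path).read_text().splitlines(keepends=True)
    target = lines[lineno - 1]
    if not target.lstrip().startswith("#"):
        lines[lineno - 1] = "# " + target
    Path(path).write_text("".join(lines))
    return lines[lineno - 1]

if __name__ == "__main__":
    # Locate the installed module rather than hard-coding a site-packages path.
    import importlib.util
    try:
        spec = importlib.util.find_spec("tensorrt_llm._ipc_utils")
    except ModuleNotFoundError:
        spec = None
    if spec and spec.origin:
        print("Would patch:", spec.origin)
        # comment_out_line(spec.origin, 42)  # uncomment after backing up the file
    else:
        print("tensorrt_llm is not installed in this environment.")
```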

Kefeng-Duan avatar Aug 21 '24 08:08 Kefeng-Duan

@EASTERNTIGER @OptimusV5 This bug is known and has been fixed in both the main branch and v0.12; you can validate it with the main branch now or wait for the v0.12 release.

yuxianq avatar Aug 21 '24 08:08 yuxianq

@EASTERNTIGER @OptimusV5 It seems to be fixed here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/runtime/ipcUtils.cpp#L47. Please update your code and verify.

Kefeng-Duan avatar Aug 21 '24 09:08 Kefeng-Duan