
RuntimeError: Can't enable access between nodes 1 and 0

EASTERNTIGER opened this issue 1 year ago · 1 comment

Hi, I tried to convert a T5 model to TensorRT. I have a machine with 4 GPUs. In the python convert_checkpoint.py step I set tp_size=4, pp_size=1, and the TensorRT engine was built successfully. However, when I run the command mpirun --allow-run-as-root -np 4 python3 run.py, I get this error:

(error screenshot: RuntimeError: Can't enable access between nodes 1 and 0)

When I set tp_size=1, pp_size=1 in the convert_checkpoint.py step, I can run python3 run.py successfully. So how can I fix this problem? It seems to be related to the GPU setup, but I don't know what to change. I also found a similar issue, but when I added --use_custom_all_reduce disable to trtllm-build, it reported unrecognized arguments.
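As a side note for anyone hitting the same error: "Can't enable access between nodes" usually points to CUDA peer-to-peer (P2P) access failing between GPU pairs. The sketch below is a generic diagnostic (not part of TensorRT-LLM) that prints which GPU pairs support P2P; using torch.cuda.can_device_access_peer is just one convenient way to query this, and the helper takes the query function as a parameter so it can be exercised without GPUs.

```python
# Diagnostic sketch: build a matrix showing which GPU pairs allow
# peer-to-peer (P2P) access. The query function is injected so the
# matrix-building logic works without any GPUs present.

def p2p_matrix(num_devices, can_access):
    """Return a nested list m where m[i][j] is True if device i can
    directly access device j's memory (the diagonal is trivially True)."""
    return [
        [True if i == j else bool(can_access(i, j)) for j in range(num_devices)]
        for i in range(num_devices)
    ]

if __name__ == "__main__":
    try:
        import torch
        n = torch.cuda.device_count()
        if n >= 2:
            matrix = p2p_matrix(n, torch.cuda.can_device_access_peer)
            for i, row in enumerate(matrix):
                print(f"GPU {i}: " + " ".join("OK" if ok else "NO" for ok in row))
        else:
            print(f"Only {n} GPU(s) visible; a P2P check needs at least 2.")
    except ImportError:
        print("PyTorch not installed; cannot query CUDA devices.")
```

If any pair prints NO, a custom all-reduce path that assumes full P2P connectivity can fail with exactly this kind of error; the NVLink/PCIe topology (see nvidia-smi topo -m) determines which pairs support P2P.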

EASTERNTIGER avatar Jul 31 '24 09:07 EASTERNTIGER

Same problem; it seems that this argument has been removed in #2008.

OptimusV5 avatar Aug 08 '24 01:08 OptimusV5

Hi, @OptimusV5 @EASTERNTIGER Could you try removing this line: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_ipc_utils.py#L42
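For anyone who wants to try this workaround on an installed package rather than a source checkout, a small helper like the one below could neutralize that line in the installed copy of _ipc_utils.py. This is only a sketch: locating the module via importlib and commenting the line out (instead of deleting it, to keep line numbers stable) are my own choices, not anything from the TensorRT-LLM maintainers. Back up the file first.

```python
# Sketch: comment out a given line (1-indexed) in an installed module file,
# e.g. line 42 of tensorrt_llm/_ipc_utils.py as suggested above.
from pathlib import Path

def comment_out_line(path, lineno):
    """Prefix line `lineno` (1-indexed) of `path` with '# ' unless it is
    already a comment; return the resulting line for inspection."""
    lines = Path(path).read_text().splitlines(keepends=True)
    target = lines[lineno - 1]
    if not target.lstrip().startswith("#"):
        lines[lineno - 1] = "# " + target
    Path(path).write_text("".join(lines))
    return lines[lineno - 1]

if __name__ == "__main__":
    # Locate the installed module rather than hard-coding a site-packages path.
    import importlib.util
    try:
        spec = importlib.util.find_spec("tensorrt_llm._ipc_utils")
    except ModuleNotFoundError:
        spec = None
    if spec and spec.origin:
        print("Would patch:", spec.origin)
        # comment_out_line(spec.origin, 42)  # uncomment after backing up the file
    else:
        print("tensorrt_llm is not installed in this environment.")
```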

Kefeng-Duan avatar Aug 21 '24 08:08 Kefeng-Duan

@EASTERNTIGER @OptimusV5 This bug is known and has been fixed in both the main branch and v0.12; you can validate it with the main branch now or wait for the v0.12 release.

yuxianq avatar Aug 21 '24 08:08 yuxianq

@EASTERNTIGER @OptimusV5 It seems to be fixed here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/runtime/ipcUtils.cpp#L47. Please update your code and verify.

Kefeng-Duan avatar Aug 21 '24 09:08 Kefeng-Duan