feat: Low-Precision Allreduce for PCIe-based GPUs
Last PR: https://github.com/NVIDIA/TensorRT-LLM/pull/3851 Last revert PR: https://github.com/NVIDIA/TensorRT-LLM/pull/4340
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"
PR_Github #5257 [ run ] triggered by Bot
/bot kill
PR_Github #5313 [ kill ] triggered by Bot
PR_Github #5257 [ run ] completed with state ABORTED
PR_Github #5313 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 8614f2c
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"
PR_Github #5420 [ run ] triggered by Bot
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"
PR_Github #5420 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3956 (Partly Tested) completed with status: 'SUCCESS'
PR_Github #5437 [ run ] triggered by Bot
PR_Github #5437 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3968 (Partly Tested) completed with status: 'SUCCESS'
/bot run --disable-fail-fast --add-multi-gpu-test
PR_Github #5460 [ run ] triggered by Bot
Is this set of kernels considered to support CUDA graphs? If the barrier flag is captured, it may cause issues during graph replay, since we depend on the value of the barrier flag and the comm buffer to make sure every GPU reaches the same barrier.
With the captured value, if the model has an odd number of allreduce ops, it may select the same peer comm buffer here: https://github.com/NVIDIA/TensorRT-LLM/pull/4344/files#diff-fd189077a08106939fbdaf23180ba0bc7d81d76279c632db9097381e8440b2c9R1422
Do all our current kernels need to support CUDA graphs? I haven't tested these kernels with CUDA graphs.
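To make the buffer-parity concern above concrete, here is a minimal sketch (hypothetical code, not TensorRT-LLM internals): a ping-pong comm-buffer selector flips a flag once per allreduce, and if a CUDA graph bakes in the flag's value at capture time instead of re-reading it, then a model with an odd number of allreduce ops would reuse the same peer buffer across consecutive replays.

```python
# Hypothetical illustration of the captured-flag concern; names and
# structure are invented for this sketch, not taken from the PR.

class PingPongSelector:
    """Alternates between two peer comm buffers via a barrier/parity flag."""
    def __init__(self):
        self.flag = 0  # incremented after every allreduce

    def next_buffer(self):
        buf = self.flag % 2  # re-read the live flag each call
        self.flag += 1
        return buf

def eager_iteration(selector, num_allreduce):
    """Normal (non-captured) execution: flag state carries across iterations."""
    return [selector.next_buffer() for _ in range(num_allreduce)]

def replayed_iteration(captured_flag, num_allreduce):
    """Graph replay with the flag value frozen at capture time."""
    return [(captured_flag + i) % 2 for i in range(num_allreduce)]

# With an odd allreduce count (3), eager execution starts the second
# iteration on the opposite parity, but a frozen replay does not:
sel = PingPongSelector()
eager1 = eager_iteration(sel, 3)      # [0, 1, 0]
eager2 = eager_iteration(sel, 3)      # [1, 0, 1] -- parity flipped
replay1 = replayed_iteration(0, 3)    # [0, 1, 0]
replay2 = replayed_iteration(0, 3)    # [0, 1, 0] -- ends on buffer 0,
                                      # then immediately reuses buffer 0
```

The back-to-back reuse of the same peer buffer (last op of one replay and first op of the next) is exactly the hazard the comment raises: two barriers could race on one buffer instead of alternating.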
/bot run --add-multi-gpu-test
PR_Github #5503 [ run ] triggered by Bot
PR_Github #5460 [ run ] completed with state ABORTED
PR_Github #5503 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4010 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test
/bot run
PR_Github #5596 [ run ] triggered by Bot
PR_Github #5596 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4082 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test
PR_Github #5644 [ run ] triggered by Bot
PR_Github #5644 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4124 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test
PR_Github #5673 [ run ] triggered by Bot
PR_Github #5673 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4144 completed with status: 'SUCCESS'