feat: Low Precision Allreduce for PCIe based GPU

Open kanghui0204 opened this issue 7 months ago • 21 comments

Previous PR: https://github.com/NVIDIA/TensorRT-LLM/pull/3851
Revert PR: https://github.com/NVIDIA/TensorRT-LLM/pull/4340

kanghui0204 avatar May 15 '25 02:05 kanghui0204

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"

hyukn avatar May 15 '25 02:05 hyukn

PR_Github #5257 [ run ] triggered by Bot

tensorrt-cicd avatar May 15 '25 03:05 tensorrt-cicd

/bot kill

hyukn avatar May 15 '25 08:05 hyukn

PR_Github #5313 [ kill ] triggered by Bot

tensorrt-cicd avatar May 15 '25 08:05 tensorrt-cicd

PR_Github #5257 [ run ] completed with state ABORTED

tensorrt-cicd avatar May 15 '25 08:05 tensorrt-cicd

PR_Github #5313 [ kill ] completed with state SUCCESS Successfully killed previous jobs for commit 8614f2c

tensorrt-cicd avatar May 15 '25 08:05 tensorrt-cicd

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"

hyukn avatar May 16 '25 00:05 hyukn

PR_Github #5420 [ run ] triggered by Bot

tensorrt-cicd avatar May 16 '25 00:05 tensorrt-cicd

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"

EmmaQiaoCh avatar May 16 '25 02:05 EmmaQiaoCh

PR_Github #5420 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #3956 (Partly Tested) completed with status: 'SUCCESS'

tensorrt-cicd avatar May 16 '25 02:05 tensorrt-cicd

PR_Github #5437 [ run ] triggered by Bot

tensorrt-cicd avatar May 16 '25 02:05 tensorrt-cicd

PR_Github #5437 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #3968 (Partly Tested) completed with status: 'SUCCESS'

tensorrt-cicd avatar May 16 '25 05:05 tensorrt-cicd

/bot run --disable-fail-fast --add-multi-gpu-test

hyukn avatar May 16 '25 05:05 hyukn

PR_Github #5460 [ run ] triggered by Bot

tensorrt-cicd avatar May 16 '25 05:05 tensorrt-cicd

Does this set of kernels support CUDA graphs? If the barrier flag is captured, graph replay may cause issues, since we depend on the values of the barrier flag and the comm buffer to make sure every GPU reaches the same barrier.

For the captured value, if the model has an odd number of allreduce (AR) calls, it may select the same peer comm buffer here: https://github.com/NVIDIA/TensorRT-LLM/pull/4344/files#diff-fd189077a08106939fbdaf23180ba0bc7d81d76279c632db9097381e8440b2c9R1422
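
To make the odd-count concern concrete, here is a minimal host-side sketch in Python. It is not taken from the PR: it assumes a double-buffered scheme in which each allreduce picks comm buffer `flag % 2` and the flag is incremented on the host, so a graph capture freezes the values seen during one forward pass. All function names are hypothetical.

```python
def eager_buffer_indices(num_allreduce, num_steps):
    """Buffer index chosen by each allreduce when the flag is read live at launch."""
    flag = 0
    steps = []
    for _ in range(num_steps):
        step = []
        for _ in range(num_allreduce):
            step.append(flag % 2)  # assumed rule: alternate between two comm buffers
            flag += 1
        steps.append(step)
    return steps


def replayed_buffer_indices(num_allreduce, num_steps):
    """Buffer index per allreduce when one captured step is replayed unchanged."""
    # Capture records the values from a single forward pass; replay reuses them.
    captured = eager_buffer_indices(num_allreduce, 1)[0]
    return [list(captured) for _ in range(num_steps)]


def has_back_to_back_reuse(steps):
    """True if two consecutive allreduce calls select the same comm buffer."""
    flat = [index for step in steps for index in step]
    return any(a == b for a, b in zip(flat, flat[1:]))


if __name__ == "__main__":
    for count in (3, 4):
        print(
            count,
            "eager reuse:", has_back_to_back_reuse(eager_buffer_indices(count, 4)),
            "replay reuse:", has_back_to_back_reuse(replayed_buffer_indices(count, 4)),
        )
```

Under these assumptions, eager execution always alternates buffers. Replay with an odd number of allreduce calls makes the last call of one replay and the first call of the next pick the same buffer, while an even count keeps alternating.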

yizhang-nv avatar May 16 '25 09:05 yizhang-nv

Do all our current kernels need to support CUDA graphs? I haven't tested these kernels on CUDA graphs.

kanghui0204 avatar May 16 '25 09:05 kanghui0204

/bot run --add-multi-gpu-test

hyukn avatar May 16 '25 10:05 hyukn

PR_Github #5503 [ run ] triggered by Bot

tensorrt-cicd avatar May 16 '25 10:05 tensorrt-cicd

PR_Github #5460 [ run ] completed with state ABORTED

tensorrt-cicd avatar May 16 '25 10:05 tensorrt-cicd

PR_Github #5503 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #4010 completed with status: 'FAILURE'

tensorrt-cicd avatar May 16 '25 20:05 tensorrt-cicd

/bot run --add-multi-gpu-test

kanghui0204 avatar May 17 '25 09:05 kanghui0204

/bot run

EmmaQiaoCh avatar May 18 '25 06:05 EmmaQiaoCh

PR_Github #5596 [ run ] triggered by Bot

tensorrt-cicd avatar May 18 '25 06:05 tensorrt-cicd

PR_Github #5596 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #4082 completed with status: 'FAILURE'

tensorrt-cicd avatar May 18 '25 08:05 tensorrt-cicd

/bot run --add-multi-gpu-test

hyukn avatar May 19 '25 00:05 hyukn

PR_Github #5644 [ run ] triggered by Bot

tensorrt-cicd avatar May 19 '25 01:05 tensorrt-cicd

PR_Github #5644 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #4124 completed with status: 'FAILURE'

tensorrt-cicd avatar May 19 '25 04:05 tensorrt-cicd

/bot run --add-multi-gpu-test

hyukn avatar May 19 '25 04:05 hyukn

PR_Github #5673 [ run ] triggered by Bot

tensorrt-cicd avatar May 19 '25 04:05 tensorrt-cicd

PR_Github #5673 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #4144 completed with status: 'SUCCESS'

tensorrt-cicd avatar May 19 '25 15:05 tensorrt-cicd