feat: Low-Precision Allreduce for PCIe-based GPUs
Last PR: https://github.com/NVIDIA/TensorRT-LLM/pull/3851 Last revert PR: https://github.com/NVIDIA/TensorRT-LLM/pull/4340
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"
PR_Github #5257 [ run ] triggered by Bot
/bot kill
PR_Github #5313 [ kill ] triggered by Bot
PR_Github #5257 [ run ] completed with state ABORTED
PR_Github #5313 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 8614f2c
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"
PR_Github #5420 [ run ] triggered by Bot
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"
PR_Github #5420 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3956 (Partly Tested) completed with status: 'SUCCESS'
PR_Github #5437 [ run ] triggered by Bot
PR_Github #5437 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3968 (Partly Tested) completed with status: 'SUCCESS'
/bot run --disable-fail-fast --add-multi-gpu-test
PR_Github #5460 [ run ] triggered by Bot
Is this set of kernels considered to support CUDA graphs? If the barrier flag is captured, it may cause issues during graph replay, since we depend on the value of the barrier flag and the comm buffer to make sure every GPU reaches the same barrier.
With the captured value, if the model has an odd number of allreduce ops, it may select the same peer comm buffer here: https://github.com/NVIDIA/TensorRT-LLM/pull/4344/files#diff-fd189077a08106939fbdaf23180ba0bc7d81d76279c632db9097381e8440b2c9R1422
Do all our current kernels need to support CUDA graphs? I haven't tested these kernels with CUDA graphs.
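To make the buffer-parity concern above concrete, here is a minimal sketch (hypothetical code, not TensorRT-LLM internals): a ping-pong comm-buffer selector flips a flag once per allreduce, and if a CUDA graph bakes in the flag's value at capture time instead of re-reading it, then a model with an odd number of allreduce ops would reuse the same peer buffer across consecutive replays.

```python
# Hypothetical illustration of the captured-flag concern; names and
# structure are invented for this sketch, not taken from the PR.

class PingPongSelector:
    """Alternates between two peer comm buffers via a barrier/parity flag."""
    def __init__(self):
        self.flag = 0  # incremented after every allreduce

    def next_buffer(self):
        buf = self.flag % 2  # re-read the live flag each call
        self.flag += 1
        return buf

def eager_iteration(selector, num_allreduce):
    """Normal (non-captured) execution: flag state carries across iterations."""
    return [selector.next_buffer() for _ in range(num_allreduce)]

def replayed_iteration(captured_flag, num_allreduce):
    """Graph replay with the flag value frozen at capture time."""
    return [(captured_flag + i) % 2 for i in range(num_allreduce)]

# With an odd allreduce count (3), eager execution starts the second
# iteration on the opposite parity, but a frozen replay does not:
sel = PingPongSelector()
eager1 = eager_iteration(sel, 3)      # [0, 1, 0]
eager2 = eager_iteration(sel, 3)      # [1, 0, 1] -- parity flipped
replay1 = replayed_iteration(0, 3)    # [0, 1, 0]
replay2 = replayed_iteration(0, 3)    # [0, 1, 0] -- ends on buffer 0,
                                      # then immediately reuses buffer 0
```

The back-to-back reuse of the same peer buffer (last op of one replay and first op of the next) is exactly the hazard the comment raises: two barriers could race on one buffer instead of alternating.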
/bot run --add-multi-gpu-test
PR_Github #5503 [ run ] triggered by Bot
PR_Github #5460 [ run ] completed with state ABORTED
PR_Github #5503 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4010 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test
/bot run
PR_Github #5596 [ run ] triggered by Bot
PR_Github #5596 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4082 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test
PR_Github #5644 [ run ] triggered by Bot
PR_Github #5644 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4124 completed with status: 'FAILURE'
/bot run --add-multi-gpu-test
PR_Github #5673 [ run ] triggered by Bot
PR_Github #5673 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4144 completed with status: 'SUCCESS'