TensorRT-LLM
fix: AllReduce CUDA Graph Fix + Kernel Clean up
This PR contains the following changes:
- Remove all allreduce kernels from customAllreduceKernels.cu except the pre_post_norm fusion kernel.
- Unify the workspace of the old and new fusion kernels.
- Fix a bug where the allreduce kernel may produce wrong results when CUDA graph is enabled.
To merge this PR, we first need to wait for the new allreduce kernels in #3064, and then refactor the current call site of allreduce.