fix: AllReduce CUDA Graph Fix + Kernel Clean up

Open yizhang-nv opened this issue 9 months ago • 0 comments

This PR contains the following changes:

- Remove all allreduce kernels from customAllreduceKernels.cu except the pre_post_norm fusion kernel.
- Unify the workspace of the old and new fusion kernels.
- Fix a bug where the allreduce kernel may produce wrong results when CUDA graph is enabled.
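
For context on the last item: this description does not state the root cause of the CUDA graph bug. One general way such bugs arise is that kernel arguments are copied by value when a graph is captured, so any state the host advances between launches is frozen on replay. The host-only C++ sketch below simulates that failure class and is not the actual fix. `Workspace`, `expected_flag`, `device_flag`, and both `launch_*` functions are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Simulated allreduce workspace. Both fields are hypothetical.
struct Workspace {
    int expected_flag = 1;  // flag value peer ranks expect on the next launch
    int device_flag = 1;    // flag counter kept in (simulated) device memory
};

// Variant A: the synchronization flag arrives as a kernel argument.
bool launch_with_flag_argument(Workspace& ws, int flag) {
    bool in_sync = (flag == ws.expected_flag);
    ws.expected_flag += 1;  // peers move on to the next epoch
    return in_sync;
}

// Variant B: the kernel reads and advances the flag inside the workspace.
bool launch_with_flag_in_workspace(Workspace& ws) {
    bool in_sync = (ws.device_flag == ws.expected_flag);
    ws.device_flag += 1;    // state advances on every launch, replayed or not
    ws.expected_flag += 1;
    return in_sync;
}

// A "graph" here is a list of recorded launches with frozen arguments.
using Graph = std::vector<std::function<bool()>>;

// Replays the graph and counts launches that ran out of sync.
int replay(Graph& graph, int times) {
    int failures = 0;
    for (int i = 0; i < times; ++i) {
        for (auto& node : graph) {
            if (!node()) ++failures;
        }
    }
    return failures;
}

int failures_with_flag_argument(int replays) {
    Workspace ws;
    int host_flag = 1;
    Graph graph;
    // Capture: the current flag value is copied into the recorded node.
    const int captured_flag = host_flag;
    graph.push_back([&ws, captured_flag] {
        return launch_with_flag_argument(ws, captured_flag);
    });
    host_flag += 1;  // host-side update runs once at capture, never on replay
    return replay(graph, replays);
}

int failures_with_flag_in_workspace(int replays) {
    Workspace ws;
    Graph graph;
    graph.push_back([&ws] { return launch_with_flag_in_workspace(ws); });
    return replay(graph, replays);
}
```

In this simulation, only the first replay of the by-value variant is in sync, while the workspace variant stays in sync on every replay. Keeping per-launch state in memory that the kernel itself updates is one common way to make a kernel safe to replay.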

To merge this PR, we need to wait for the new allreduce kernels in #3064 and then refactor the current call site of allreduce.

Mar 25 '25 02:03 yizhang-nv