Yi Zhang

2 issues by Yi Zhang

This PR contains the following changes:

1. Remove all allreduce kernels from `customAllreduceKernels.cu` except the pre_post_norm fusion kernel.
2. Unify the workspace of the old and new fusion kernels.
3. Fix a bug...

This MR switches the current allreduce benchmark from the TRT flow to the PyTorch flow, with CUDA graph and norm fusion support.