Yi Zhang
Results: 2 issues of Yi Zhang
This PR contains the following changes: 1. Remove all allreduce kernels from `customAllreduceKernels.cu` except the pre_post_norm fusion kernel. 2. Unify the workspace of the old and new fusion kernels. 3. Fix a bug...
This MR moves the current allreduce benchmark from the TRT flow to the PyTorch flow, with CUDA graph and norm fusion support.