Matt Wong

Results: 1 issue by Matt Wong

This PR primarily creates optimized specializations of fused_add_rms_norm_kernel, which is used in many layernorms. It also includes a slightly optimized version of blockReduceSum/warpReduceSum that reduces the number of shuffles done when...
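For readers unfamiliar with the operations named in this summary, here is a rough reference sketch in plain Python. It is not the PR's CUDA code: the function names `fused_add_rms_norm` and `warp_reduce_sum`, the `eps` default, and the butterfly (XOR) shuffle pattern are assumptions chosen for illustration, and the sketch does not reproduce whatever shuffle savings the PR achieves. It only shows what a fused residual-add plus RMS normalization usually computes, and how a warp-style sum reduction combines lane values in log2(width) shuffle steps.

```python
import math


def fused_add_rms_norm(x, residual, weight, eps=1e-6):
    """Hypothetical reference semantics: add the residual, then RMS-normalize the sum."""
    # Step 1: the residual add that a fused kernel performs in the same pass.
    summed = [a + b for a, b in zip(x, residual)]
    # Step 2: mean of squares over the hidden dimension (the reduction step).
    mean_sq = sum(v * v for v in summed) / len(summed)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Step 3: scale by the reciprocal RMS and the per-channel weight.
    normed = [v * inv_rms * w for v, w in zip(summed, weight)]
    # Return both the normalized output and the updated residual.
    return normed, summed


def warp_reduce_sum(lanes):
    """Simulate a butterfly (XOR) shuffle sum over a power-of-two number of lanes."""
    vals = list(lanes)
    n = len(vals)
    offset = n // 2
    shuffles = 0
    while offset > 0:
        # Each lane adds the value held by its partner lane (index XOR offset).
        vals = [vals[i] + vals[i ^ offset] for i in range(n)]
        shuffles += 1
        offset //= 2
    # After the last step every lane holds the full sum; report lane 0.
    return vals[0], shuffles


if __name__ == "__main__":
    out, new_residual = fused_add_rms_norm([1.0, 2.0], [2.0, 2.0], [1.0, 1.0])
    print(out, new_residual)
    print(warp_reduce_sum(range(32)))
```

In this simulation a 32-lane reduction takes five shuffle steps, so the shuffle count is the natural quantity for a reduction helper to trim, which is the kind of saving the summary alludes to.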

action-required