Matt Wong
Results
1
issues of
Matt Wong
This PR primarily creates optimized specializations of fused_add_rms_norm_kernel, used in many layernorms. It also includes a slightly optimized version of blockReduceSum/warpReduceSum which slightly reduce the number of shuffles done when...
action-required