Haicheng Wu

Results 323 comments of Haicheng Wu

you are correct. we will fix it next time upstream. thank you for catching this.

group gemm is supported in the profiler. you could use cutlass profiler to pick the best kernel. cc += @ANIKET-SHIVAM

do you see a different kernel get picked when changing the sm count?

warp tile size k should be bigger than the mma instruction k so that we can run multiple mma instructions in the inner loop and use them to hide other latencies. cutlass...
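a minimal sketch of the arithmetic behind that comment, not cutlass code: the helper names below are hypothetical, and the sizes (warp tile k = 32, mma instruction k = 16) are only example values. it shows how many mma instructions along k one warp tile gives the inner loop to work with.

```cpp
// Hypothetical helpers for illustration only; these are not CUTLASS APIs.

// True when the warp tile K extent is a whole multiple of the MMA instruction K.
constexpr bool divides_evenly(int warp_tile_k, int mma_instruction_k) {
    return mma_instruction_k > 0 && warp_tile_k % mma_instruction_k == 0;
}

// Number of MMA instructions issued along K for one warp tile.
// Assumes divides_evenly(warp_tile_k, mma_instruction_k) holds.
constexpr int mma_iterations_along_k(int warp_tile_k, int mma_instruction_k) {
    return warp_tile_k / mma_instruction_k;
}

// Example values (assumed): warp tile K = 32, MMA instruction K = 16.
// Two independent MMAs per warp tile leave room to overlap other latencies.
static_assert(divides_evenly(32, 16), "warp tile K must be a multiple of MMA K");
static_assert(mma_iterations_along_k(32, 16) == 2, "expected two MMA steps along K");

// If warp tile K equals MMA K there is only one MMA per tile, so there is
// nothing else in the inner loop to interleave with.
static_assert(mma_iterations_along_k(16, 16) == 1, "single MMA step along K");
```

with a ratio of 1 the inner loop has a single mma per warp tile, which is the case the comment above advises against.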

> does it support fp8 gemm with 128x1 LHS scaling and 1x128 RHS scaling?

yes

maybe ask pytorch about this? cc += @jackkosaian