Haicheng Wu
@kerrmudgeon
@alihassanijr, could you please help with this?
You are correct. We will fix it in the next upstream update. Thank you for catching this.
@jackkosaian, @apuaaChen could you please take a look?
@jackkosaian
Grouped GEMM is supported in the profiler. You could use the CUTLASS profiler to pick the best kernel. cc += @ANIKET-SHIVAM
Do you see a different kernel get picked when you change the SM count?
The warp tile size in K should be larger than the MMA instruction's K, so that the inner loop issues multiple MMAs and can use them to hide other latencies. cutlass...
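To make the warp-tile point concrete, here is a minimal sketch of the arithmetic. It is not CUTLASS code: the helper name `mma_steps_per_warp_tile` and the example sizes are assumptions chosen for illustration. It only counts how many MMA instructions along K fit into one warp-tile step.

```python
def mma_steps_per_warp_tile(warp_k: int, instruction_k: int) -> int:
    """Count MMA instructions issued along K per warp-tile iteration.

    Hypothetical helper for illustration only; not part of CUTLASS.
    """
    if warp_k <= 0 or instruction_k <= 0:
        raise ValueError("tile extents must be positive")
    if warp_k % instruction_k != 0:
        raise ValueError("warp tile K must be a multiple of instruction K")
    return warp_k // instruction_k


# Assumed example sizes: warp tile K = 32 with an instruction K of 16.
steps = mma_steps_per_warp_tile(32, 16)

# With warp K equal to instruction K there is only one MMA per step,
# so the inner loop has no second MMA to overlap with other work.
single = mma_steps_per_warp_tile(16, 16)
```

With the assumed sizes, `steps` is 2, so each inner-loop iteration has two independent MMAs along K to schedule around loads, while `single` is 1 and leaves nothing to overlap.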
> does it support fp8 gemm with 128x1 LHS scaling and 1x128 RHS scaling?

Yes.
Maybe ask the PyTorch team about this? cc += @jackkosaian