cutlass
b2b gemm residual
Add residual support for the shmem staging iterator used in back-to-back GEMM fusion. This allows support of a problem_size_0_n that is not a multiple of 32.
@danthe3rd , would you please give it a try?
Looking quickly through the code it looks great! Thanks a lot for putting that up so fast :)
Unfortunately, I'll be away for the entire month of August. I can give it a try in September, and I'll also try to find someone else to do it in my absence, but in the meantime feel free to merge. If it passes the tests for values of problemSize1.K like [16, 32, 34, 36, 48, 64, 80], it should work for my use case.
Once again thanks a lot :)
So I gave it a shot and it's working great! Thanks for putting this together :)
There is just one issue when problem_size_0_n > 32 && problem_size_0_n % 2 == 1 (for instance problem_size_0_n = 33): it triggers a "CUDA Exception: Warp Misaligned Address" upon reading the shared memory, because the address is not aligned on 32 bits after the initial residual tile. I'm not sure if this is a specific case you want to support tho.
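A rough way to see which sizes hit this is sketched below. It is only a model under assumptions that are not stated in this thread: 16-bit elements, a 32-element tile along N, and 32-bit shared-memory accesses. The helper `misaligned_after_residual` is hypothetical and is not part of CUTLASS.

```cpp
// Hypothetical helper, NOT a CUTLASS API. Assumes (not confirmed in this
// thread) 2-byte elements, a 32-wide tile along N and 4-byte accesses.
// Returns true when the tile that follows the initial residual tile would
// start at a byte offset that is not a multiple of access_bytes.
inline bool misaligned_after_residual(int n,
                                      int tile_n = 32,
                                      int elem_bytes = 2,
                                      int access_bytes = 4) {
    int residual = n % tile_n;
    // A single tile, or no residual at all: nothing follows a partial tile.
    if (n <= tile_n || residual == 0) {
        return false;
    }
    // The next tile starts right after `residual` elements.
    return (residual * elem_bytes) % access_bytes != 0;
}
```

Under these assumptions the helper flags 33 and 35 but not 34, 36 or 64, which matches the condition reported above.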
We have no intention of supporting small alignments in b2b gemm. It is not a good idea to apply b2b to inputs that are not well aligned. Padding the matrices is easy and immediately brings good performance.
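The padding suggested here could look roughly like the host-side sketch below. It assumes a dense row-major matrix held in a `std::vector`; `round_up` and `pad_columns` are hypothetical helpers rather than CUTLASS functions, and the multiple to pad to (8, 32, ...) depends on the kernel configuration, which this thread does not specify.

```cpp
#include <cstddef>
#include <vector>

// Round n up to the next multiple of `multiple` (multiple must be > 0).
inline int round_up(int n, int multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// Hypothetical helper: copy a row-major rows x cols matrix into a
// rows x padded_cols buffer, zero-filling the added columns.
template <typename T>
std::vector<T> pad_columns(const std::vector<T>& src,
                           int rows, int cols, int padded_cols) {
    std::vector<T> dst(static_cast<std::size_t>(rows) * padded_cols, T(0));
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            dst[static_cast<std::size_t>(r) * padded_cols + c] =
                src[static_cast<std::size_t>(r) * cols + c];
        }
    }
    return dst;
}
```

For example, `round_up(33, 8)` gives 40, so a 33-column operand would be copied into a 40-column buffer. Since problem_size_0_n is also the K dimension of the second GEMM, the second operand needs matching zero rows; those zero rows make the padded part contribute nothing to the final product.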