cutlass b2b gemm residual

b2b gemm residual

Open hwu36 opened this issue 1 year ago • 1 comments

Add residual support for the shmem staging iterator used in back-to-back GEMM fusion. This allows support for a problem_size_0_n that is not a multiple of 32.

@danthe3rd , would you please give it a try?

Aug 02 '22 04:08 hwu36

Looking quickly through the code, it looks great! Thanks a lot for putting that up so fast :) Unfortunately, I'll be away for the entire month of August. I can give it a try in September, and I'll also try to find someone else to do it in my absence, but in the meantime feel free to merge. If it passes the tests for values of problemSize1.K like [16, 32, 34, 36, 48, 64, 80], it should work for my use case. Once again, thanks a lot :)

Aug 02 '22 19:08 danthe3rd

So I gave it a shot and it's working great! Thanks for putting this together :) There is just one issue when problem_size_0_n > 32 && problem_size_0_n % 2 == 1 (for instance problem_size_0_n = 33): it triggers a "CUDA Exception: Warp Misaligned Address" when reading shared memory, because the data is no longer aligned to 32 bits after the initial residual tile. I'm not sure if this is a specific case you want to support, though.

Aug 14 '22 19:08 danthe3rd
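
For readers following along, here is a minimal sketch of why an odd size can break 32-bit alignment. It assumes 16-bit elements stored row-major with a leading dimension equal to problem_size_0_n; the thread does not state the element type or the actual shared-memory layout, so both are assumptions, and `row_start_is_aligned` is a hypothetical helper, not a CUTLASS function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of CUTLASS): returns true when the first
// element of `row` starts at a byte offset that is a multiple of
// `align_bytes`, for a row-major buffer with `leading_dim` elements per row.
// Assumes the buffer base itself is suitably aligned.
bool row_start_is_aligned(std::size_t row, std::size_t leading_dim,
                          std::size_t elem_bytes, std::size_t align_bytes) {
  assert(align_bytes > 0);
  const std::size_t byte_offset = row * leading_dim * elem_bytes;
  return byte_offset % align_bytes == 0;
}
```

Under these assumptions, a leading dimension of 33 puts row 1 at byte offset 66, which is not a multiple of 4, while a leading dimension of 34 puts it at byte offset 68, which is. That is consistent with even sizes such as 34 and 36 working while 33 faults.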

We have no intention of supporting small alignments in b2b GEMM. It is not a good idea to apply b2b to poorly aligned inputs. Padding the matrices is easy and can immediately bring good performance.

Aug 15 '22 02:08 hwu36
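
The padding suggested above could look roughly like the following host-side sketch. It assumes row-major storage, and the target multiple is whatever alignment the kernel requires (the thread does not give a value). `round_up` and `pad_columns` are hypothetical helper names, not CUTLASS APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round n up to the next multiple of `multiple` (hypothetical helper).
int round_up(int n, int multiple) {
  assert(multiple > 0 && n >= 0);
  return ((n + multiple - 1) / multiple) * multiple;
}

// Copy a row-major rows x cols matrix into a new buffer with `padded_cols`
// columns per row, filling the extra columns with zeros.
template <typename T>
std::vector<T> pad_columns(const std::vector<T>& src, int rows, int cols,
                           int padded_cols) {
  assert(padded_cols >= cols);
  assert(src.size() == static_cast<std::size_t>(rows) * cols);
  std::vector<T> dst(static_cast<std::size_t>(rows) * padded_cols, T(0));
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      dst[static_cast<std::size_t>(r) * padded_cols + c] =
          src[static_cast<std::size_t>(r) * cols + c];
    }
  }
  return dst;
}
```

For example, `round_up(33, 8)` gives 40, so a matrix with 33 columns would be copied into a buffer with 40 columns per row. In a back-to-back GEMM the padded dimension is shared by both GEMMs (N of the first, K of the second), so the second GEMM's operand would need matching zero rows for the padding to leave the result unchanged.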
