cutlass [FEA] transpose in epilogue/prologue

I'm using CUTLASS in my project and found that some cases constrain the layout, for example requiring input matrix A and output matrix C to be row-major. These assumptions make it hard to use the CUTLASS GEMM kernel directly in my project without extra transpose kernels. If CUTLASS could provide examples showing how to fuse the transpose into the epilogue and "prologue" phases, the overhead of the transpose kernels would be eliminated.

I did some tests to show the overhead of adding transpose kernels of an MxN matrix on A100-80G:

m = 8, n = 4096, latency = 8.40 us
m = 8, n = 1024, latency = 7.03 us
m = 8, n = 14336, latency = 7.12 us
m = 8, n = 4096, latency = 6.98 us
m = 32, n = 4096, latency = 6.97 us
m = 32, n = 1024, latency = 6.98 us
m = 32, n = 14336, latency = 8.26 us
m = 32, n = 4096, latency = 8.35 us
m = 256, n = 4096, latency = 15.65 us
m = 256, n = 1024, latency = 7.16 us
m = 256, n = 14336, latency = 46.49 us
m = 256, n = 4096, latency = 15.83 us
m = 512, n = 4096, latency = 28.82 us
m = 512, n = 1024, latency = 10.31 us
m = 512, n = 14336, latency = 92.73 us
m = 512, n = 4096, latency = 28.88 us

m = 4096, n = 8, latency = 7.07 us
m = 1024, n = 8, latency = 8.18 us
m = 14336, n = 8, latency = 7.52 us
m = 4096, n = 8, latency = 8.34 us
m = 4096, n = 32, latency = 8.35 us
m = 1024, n = 32, latency = 7.83 us
m = 14336, n = 32, latency = 9.65 us
m = 4096, n = 32, latency = 8.34 us
m = 4096, n = 256, latency = 15.90 us
m = 1024, n = 256, latency = 8.18 us
m = 14336, n = 256, latency = 46.24 us
m = 4096, n = 256, latency = 15.90 us
m = 4096, n = 512, latency = 26.69 us
m = 1024, n = 512, latency = 10.29 us
m = 14336, n = 512, latency = 88.09 us
m = 4096, n = 512, latency = 26.65 us

The numbers show that when both dimensions are large, e.g. 14336 x 512, the transpose kernel alone takes up to about 90 us, which hurts the end-to-end performance of a model.
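To put the latencies above in perspective, here is a rough sketch that converts a measurement into effective memory bandwidth. It assumes 2-byte elements (fp16, the data type mentioned later in this thread) and an out-of-place transpose that reads and writes each element exactly once. The helper names `transpose_bytes` and `effective_gbps` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved by an out-of-place transpose: each element is read once
// and written once. elem_bytes = 2 is an assumption (fp16).
double transpose_bytes(std::size_t m, std::size_t n, std::size_t elem_bytes) {
  return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
         static_cast<double>(elem_bytes);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a latency given in
// microseconds.
double effective_gbps(std::size_t m, std::size_t n, std::size_t elem_bytes,
                      double latency_us) {
  assert(latency_us > 0.0);
  return transpose_bytes(m, n, elem_bytes) / (latency_us * 1e-6) / 1e9;
}
```

Under these assumptions, `effective_gbps(512, 14336, 2, 92.73)` comes out at roughly 317 GB/s, while `effective_gbps(8, 4096, 2, 8.40)` is only about 16 GB/s. That would be consistent with the small cases being dominated by a fixed cost of around 7 us per kernel launch rather than by data movement, though the measurements here are not enough to confirm it.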

Sep 04 '24 09:09 xiaonans

Are you using the 2.x or 3.x API? In 3.x you should be able to set your epilogue stride to whatever you want, and it should just work.
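As a rough illustration of why choosing the output stride can replace a separate transpose, here is a plain C++ reference GEMM (not CUTLASS code) whose store address is computed from two explicit strides. The function `gemm_strided_store` and its parameters are hypothetical and only model the idea: the same accumulators land in row-major or column-major order depending on the strides, with no extra pass over the data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not a CUTLASS API.
// Computes D = A * B with A (m x k) and B (k x n) both row-major, and
// stores element D(i, j) at d[i * stride_m + j * stride_n].
//   row-major output:    stride_m = n, stride_n = 1
//   column-major output: stride_m = 1, stride_n = m
std::vector<float> gemm_strided_store(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      std::size_t m, std::size_t n,
                                      std::size_t k, std::size_t stride_m,
                                      std::size_t stride_n) {
  assert(a.size() == m * k && b.size() == k * n);
  std::vector<float> d(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        acc += a[i * k + p] * b[p * n + j];
      }
      // The "epilogue": only the address computation depends on the layout.
      d[i * stride_m + j * stride_n] = acc;
    }
  }
  return d;
}
```

For a 2 x 3 output, calling it with strides `(3, 1)` yields the row-major buffer and with strides `(1, 2)` the column-major buffer of the same matrix. How a real kernel keeps such stores coalesced is a separate concern that this sketch does not address.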

Sep 04 '24 14:09 thakkarV

> Are you using the 2.x or 3.x API? In 3.x you should be able to set your epilogue stride to whatever you want, and it should just work.

Thanks for your suggestion. I'm using cutlass 3.x.

I looked for epilogue APIs with "stride" in https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/epilogue.h, but I did not find any parameters related to stride. Could you please share more detailed hints on how to set the epilogue stride?

In that epilogue.h file, the MMA results are first stored to shared memory and then loaded into registers before being stored to global memory. Modifying the iterators used in epilogue.h would be a huge amount of work, and I do not think that is the right approach.

Sep 06 '24 10:09 xiaonans

I'm confused. The file you point to is not a 3.x API citizen. Maybe @hwu36 can help?

Sep 06 '24 16:09 thakkarV

What are your data type and hardware? If fp16 or bf16, A can be any layout on Ampere.

Sep 06 '24 17:09 hwu36

> What are your data type and hardware? If fp16 or bf16, A can be any layout on Ampere.

My data type is fp16, and hardware is A100-80G.

I want to scatter the output along columns, as described in this issue. After that, I need to feed the output to PyTorch, where the tensor is assumed to be row-major.

If I add a transpose kernel to change the order from column-major to row-major before feeding the tensor into PyTorch, the overhead will hurt the end-to-end performance of a model.

Is there any way to add a transpose in the epilogue, so that I can scatter on columns and then convert the tensor to row-major order in a single kernel?

Sep 09 '24 02:09 xiaonans

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Oct 09 '24 14:10 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Jan 07 '25 14:01 github-actions[bot]