cutlass icon indicating copy to clipboard operation
cutlass copied to clipboard

[QST] Permute in K mode for consistent LDSM results

Open capybara-club opened this issue 10 months ago • 1 comments

If I run a cute layout in TN with DefaultCopy as the s2r atom, I get the exact same results as the ampere tf32 cutlass kernel. I verified that the cutlass kernel is using ldmatrix in the ptx. However, if I run a cute layout with LDSM as the s2r atom then the results differ (just enough to be explainable by fp error). I think the cutlass kernel must be doing a permutation in K to maintain a consistent result that exactly matches. Is this the extent of the difference? Is there an easy way to calculate what the permutation should be?

Thanks!

capybara-club avatar Feb 26 '25 19:02 capybara-club

@Junkai-Wu

hwu36 avatar Mar 04 '25 03:03 hwu36

I'm not quite sure what's your issue from the your description.

I think the cutlass kernel must be doing a permutation in K to maintain a consistent result that exactly matches Yes, cutlass TN kernel will permute in K dimension in loading data from global memory to shared memory and do the ldsm according to the permutation pattern. You can check detailed permutation pattern here (for K major 128 bit continuous): https://github.com/NVIDIA/cutlass/blob/main/include/cute/atom/mma_traits_sm90_gmma.hpp#L84

This is the shared memory swizzle layout. Composed with your thread layout, It will map the corresponding global memory position to shared memory position. If you are interested in swizzle details, you can check the implementation for struct Swizzle in cute.

Junkai-Wu avatar Mar 14 '25 09:03 Junkai-Wu

@Junkai-Wu

Thanks! I think you answered my question. Just wanted to hear that the permutation was enough to explain how the cutlass kernel maintained a consistent fp error. (my version without ldsm matching exactly cutlass yet cutlass using ldsm)

capybara-club avatar Mar 14 '25 09:03 capybara-club