cutlass [QST] Permute in K mode for consistent LDSM results

If I run a cute layout in TN with DefaultCopy as the s2r atom, I get the exact same results as the ampere tf32 cutlass kernel. I verified that the cutlass kernel is using ldmatrix in the ptx. However, if I run a cute layout with LDSM as the s2r atom then the results differ (just enough to be explainable by fp error). I think the cutlass kernel must be doing a permutation in K to maintain a consistent result that exactly matches. Is this the extent of the difference? Is there an easy way to calculate what the permutation should be?

Thanks!

Feb 26 '25 19:02 capybara-club

@Junkai-Wu

Mar 04 '25 03:03 hwu36

I'm not quite sure what's your issue from the your description.

I think the cutlass kernel must be doing a permutation in K to maintain a consistent result that exactly matches Yes, cutlass TN kernel will permute in K dimension in loading data from global memory to shared memory and do the ldsm according to the permutation pattern. You can check detailed permutation pattern here (for K major 128 bit continuous): https://github.com/NVIDIA/cutlass/blob/main/include/cute/atom/mma_traits_sm90_gmma.hpp#L84

This is the shared memory swizzle layout. Composed with your thread layout, It will map the corresponding global memory position to shared memory position. If you are interested in swizzle details, you can check the implementation for struct Swizzle in cute.

Mar 14 '25 09:03 Junkai-Wu

@Junkai-Wu

Thanks! I think you answered my question. Just wanted to hear that the permutation was enough to explain how the cutlass kernel maintained a consistent fp error. (my version without ldsm matching exactly cutlass yet cutlass using ldsm)

Mar 14 '25 09:03 capybara-club