lijingticy22
Two things should be fixed: 1. The `tile_to_shape` order should be `(1,0)`, meaning the 2nd mode, K, is the contiguous dimension. 2. `tma_partition` needs to use `cute.group_modes(sA, 0, 2)` for both sA and mA,...
A third change is needed to get past `mbarrier_wait`: change `if tidx == 0:` to `if tidx < 32:`. Internally, in the `cute.copy` implementation for tma_copy, we would have...
>Question: are there any docs on these things?

Sorry, we do not yet have documentation for `tma_partition`; we will work on it in upcoming releases. For your question, the smem...
>It seems like `sA_layout = cute.tile_to_shape(sw128_k_atom, (M, K), (1, 0))` and
>`sA_wrong = cute.tile_to_shape(sw128_k_atom, (M, K), (0, 1))`

This is because the contiguous dimension K in your case is exactly...
Looking at the `local_partition` function definition [here](https://github.com/NVIDIA/cutlass/blob/main/include/cute/tensor_impl.hpp#L1073), you will find that the index is used to produce a coordinate into the tile via `tile.get_flat_coord(index)`. In your case the tile is the layout (1,1):(0,0), which means you can...
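
As a rough illustration of how an index becomes a coordinate into a tile, here is a minimal pure-Python sketch. It is a simplified model, not the CuTe implementation: the helper `idx_to_flat_coord` is a hypothetical name, and it assumes a flat shape with the first mode varying fastest. Under that assumption, a tile of shape (1,1) maps every index to the same coordinate, while a (2,2) tile gives each index its own coordinate.

```python
def idx_to_flat_coord(index, shape):
    """Map a linear index to a flat coordinate, first mode fastest.

    Hypothetical helper for illustration only; a simplified model,
    not CuTe's actual get_flat_coord.
    """
    coord = []
    for extent in shape:
        # A mode of extent 1 can only ever hold coordinate 0.
        coord.append(index % extent)
        index //= extent
    return tuple(coord)


# Tile of shape (1,1): every thread index lands on the same coordinate.
coords_1x1 = [idx_to_flat_coord(tidx, (1, 1)) for tidx in range(4)]
print(coords_1x1)

# Tile of shape (2,2): each of the four indices gets a distinct coordinate.
coords_2x2 = [idx_to_flat_coord(tidx, (2, 2)) for tidx in range(4)]
print(coords_2x2)
```

In this model the thread index has no effect on the coordinate when every mode of the tile has extent 1.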