[QST] how to use tma to store strided subtiles?

Open mammoth831 opened this issue 5 months ago • 2 comments

What is your question?

Suppose we are computing a 4x4 result tensor but only have a 2x4 smem buffer available.

When the results are computed by different warp groups, e.g. the first two rows of the results come from warpgroup0 while the last two rows come from warpgroup1, do we have to store them in two passes?

In the first pass, each warp group writes half of its results to the smem and the first thread of each warp group issues a 1x4 tma store. In the second pass, each warp group writes its remaining half to the smem and again issues a 1x4 tma store (a rough code sketch of this two-pass pattern follows the diagrams below).

// 4x4 gtensor

       0    1    2    3
    +----+----+----+----+
 0  |  0 |  1 |  2 |  3 |
    +----+----+----+----+
 1  |  4 |  5 |  6 |  7 |
    +----+----+----+----+
 2  |  8 |  9 | 10 | 11 |
    +----+----+----+----+
 3  | 12 | 13 | 14 | 15 |
    +----+----+----+----+

// 2x4 stensor - first time store

       0    1    2    3
    +----+----+----+----+
 0  |  0 |  1 |  2 |  3 |   --> write to the 1st row of the gtensor
    +----+----+----+----+
 1  |  8 |  9 | 10 | 11 |   --> write to the 3rd row of the gtensor
    +----+----+----+----+
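
For concreteness, the row mapping above is g_row = 2 * warpgroup + pass. Below is a minimal device-side sketch of the two-pass pattern, assuming cute's tma_store_fence/arrive/wait for the store pipeline; issue_1x4_tma_store is a hypothetical placeholder for the actual TMA call, not a real API:

// Hedged sketch of the two-pass store; `issue_1x4_tma_store` is a
// hypothetical placeholder, not a cute/CUTLASS function.
int wg = threadIdx.x / 128;                // warpgroup index: 0 or 1
for (int pass = 0; pass < 2; ++pass) {
  int g_row = 2 * wg + pass;               // destination row in the 4x4 gtensor
  // ... each warpgroup writes its `pass`-th result row into smem row `wg` ...
  cute::tma_store_fence();                 // make the smem writes visible to the TMA unit
  if (threadIdx.x % 128 == 0) {
    issue_1x4_tma_store(/*smem row*/ wg, /*gmem row*/ g_row);
  }
  cute::tma_store_arrive();                // commit the store
  cute::tma_store_wait<0>();               // wait before the smem rows are reused
  __syncthreads();
}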

I tried to construct a suitable TMA copy, but there seems to be no API that expresses this directly. How can this situation be handled with TMA?

#include <cute/tensor.hpp>
#include <cute/atom/copy_atom.hpp>

#include <vector>

using namespace cute;


int main() {
  using CopyOp = SM90_TMA_STORE;
  std::vector<float> d(16);  // stands in for the 4x4 gmem result buffer
  std::vector<float> s(8);   // stands in for the 2x4 smem buffer
  auto gtensor = make_tensor(d.data(), make_layout(make_shape(4, 4), make_stride(4, _1{})));  // 4x4 results
  auto slayout = make_layout(make_shape(_1{}, _4{}), GenRowMajor{});                          // 1x4 tma tile
  auto stensor = make_tensor(s.data(), make_layout(make_shape(_2{}, _4{}), GenRowMajor{}));   // 2x4 smem

  // Setting cluster size = 2 extends the tma thread layout, but that argument is meant
  // for multicast and leads to wrong slices and offsets when partitioning...
  auto tma = make_tma_copy(CopyOp{}, gtensor, slayout, Int<2>{});
  

  // How to get a tma tiled copy which can partition stensor and gtensor and do the copy
  // like this? (Device-side pseudocode; `.with(bar)` dropped because SM90_TMA_STORE
  // takes no mbarrier, unlike the TMA loads.)
  auto slice_tma  = tma.get_slice(threadIdx.x / 128);     // one slice per warpgroup
  auto tma_tensor = tma.get_tma_tensor(shape(gtensor));   // coordinate tensor over gmem
  copy(tma, slice_tma.partition_S(stensor)(_, 0), slice_tma.partition_D(tma_tensor)(_, 0));
  copy(tma, slice_tma.partition_S(stensor)(_, 1), slice_tma.partition_D(tma_tensor)(_, 1));
}
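
For reference, the closest I could get is to build the copy from the 1x4 box alone (no cluster trick) and pick the gmem row through the TMA coordinate tensor at copy time. This is only a sketch of what I am after, not working code: it would have to run inside a kernel with stensor backed by real smem, `pass` comes from the two-pass loop sketched above, and I am not sure the mode counts of the partitioned tensors are exactly as annotated.

// Hedged device-side sketch: one TMA copy built from the 1x4 box,
// issued at a different gmem row coordinate in each pass.
auto tma = make_tma_copy(SM90_TMA_STORE{}, gtensor, slayout);  // 1x4 box, cluster size 1

auto mD = tma.get_tma_tensor(shape(gtensor));       // (4,4) TMA coordinate tensor
auto gD = flat_divide(mD, Shape<_1, _4>{});         // (_1,_4,4,1): one 1x4 tile per gmem row
auto sD = flat_divide(stensor, Shape<_1, _4>{});    // (_1,_4,2,1): one 1x4 tile per smem row

auto cta_tma = tma.get_slice(Int<0>{});             // single-CTA slice, no multicast
auto tDsD = cta_tma.partition_S(sD);                // smem-side fragments, one per smem row
auto tDgD = cta_tma.partition_D(gD);                // gmem-side fragments, one per gmem row

int wg    = threadIdx.x / 128;                      // warpgroup index: 0 or 1
int g_row = 2 * wg + pass;                          // row mapping from the diagrams above
if (threadIdx.x % 128 == 0) {
  copy(tma, tDsD(_,_,_, wg, 0), tDgD(_,_,_, g_row, 0));  // smem row wg -> gmem row g_row
}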

mammoth831 avatar Sep 22 '24 15:09 mammoth831