[BUG] make_tiled_copy should not assume 2d data, Thr and Val layouts
Describe the bug
make_tiled_copy should also not secretly pad Thr and Val. See the code sample and discussion below.
Steps/Code to reproduce bug
#include <cute/tensor.hpp>

#include <vector>

using namespace cute;

int main() {
  std::vector<int> tensor_buffer(12);
  for (int i = 0; i < 12; i++) {
    tensor_buffer[i] = i;
  }
  {
    print("====================================== case 1 ======================================\n");
    auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_12{})));
    Layout<Shape<_2>, Stride<_1>> thr;
    Layout<Shape<_2>, Stride<_1>> val;
    auto tiled_copy = make_tiled_copy(Copy_Atom<DefaultCopy, float>{}, thr, val);
    auto thr_copy = tiled_copy.get_slice(1);
    // auto view = thr_copy.partition_S(t); // static_assert(R >= rank_v<TiledShape_MN>, "Rank of tensor to be partitioned too small.");
    // print_tensor(view);
  }
  {
    print("====================================== case 2 ======================================\n");
    auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_1{}, _12{})));
    Layout<Shape<_1, _2>, Stride<_2, _1>> thr;
    Layout<Shape<_1, _2>, Stride<_2, _1>> val;
    auto tiled_copy = make_tiled_copy(Copy_Atom<DefaultCopy, float>{}, thr, val);
    auto thr_copy = tiled_copy.get_slice(1);
    auto view = thr_copy.partition_S(t);
    print_tensor(t);
    // ptr[32b](0x55adbfc83eb0) o (_1,_12):(_0,_1):
    //   0 1 2 3 4 5 6 7 8 9 10 11
    print_tensor(view);
    // ptr[32b](0x55adbfc83eb8) o ((_1,_2),_1,_3):((_0,_1),_0,_4):
    // ptr[32b](0x55adbfc83eb8) o ((_1,_2),_1):((_0,_1),_0):
    //   2
    //   3
    // -----
    // ptr[32b](0x55adbfc83ec8) o ((_1,_2),_1):((_0,_1),_0):
    //   6
    //   7
    // -----
    // ptr[32b](0x55adbfc83ed8) o ((_1,_2),_1):((_0,_1),_0):
    //   10
    //   11
  }
  {
    print("====================================== case 3 ======================================\n");
    auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_12{}, _1{})));
    Layout<Shape<_2>, Stride<_1>> thr;
    Layout<Shape<_2>, Stride<_1>> val;
    auto tiled_copy = make_tiled_copy(Copy_Atom<DefaultCopy, int>{}, thr, val);
    auto thr_copy = tiled_copy.get_slice(1);
    auto view = thr_copy.partition_S(t);
    print(append<2>(thr, Layout<_1>{}));
    // (_2,_1):(_1,_0)
    print_tensor(t);
    // ptr[32b](0x556b715e0eb0) o (_12,_1):(_1,_0):
    //   0
    //   1
    //   2
    //   3
    //   4
    //   5
    //   6
    //   7
    //   8
    //   9
    //   10
    //   11
    print_tensor(view);
    // ptr[32b](0x556b715e0eb0) o ((_1,_2),_6,_1):((_0,_1),_2,_0):
    // ptr[32b](0x556b715e0eb0) o ((_1,_2),_6):((_0,_1),_2):
    //   0 2 4 6 8 10
    //   1 3 5 7 9 11
  }
  {
    print("====================================== case 4 ======================================\n");
    auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_12{}, _1{})));
    Layout<Shape<_2, _1>, Stride<_1, _2>> thr;
    Layout<Shape<_2, _1>, Stride<_1, _2>> val;
    auto tiled_copy = make_tiled_copy(Copy_Atom<DefaultCopy, float>{}, thr, val);
    auto thr_copy = tiled_copy.get_slice(1);
    auto view = thr_copy.partition_S(t);
    print_tensor(t);
    // same as case 3
    print_tensor(view);
    // ptr[32b](0x55b4e4864eb8) o ((_1,_2),_3,_1):((_0,_1),_4,_0):
    // ptr[32b](0x55b4e4864eb8) o ((_1,_2),_3):((_0,_1),_4):
    //   2 6 10
    //   3 7 11
  }
  return 0;
}
Expected behavior
Note that the tensor is 1-D (for whatever reason), and threads access the data in an interleaved style, over multiple iterations.
case 1
Case 1 is a perfect use case of "compose then slice": a 1-D Tensor, 1-D Thr, and 1-D Val, all along the same direction. The hidden append in the implementation prevents make_tiled_copy from being useful in this case; a small demonstration follows the snippet below.
auto thr_layout_mn = append<R>(thr_layout, Layout<_1>{});
auto val_layout_mn = append<R>(val_layout, Layout<_1>{});
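For concreteness, a minimal sketch (my own) of what that hidden padding does to the rank-1 Thr layout from case 1; per case 3, the appended result prints as (_2,_1):(_1,_0):
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  Layout<Shape<_2>, Stride<_1>> thr; // the rank-1 Thr layout from case 1
  // The hidden append pads it with a size-1 mode, making it rank-2;
  // case 3 shows this prints as (_2,_1):(_1,_0).
  print(append<2>(thr, Layout<_1>{})); print("\n");
  return 0;
}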
case 2
Case 2 is what we want, and it works as expected. It works because we make everything explicit and correctly express the ordering.
case 3
Case 3 is extremely weird and confusing due to the append<R>: it produces (_2,_1):(_1,_0) from _2:_1, which yields an incomprehensible iteration mode. This should not happen...
case 4
Case 4 is what case 3 should have been; it is a manually fixed version of case 3.
Environment details: CUTLASS commit a75b4ac483166189a45290783cb0a18af5ff0ea5
@ccecka
In summary, the secret behaviour limits the usage of make_tiled_copy; in some edge cases it produces incomprehensible and unexpected "incorrect" results.
The problem appears to be with blocked_product and raked_product rather than the append.
raked_product(Layout<Shape<_2>>{}, Layout<Shape<_2>>{})
and
raked_product(Layout<_2>{}, Layout<_2>{})
should be explicitly rank-1 rather than rank-2 (and implicitly intended to be interpreted as 1-D). Both blocked_product and raked_product are rather unique because they are rank-sensitive, and the rank of the result should be equal to the maximum of the ranks of the inputs. That would fix the confusing cases.
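A minimal sketch (my own) to check the ranks in question; what it prints depends on the CUTLASS version, since this postcondition was tightened by the fix mentioned at the end of this thread:
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Per the postcondition above, both results should be rank-1;
  // before the fix they came out padded to rank-2.
  print(raked_product(Layout<Shape<_2>>{}, Layout<Shape<_2>>{})); print("\n");
  print(raked_product(Layout<_2>{}, Layout<_2>{})); print("\n");
  print(rank(raked_product(Layout<_2>{}, Layout<_2>{}))); print("\n");
  return 0;
}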
I'll include this in my documentation+pre/postconditions update coming shortly.
You can use
print(tiled_copy);
to get more information and potentially use make_tiled_copy_impl to specify the Tiler and TV_Layout directly.
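For instance, a hedged sketch (my own construction, assuming make_tiled_copy_impl takes the atom, the TV layout, and the tiler, in that order) of building the case-1 TiledCopy directly:
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // 2 threads x 2 values, raked across a 4-element tile.
  auto thr = Layout<Shape<_2>, Stride<_1>>{};
  auto val = Layout<Shape<_2>, Stride<_1>>{};
  auto layout_mn = raked_product(thr, val);  // (m) -> (thr_idx,val_idx)
  auto layout_tv = right_inverse(layout_mn)
                       .with_shape(make_shape(size(thr), size(val)));
  // Specify the Tiler and TV_Layout directly, bypassing the padding
  // performed by make_tiled_copy.
  auto tiled_copy = make_tiled_copy_impl(Copy_Atom<DefaultCopy, float>{},
                                         layout_tv,
                                         make_shape(size(layout_mn)));
  print(tiled_copy);
  return 0;
}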
I happened to create my own utility function:
#include "cute/tensor.hpp"
using namespace cute;
template <class ThrLayout, class ValLayout, class Tensor>
auto make_tv_view(Tensor&& tensor, const ThrLayout& thr_layout = {}, const ValLayout& val_layout = {}) {
auto layout = raked_product(thr_layout, val_layout);
auto with_iter = zipped_divide(tensor.layout(), layout);
auto layout_per_iter = get<0>(with_iter);
auto iter_layout = get<1>(with_iter);
auto layout_tv_iter = make_layout(
right_inverse(layout_per_iter).with_shape(make_shape(size(thr_layout), size(val_layout))),
iter_layout
);
return std::forward<Tensor>(tensor).compose(layout_tv_iter);
}
int main() {
std::vector<int> tensor_buffer(12);
for (int i = 0; i < 12; i++) {
tensor_buffer[i] = i;
}
{
print("====================================== case 5 ======================================\n");
// auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_4{}, _3{})));
auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_12{})));
Layout<Shape<_2>, Stride<_1>> thr;
Layout<Shape<_2>, Stride<_1>> val;
print_tensor(make_tv_view(t, thr, val)(make_coord(0, _), _));
}
}
CuTe should somehow provide a utility function like this.
If it works for you, that's great. I'll take a closer look soon.
Until then, it appears it can be simplified to
template <class ThrLayout, class ValLayout, class Tensor>
auto make_tv_view(Tensor&& tensor,
                  ThrLayout const& thr_layout = {},  // (m,n,...) -> thr_idx
                  ValLayout const& val_layout = {})  // (m,n,...) -> val_idx
{
  auto layout_mn = raked_product(thr_layout, val_layout);
  auto layout_tv = right_inverse(layout_mn).with_shape(make_shape(size(thr_layout), size(val_layout)));
  return zipped_divide(tensor, layout_mn).compose(layout_tv, _);
}
Great to know that I can use underscore in compose.
EDIT: removed unnecessary compose
@ccecka The .compose(layout_tv, _) does not work for the 3-D case. More work for you now =).
#include <cute/tensor.hpp>

#include <utility>

using namespace cute;

template <typename ThrLayout, typename ValLayout, typename Layout>
__host__ __device__ constexpr auto make_tv_layout(
    const Layout& target,              // the target layout we want to compose with
    const ThrLayout& thr_layout = {},  // (t0,...) -> thr_idx
    const ValLayout& val_layout = {}   // (v0,...) -> val_idx
) {
  auto layout = raked_product(thr_layout, val_layout);
  auto with_iter = zipped_divide(target, layout);
  auto layout_per_iter = get<0>(with_iter);
  auto iter_layout = get<1>(with_iter);
  auto layout_tv_iter = make_layout(
      right_inverse(layout_per_iter).with_shape(make_shape(size(thr_layout), size(val_layout))),
      iter_layout);
  return layout_tv_iter;
}

template <typename ThrLayout, typename ValLayout, typename Layout>
__host__ __device__ constexpr auto make_tv_layout_ccecka(
    const Layout& target,              // the target layout we want to compose with
    const ThrLayout& thr_layout = {},  // (t0,...) -> thr_idx
    const ValLayout& val_layout = {}   // (v0,...) -> val_idx
) {
  auto layout = raked_product(thr_layout, val_layout);  // (t0*v0,t1*v1,...) -> (thr_idx,val_idx)
  auto layout_tv = right_inverse(layout).with_shape(make_shape(size(thr_layout), size(val_layout)));
  return zipped_divide(target, layout).compose(layout_tv, _);
}

int main() {
  constexpr int XSize = 16;
  constexpr int YSize = 64;
  int zsize = 32;  // each warp processes a separate z idx; zsize is a runtime value

  // The plan
  constexpr int XVec = 4;
  constexpr int YVec = 8;
  constexpr int NumXVec = 4;
  constexpr int NumYVec = 8;

  // make_tile causes a per-mode product
  auto layout = make_layout(make_shape(make_shape(make_shape(Int<XSize>{}, Int<YSize>{})), zsize));
  auto tcoord = make_identity_tensor(layout.shape());

  auto [layout_tvi_mine, layout_tvi_ccecka] = [&]() {
    constexpr int NumThreads = 128;
    constexpr int WarpSize = 32;
    static_assert(NumXVec * NumYVec == WarpSize);
    constexpr int NumWarps = NumThreads / WarpSize;
    constexpr auto thr_layout = make_layout(make_shape(make_shape(Int<NumXVec>{}, Int<NumYVec>{}), Int<NumWarps>{}));
    constexpr auto val_layout = make_layout(make_shape(make_shape(Int<XVec>{}, Int<YVec>{}), _1{}));
    // X  Y  Z
    // 16 64 32  target tensor
    // 4  8  4   thr_layout
    // 4  8  1   val_layout
    // 16 64 4   raked_product(thr_layout, val_layout)
    // 1  1  8   iter
    print("thr_layout:"), print(thr_layout), print("\n");
    print("val_layout:"), print(val_layout), print("\n");
    return std::make_pair(make_tv_layout(layout, thr_layout, val_layout),
                          make_tv_layout_ccecka(layout, thr_layout, val_layout));
  }();

  // Let's see the mapped coords of thread 0
  auto t0_mine = tcoord.compose(layout_tvi_mine)(make_coord(0, _), _);
  auto t0_ccecka = tcoord.compose(layout_tvi_ccecka)(make_coord(0, _), _);
  print("size (mine): "), print(size(t0_mine)), print("size (ccecka): "), print(size(t0_ccecka)), print("\n");
  for (int i = 0; i < size(t0_mine); i++) {
    print(i), print("\tcoord (mine): "), print(t0_mine(i)), print("\tcoord (ccecka): "), print(t0_ccecka(i)), print("\n");
  }
  // size (mine): 256size (ccecka): 256
  // 0   coord (mine): (((0,0)),0)    coord (ccecka): (((0,0)),0)
  // 1   coord (mine): (((1,0)),0)    coord (ccecka): (((0,8)),0)
  // 2   coord (mine): (((2,0)),0)    coord (ccecka): (((0,16)),0)
  // 3   coord (mine): (((3,0)),0)    coord (ccecka): (((0,24)),0)
  // 4   coord (mine): (((4,0)),0)    coord (ccecka): (((0,32)),0)
  // 5   coord (mine): (((5,0)),0)    coord (ccecka): (((0,40)),0)
  // 6   coord (mine): (((6,0)),0)    coord (ccecka): (((0,48)),0)
  // 7   coord (mine): (((7,0)),0)    coord (ccecka): (((0,56)),0)
  // 8   coord (mine): (((8,0)),0)    coord (ccecka): (((0,0)),1)
  // 9   coord (mine): (((9,0)),0)    coord (ccecka): (((0,8)),1)
  // 10  coord (mine): (((10,0)),0)   coord (ccecka): (((0,16)),1)
  // 11  coord (mine): (((11,0)),0)   coord (ccecka): (((0,24)),1)
  // 12  coord (mine): (((12,0)),0)   coord (ccecka): (((0,32)),1)
  // 13  coord (mine): (((13,0)),0)   coord (ccecka): (((0,40)),1)
  // 14  coord (mine): (((14,0)),0)   coord (ccecka): (((0,48)),1)
  // 15  coord (mine): (((15,0)),0)   coord (ccecka): (((0,56)),1)
  return 0;
}
I found this because I applied the simplification, and it works pretty well for the 1-D and 2-D cases. For 3-D, my sliced tensor is pretty weird, and I scratched my head for a whole day before I finally decided to try the old version of the TV layout maker.
I don't know what you mean by "works". Neither approach appears to be giving you what you actually intend. This is why I don't recommend attempting to reinvent partitioning.
The difference between the two is that "yours" is composing with
auto layout_tv_iter = make_layout(
    right_inverse(layout_per_iter).with_shape(make_shape(size(thr_layout), size(val_layout))),
    iter_layout);
which is a Layout, and "mine" is composing with
make_tile(layout_tv, _)
which is a Tile and applied by-mode.
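To make that distinction concrete, a toy sketch (mine, with made-up sizes) composing the same layout with a Tile versus with a plain Layout:
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto A = make_layout(make_shape(_4{}, _8{})); // (_4,_8):(_1,_4)
  // Tile: applied by-mode. Layout<_2,_2> transforms mode 0 only; _ keeps mode 1.
  auto by_tile = A.compose(Layout<_2, _2>{}, _);
  // Layout: applied to A's flattened 1-D domain as a whole.
  auto by_layout = A.compose(Layout<_2, _2>{});
  print(by_tile); print("\n");   // mode 0 is strided, mode 1 untouched
  print(by_layout); print("\n"); // a rank-1 selection of 2 elements
  return 0;
}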
The other problems that are going to prevent this from generalizing are
- raked_product is rank-sensitive and not applicable hierarchically. raked_product( ((X,Y),Z), ((A,B),C) ) computes the "product" of (X,Y) and (A,B), not of X and A and then Y and B.
- By applying zipped_divide to the raked_product, you're asking for the composition of incompatible objects. The LHS is a tensor with logical coordinates and the RHS is a layout mapping something to (thr_idx,val_idx). The (thr_idx,val_idx) are not intended to be logical coordinates into the tensor, so the composition doesn't make sense.
- The iter_layout being composed with the original tensor doesn't make much sense either.
For these kinds of partitions, you have to ask yourself what you actually care about
- Do you care about the value assignment? It appears you do, but only for vectorization reasons.
- Do you actually care about the thread assignment? It appears you do not.
- Are these partitioning patterns specified by an MMA_Op or a Copy_Op?
- Are your tensors static or dynamic or both?
- Do you need to apply the same partitioning pattern to multiple tensors?
The pattern that CuTe uses for partitioning is a Tiler ∘ LayoutTV operator, which you've seen in TiledMMA and TiledCopy, and you can find a more naked version in partition.hpp.
#include <cute/tensor.hpp>

using namespace cute;

template <class Layout, class Tiler, class LayoutTV>
__host__ __device__ constexpr auto
tv_tiling(Layout const& target,       // the target layout we want to compose with
          Tiler const& tiler,         // the tiler to extract out relevant elements
          LayoutTV const& layout_tv)  // the TV transform to apply to the tile of data
{
  return zipped_divide(target, tiler).compose(layout_tv, _);
}

int main() {
  auto layout = make_layout(make_shape(make_shape(make_shape(Int<16>{}, Int<64>{})), 32));
  auto tcoord = make_identity_tensor(layout.shape());

  // Select out of the target the tile of data we want to work on
  auto tiler = Shape<Shape<Shape<_16,_64>>, _4>{};  // (M,N,L) -> tensor coords

  // Transform the tile of data to a TV format
  auto thr_layout = Layout<Shape<_4,_8,_4>>{};      // (m,n,l) -> thr_idx
  auto val_layout = Layout<Shape<_4,_8,_1>>{};      // (m,n,l) -> val_idx
  // (M,N,L) -> (thr_idx,val_idx)
  auto thrval_layout = raked_product(thr_layout, val_layout);
  // (thr_idx,val_idx) -> (M,N,L)
  auto layout_tv = right_inverse(thrval_layout).with_shape(size(thr_layout), size(val_layout));
  // We've constructed the layout_tv to index into the coords of the tiler
  CUTE_STATIC_ASSERT_V(cosize(layout_tv) == size(tiler));

  auto result = tv_tiling(tcoord, tiler, layout_tv);  // ((thr,val), iter)

  // Let's see the mapped coords of thread 0
  auto t0_result = result(make_coord(0, _), _);
  for (int i = 0; i < size(t0_result); ++i) {
    print(i), print("\t"), print(t0_result(i)), print("\n");
  }
  // 0   (((0,0)),0)   // Each thread gets a (((4,8)),1) "block" of data 8 times in the target
  // 1   (((1,0)),0)
  // 2   (((2,0)),0)
  // 3   (((3,0)),0)
  // 4   (((0,1)),0)
  // 5   (((1,1)),0)
  // 6   (((2,1)),0)
  // 7   (((3,1)),0)
  // 8   (((0,2)),0)
  // 9   (((1,2)),0)
  // 10  (((2,2)),0)
  // 11  (((3,2)),0)
  // 12  (((0,3)),0)
  // 13  (((1,3)),0)
  // 14  (((2,3)),0)
  // 15  (((3,3)),0)
  // ...
}
This is exactly the interface of make_tiled_copy_impl that I previously recommended. My upcoming documentation update includes sections on Tilers and products.
I'm starting to see some blind spots that I previously ignored.
The most important one is that in zipped_divide(target, tiler).compose(layout_tv, _), I always wanted to derive the tiler from thr_layout and val_layout. This is why I always relied on raked_product instead of being explicit about the tiler.
derived_tiler = raked_product(thr_layout, val_layout); // this is the confusing part; it should be explicit instead
thrval_layout = raked_product(thr_layout, val_layout); // forward mapping
layout_tv = right_inverse(thrval_layout).with_shape(size(thr_layout), size(val_layout)); // reverse mapping
In my previous example, each warp processes a whole XY slice and iterates over 8 Z indices. By relying on raked_product, the only thing I could achieve was each warp iterating over 8 consecutive Z indices, and I was wondering how to achieve iterating over 8 strided Z indices. I know how to do it now: throw away the derived tiler and design the tiler explicitly as well.
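To sanity-check that idea, a simplified, fully static sketch (my own; the fused 64-element XY mode and the stride of 8 are illustrative stand-ins, not values taken from the example above):
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Stand-in target: a fused XY mode of size 64 and a Z mode of size 32.
  auto tcoord = make_identity_tensor(make_shape(Int<64>{}, Int<32>{}));
  // Derived-style tiler: each tile holds 4 consecutive Z indices, so one
  // warp's Z index advances by 4 per iteration.
  auto tile_consec = make_tile(Int<64>{}, Layout<_4, _1>{});
  // Explicit layout tiler: each tile holds 4 Z indices strided by 8, so
  // one warp's Z index advances by 1 per iteration instead.
  auto tile_strided = make_tile(Int<64>{}, Layout<_4, _8>{});
  auto a = zipped_divide(tcoord, tile_consec);  // ((tile),(iter))
  auto b = zipped_divide(tcoord, tile_strided);
  // Same tile slot, second iteration: compare which Z coordinate each maps to.
  print(a(make_coord(0, 1), 1)); print("\n");
  print(b(make_coord(0, 1), 1)); print("\n");
  return 0;
}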
I also see some blind spots of yours now.
By applying zipped_divide to the raked_product, you're asking for the composition of incompatible objects.
I wouldn't call it incompatible: by my thr_layout and val_layout design, the result of raked_product should cover the whole XY slice, so zipped_divide should only produce an iteration mode over the Z indices, (1,1,8), because that mode is not fully covered. But if you imagine XY growing from 16x64 to something like 17x64, then the iteration mode grows from (1,1,8) to (2,1,8). This totally makes sense because of the over-approximating nature.
The iter_layout being composed with the original tensor doesn't make much sense either.
Sorry for the confusion; this is some uncleaned code left over from copying. The target.compose(layout_tv_iter) is a no-op if you read it as a whole, as tensor.compose(tensor_layout.compose(layout_tv_iter)); it should be cleaned up to tensor.compose(layout_tv_iter). It just happens not to cause errors.
For the rank-sensitivity of raked_product, I'd imagine a utility that does record hierarchy -> flatten -> product -> unflatten with the recorded hierarchy, but it may not be useful...
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Closing due to inactivity
@mnicely Sorry, but I think this is still relevant. Inactivity does not imply it is solved.
The bugs I pointed out in https://github.com/NVIDIA/cutlass/issues/1284#issuecomment-1870749414 were fixed with the 3.4 docs update. Those fixes strengthen and improve the post-conditions of rank-sensitive raked_product and blocked_product.
The rest of the concepts in this thread remain intact, but the functionality of the original code should be more consistent with expectations.
I manually verified with the latest main @ bbe579a9e3beb6ea6626d9227ec32d0dae119a49. Case 1 works now (i.e., 1-D data, Thr, and Val). Case 3 works as expected now (i.e., 2-D data with 1-D Thr and Val, with automatic padding). I'd assume higher-rank cases will work as expected as well.
@ccecka I had been thinking that this was caused by a design that intrinsically supported only 2-D, but it turned out to be a bug. Sorry for the misunderstanding.