cutlass icon indicating copy to clipboard operation
cutlass copied to clipboard

[BUG] Tmem tiled copy with non power-of-2 size fails to compile

Open tridao opened this issue 9 months ago • 6 comments
trafficstars

using namespace cute;
auto tmem_layout = make_layout(make_shape(_128{}, _160{}), make_stride(Int<65536>{}, _1{}));
Tensor A = make_tensor(make_tmem_ptr<float>(0), tmem_layout);
auto load = make_tmem_copy(SM100_TMEM_LOAD_32dp32b1x{}, A);

This fails to compile, with shape_div error. This is because make_tmem_copy calls make_cotile_copy, which calls left_inverse on the data layout. In this case left_inverse gives shape (65440, 1) to make it divisible by 160, but that leads to layout mismatch.

Is there another way I should be constructing the tmem tiled copy with non power-of-2 size?

Cc @thakkarV

tridao avatar Feb 10 '25 02:02 tridao

Hmmmm, this is a really interesting bug, thanks Tri.

The core issue seems to be in my lazy implementation of left_inverse, which is simply incorrect in that case.

Indeed, when I hack in the correct left-inverse layouts to make_cotiled_copy, a valid partitioner (the same partitioner) is returned that performs exactly as we would expect

  {
  Layout tmem_layout = make_layout(make_shape(_128{}, _128{}), make_stride(Int<65536>{}, _1{}));
  Tensor A = make_tensor(make_tmem_ptr<float>(0), tmem_layout);
  TiledCopy load = make_tmem_copy(SM100_TMEM_LOAD_32dp32b1x{}, A);

  print(tmem_layout); print("\n");
  print(complement(tmem_layout)); print("\n");
  print(left_inverse(tmem_layout)); print("\n");
  print(load);
  print(load.get_slice(0).partition_S(A));  print("\n");
  }

  {
  Layout tmem_layout = make_layout(make_shape(_128{}, _160{}), make_stride(Int<65536>{}, _1{}));
  Tensor A = make_tensor(make_tmem_ptr<float>(0), tmem_layout);
  TiledCopy load = make_tmem_copy(SM100_TMEM_LOAD_32dp32b1x{}, A);

  print(tmem_layout); print("\n");
  print(complement(tmem_layout)); print("\n");
  print(left_inverse(tmem_layout)); print("\n");  // XXX Wrong
  print(load);
  print(load.get_slice(0).partition_S(A));   print("\n");
  }

prints

(_128,_128):(_65536,_1)
_512:_128
(_65536,_128):(_128,_1)
TiledCopy
  Tiler_MN:       ((_4,_32):(_32,_1),_1:_0)
  TiledLayout_TV: ((_32,_4),_32):((_0,_1),_4)
Copy_Atom
  ThrID:        _32:_1
  ValLayoutSrc: (_32,_32):(_0,_1)
  ValLayoutDst: (_32,_1):(_1,_1)
  ValLayoutRef: (_32,_32):(_0,_1)
  ValueType:    32b
tmem_[32b](0x0000.0000) o ((_32,_1),_1,_128):((_65536,_0),_0,_1)
(_128,_160):(_65536,_1)
_409:_160
_65440:_128
TiledCopy
  Tiler_MN:       ((_4,_32):(_32,_1),_1:_0)
  TiledLayout_TV: ((_32,_4),_32):((_0,_1),_4)
Copy_Atom
  ThrID:        _32:_1
  ValLayoutSrc: (_32,_32):(_0,_1)
  ValLayoutDst: (_32,_1):(_1,_1)
  ValLayoutRef: (_32,_32):(_0,_1)
  ValueType:    32b
tmem_[32b](0x0000.0000) o ((_32,_1),_1,_160):((_65536,_0),_0,_1)

Give me a little time to correct the implementation of left_inverse -- I'll put the patch here and get it merged asap.

ccecka avatar Feb 10 '25 03:02 ccecka

In the meantime, this also suggests that an immediate workaround is to build the partitioner with a very similar TMEM tensor and then use it on your actual TMEM tensor.

ccecka avatar Feb 10 '25 03:02 ccecka

In the meantime, this also suggests that an immediate workaround is to build the partitioner with a very similar TMEM tensor and then use it on your actual TMEM tensor.

Yup that's the workaround i'm using right now: creating a fake tmem tensor of size (128, 128) to get the tiled tmem_copy.

tridao avatar Feb 10 '25 03:02 tridao

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Mar 12 '25 04:03 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Jun 10 '25 04:06 github-actions[bot]

This should just work now. Let me know if there are any issues with the normal pattern.

ccecka avatar Jun 10 '25 04:06 ccecka

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Sep 08 '25 05:09 github-actions[bot]