cutile-python [FEA]: Efficient Intra-Tile Shift/Shuffle or Relaxed extract/cat Constraints

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request?

Critical (currently preventing usage)

Please provide a clear description of problem this feature solves

I am implementing a wavefront-style parallel algorithm where each thread needs to access the value computed by its neighbor in the previous step. Mathematically, this represents a "Shift Right" operation on a 1D Tile residing in registers: new_vec[i] = old_vec[i-1].

Currently, cuTile enforces strict Power-of-2 shape constraints on extract and cat. This makes it impossible to implement a shift by extracting the first $N-1$ elements and concatenating a boundary value (e.g., splitting a size-128 tile into a size-1 boundary and a size-127 slice is forbidden).

Real usage example: In stencil computations or dynamic programming wavefronts, data often flows diagonally or horizontally between threads. Without a register-level shift, developers are forced to use high-overhead workarounds:

Global Memory: Writing to global memory and reading back with an offset (scatter gather).
Matrix Multiplication: Constructing a shift matrix and using mma to perform the shift. This works but is computationally expensive (overkill) for a simple data movement operation.

Feature Description

As a high-performance kernel developer,
I want to efficiently shift or rotate elements within a Tile (intra-tile communication),
So that I can implement stencil and wavefront dependencies entirely within registers without incurring global memory latency or Tensor Core overhead.

Describe your ideal solution

I propose adding a dedicated primitive for intra-tile communication, which maps to efficient hardware instructions (like __shfl_up_sync or __shfl_down_sync in CUDA).

Proposed API:

# Shift elements to the right by 'shift_amount'.
# Elements shifted in are filled with 'fill_value'.
output_tile = ct.shift(input_tile, shift_amount=1, fill_value=0)

Alternative Solution: Relax the Power-of-2 constraint for ct.extract and ct.cat. If the library allowed operations on arbitrary shapes (e.g., extracting a size-127 tile), users could manually implement shifts via slicing and concatenation:

# Ideally, this should be allowed:
slice = ct.extract(val, index=(0,), shape=(127,))
boundary = ct.full((1,), 0, dtype=ct.int32)
shifted = ct.cat(boundary, slice, axis=0)

Describe any alternatives you have considered

No response

Additional context

No response

Contributing Guidelines

[x] I agree to follow cuTile Python's contributing guidelines
[x] I have searched the open feature requests and have found no duplicates for this feature request

Dec 10 '25 17:12 zzczzc20

Hello!

Thanks for your thoughtful writeup. There's a lot of overlap with what you want and what I've been exploring for supporting stencils within cuTile. I've been imagining either a shift operation or a rotate operation. I don't think it necessarily has to lower to shuffle instructions - it's just a change of indexing/CuTE layout algebra I believe.

We actually need a few different pieces:

A way to do overlapping loads across blocks (this can be done today in TileIR, but it's not clean).
A way to subdivide a tile into overlapping subtiles, e.g. either a rotate or a shift. This could be implementable today with scan operations in TileIR, but it would be inefficient.
A way to store non-power-of-2 subsets of a power-of-2 tile. Otherwise, you'll end up with conflicting writes.

What I have in mind is something like this:

@ct.kernel
def heat_equation_jacobi_2d_stencil(U, V, tx: ct.Constant[int], ty: ct.Constant[int]):
  TU = ct.load(A, index=(ct.bid(0), ct.bid(1)), shape=(tx, ty), strides=(tx - 2, ty - 2))

  S = ct.view.rotate(TU)

  center = ct.rotate(TU, offset=( 0,  0))
  north  = ct.rotate(TU, offset=( 0, -1))
  east   = ct.rotate(TU, offset=( 1,  0))
  west   = ct.rotate(TU, offset=(-1,  0))
  south  = ct.rotate(TU, offset=( 0,  1))

  TV = (center + north + east + west + south) * 0.2

  ct.store(A, index=(ct.bid(0), ct.bid(1)), tile=TV, bounding_box=((1, tx - 1), (1, ty - 1)))

Or a formulation with a "tile view" (similar to TileIR's partition_view) that simplifies things for multiple rotations on the same tile:

@ct.kernel
def heat_equation_jacobi_2d_stencil(U, V, tx: ct.Constant[int], ty: ct.Constant[int]):
  TU = ct.load(A, index=(ct.bid(0), ct.bid(1)), shape=(tx, ty), strides=(tx - 2, ty - 2))

  S = ct.view.rotate(TU)

  TV = (S[0, 0] + S[0, -1] + S[1, 0] + S[-1, 0] + S[0, 1]) * 0.2

  ct.store(A, index=(ct.bid(0), ct.bid(1)), tile=TV, bounding_box=((1, tx - 1), (1, ty - 1)))

Rest assured, this is very much on our radar.

I'd be interested to hear more about your use case and the sort of applications you're thinking about writing in cuTile.

As for relaxing the power-of-2 tile dimension constraint, that's something we'd like to do in the future, but it's a deep architectural change and it will take a while to figure out.

Dec 11 '25 23:12 brycelelbach

Hi @brycelelbach,

Thank you for the quick and detailed response!

The ct.rotate or ct.view API you proposed looks like an excellent solution for our problem. Being able to access neighbor elements via relative indexing (e.g., S[0, -1]) would allow us to express data dependencies naturally and efficiently, eliminating the need for heavy workarounds like Matrix Multiplication (MMA) or Global Memory round-trips just to move data between lanes.

Regarding the use case: I am working on high-performance scientific computing kernels that involve wavefront parallelism and dynamic programming. In these algorithms, the data dependency flows diagonally or horizontally through the grid, meaning each thread heavily relies on the results computed by its neighbors in the previous step. The current bottleneck is exactly the overhead of this intra-tile state propagation.

We are very much looking forward to this feature. Is there a rough timeline for when we might see this in a release or a development branch?

Thanks again!

Dec 12 '25 05:12 zzczzc20

We are very much looking forward to this feature. Is there a rough timeline for when we might see this in a release or a development branch?

We can't commit to a timeline right now, but it's coming soon.

Dec 12 '25 05:12 brycelelbach