[QST] What's the concept of sk regions in streamK?

Open blueoyster6 opened this issue 10 months ago • 4 comments

What does an sk region denote in streamK and why is the condition (sk_blocks > sk_tiles) && (sk_blocks % sk_tiles == 0) needed for it to exist? Found here: sk region

blueoyster6 avatar Apr 13 '24 04:04 blueoyster6

@jackkosaian

hwu36 avatar Apr 24 '24 02:04 hwu36

sk_regions indicates the number of sub-partitions of the sk_tiles that will be covered by groups of stream-K blocks.

You can see that, by default, this value is 1: all stream-K blocks collaborate to compute the whole space of stream-K tiles (though not every stream-K block contributes to every stream-K tile).

The case in which sk_regions != 1 is when a split-K decomposition is selected (see here):

      if ((sk_blocks > sk_tiles) && (sk_blocks % sk_tiles == 0))
      {
        // Split-K decomposition
        sk_regions = sk_tiles;
      }

This condition indicates that a split-K decomposition is used because the stream-K blocks can be evenly divided amongst the stream-K tiles. For example, if we have 4 stream-K blocks and 2 stream-K tiles, each stream-K tile can be computed by two stream-K blocks (one that computes the first half of the K iteration space and one that computes the second half). Thus, the number of "regions" of stream-K blocks that collaborate is equal to the number of sk_tiles.
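To make the 4-block, 2-tile example concrete, here is a minimal sketch of the region selection. `SkPartition` and `partition_sk_blocks` are hypothetical names made up for illustration, not CUTLASS APIs, and the `sk_tiles > 0` guard is an added assumption to keep the modulo well-defined.

```cpp
#include <cassert>

// Hypothetical illustration type, not part of CUTLASS.
struct SkPartition {
  int sk_regions;            // number of independent groups of stream-K blocks
  int sk_blocks_per_region;  // stream-K blocks collaborating within one group
};

// Hypothetical helper that mirrors the condition quoted above.
inline SkPartition partition_sk_blocks(int sk_blocks, int sk_tiles) {
  int sk_regions = 1;  // default: a single region spanning all stream-K tiles

  // Assumption: guard against sk_tiles == 0 so the modulo is well-defined.
  if (sk_tiles > 0 && (sk_blocks > sk_tiles) && (sk_blocks % sk_tiles == 0)) {
    // Split-K decomposition: one region per stream-K tile.
    sk_regions = sk_tiles;
  }

  return SkPartition{sk_regions, sk_blocks / sk_regions};
}
```

With 4 stream-K blocks and 2 stream-K tiles this yields 2 regions of 2 blocks each. With 6 blocks and 4 tiles the blocks do not divide evenly, so the sketch falls back to a single region containing all 6 blocks.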

jackkosaian avatar Apr 24 '24 20:04 jackkosaian

Thanks! That makes sense. Additionally, what's the concept of a cohort raster? And what is cohort CTA rasterization? See these lines in streamK.

blueoyster6 avatar May 06 '24 08:05 blueoyster6

The concept of a cohort is a structuring of the assignment of output tiles to CTAs that tries to achieve high L2 cache reuse. It's attempting to mirror the concept of CTA swizzling that's performed in non-stream-K CUTLASS kernels (e.g., assigning an 8x8 chunk of output tiles to a set of 64 CTAs, rather than a 64x1 or 1x64 chunk, so as to maximize L2 cache reuse).

Cohort rasterization in stream-K is attempting to regain the swizzling benefits that one might get from using one of CUTLASS's swizzling methods (e.g., Identity<8>), which cannot otherwise be used with the 2.x implementation of stream-K because we use the ThreadblockSwizzle template parameter to indicate that one should perform stream-K.
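The idea of assigning a square chunk of output tiles to a group of CTAs can be sketched as below. This is only an illustration of the general technique, not the mapping CUTLASS actually uses: `TileCoord`, `cohort_raster`, and `distinct_operand_panels` are hypothetical names, and the sketch assumes the number of tile columns is a multiple of the cohort width.

```cpp
#include <cassert>

// Hypothetical illustration type: coordinates of an output tile.
struct TileCoord {
  int m;  // tile row
  int n;  // tile column
};

// Map a linear CTA index to a tile coordinate, walking the output in
// cohort_dim x cohort_dim chunks ("cohorts") instead of full rows.
// Assumption: tiles_n is a multiple of cohort_dim.
inline TileCoord cohort_raster(int linear_idx, int tiles_n, int cohort_dim) {
  int tiles_per_cohort = cohort_dim * cohort_dim;
  int cohorts_n = tiles_n / cohort_dim;           // cohorts per row of cohorts

  int cohort_idx = linear_idx / tiles_per_cohort; // which cohort
  int within = linear_idx % tiles_per_cohort;     // position inside the cohort

  int cohort_m = cohort_idx / cohorts_n;
  int cohort_n = cohort_idx % cohorts_n;

  return TileCoord{cohort_m * cohort_dim + within / cohort_dim,
                   cohort_n * cohort_dim + within % cohort_dim};
}

// Distinct A row-panels plus B column-panels touched by a rows x cols chunk
// of output tiles; fewer distinct panels leaves more room for L2 reuse.
inline int distinct_operand_panels(int rows, int cols) {
  return rows + cols;
}
```

For 64 CTAs, an 8x8 chunk touches 16 distinct operand panels, whereas a 1x64 or 64x1 chunk touches 65, which is the intuition behind preferring the square arrangement.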

jackkosaian avatar May 06 '24 14:05 jackkosaian

Got it. Two more questions:

  1. In the linked lines, how were the factors for the iter, base, and peer costs chosen?
  2. In the linked line, what do the epilogue accumulator fragments denote? How is their number calculated, and why are n reduction blocks launched for n accumulator fragments for each sk tile?

blueoyster6 avatar May 14 '24 07:05 blueoyster6

lines, how did they choose the factors for iter, base, and peer costs??

They were chosen empirically, via experiments.

In line, what do the epilogue accumulator fragments denote? How is their number calculated, and why are n reduction blocks launched for n accumulator fragments for each sk tile?

They are roughly the partial accumulators that each thread holds. Each of them needs to go through a final reduction to produce the final result.
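As a rough host-side sketch of that idea, assume each block that contributed to a tile has stored its partial accumulators, split into fragments, and that one reduction block sums a single fragment across all contributors. The names `reduce_fragment` and `reduction_block_count` are hypothetical, each fragment is simplified to a single float, and the block count simply follows the "n reduction blocks for n fragments per sk tile" description in the question.

```cpp
#include <cassert>
#include <vector>

// partials[peer][fragment]: partial accumulator value written by one
// contributing block ("peer") for one fragment of the tile.
// Simplification: a fragment is a single float here.
inline float reduce_fragment(const std::vector<std::vector<float>>& partials,
                             int fragment_idx) {
  float sum = 0.0f;
  for (const auto& peer : partials) {
    sum += peer[fragment_idx];  // accumulate this fragment across all peers
  }
  return sum;
}

// One reduction block per accumulator fragment per stream-K tile, following
// the description in the question above.
inline int reduction_block_count(int sk_tiles, int acc_fragments) {
  return sk_tiles * acc_fragments;
}
```

Under these assumptions, splitting the reduction by fragment lets the fragments of one tile be reduced independently of each other.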

hwu36 avatar Sep 20 '24 21:09 hwu36