
[QST] GemmArray with overlapping output chunks.

Open · cydoroga opened this issue 2 years ago · 5 comments

Hi!

I have a batched matrix multiply problem with no fixed stride between batches. A minimal example is the following (all matrices are RowMajor): I want to calculate $O = W_0 X_0 + W_1 X_1$, where

$W_0 = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$, $W_1 = \begin{bmatrix} -1 & -1 \\ -1 & -1 \end{bmatrix}$,

$X_0 = X_1 = \big[ 1, 1 \big]^T$

Since $W_0 X_0 = \big[ 2, 2 \big]^T$ and $W_1 X_1 = \big[ -2, -2 \big]^T$, the correct answer is $O = \big[ 0, 0 \big]^T$.

We can express the problem with the GemmArray kernel. Below are the variables, with the pointers to the beginning of each batch:

```
    ptr0        ptr1
     v           v
W = [1, 1, 1, 1, -1, -1, -1, -1]

    ptr0  ptr1
     v     v
X = [1, 1, 1, 1]

    ptr1
    ptr0
     v
O = [.., ..]
```

Both batches point to the same output location. The epilogue is configured so that the desired summation happens (alpha = 1, beta = 1).
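For concreteness, here is a minimal host-side sketch of how this setup might be passed to the device-level GemmArray. The template defaults and, in particular, the order of the `Arguments` fields are assumptions on my side (check include/cutlass/gemm/device/gemm_array.h for your version); the point is only that both entries of the D-pointer array alias O, which is exactly the overlap discussed below.

```cpp
// Sketch only: the GemmArray template defaults and the Arguments field order
// below are assumptions; consult include/cutlass/gemm/device/gemm_array.h.
#include <vector>
#include <cuda_runtime.h>
#include "cutlass/gemm/device/gemm_array.h"

using GemmArray = cutlass::gemm::device::GemmArray<
    float, cutlass::layout::RowMajor,   // A = W_i, 2x2
    float, cutlass::layout::RowMajor,   // B = X_i, 2x1
    float, cutlass::layout::RowMajor>;  // C/D = O,  2x1

cutlass::Status run(float const* W, float const* X, float* O) {
  // Per-batch device pointers, matching the diagrams above.
  std::vector<float const*> ptr_A = {W, W + 4};  // W_0, W_1
  std::vector<float const*> ptr_B = {X, X + 2};  // X_0, X_1
  std::vector<float const*> ptr_C = {O, O};      // both batches read O
  std::vector<float*>       ptr_D = {O, O};      // both batches write O (the overlap)

  // The kernel expects the pointer arrays themselves in device memory.
  float const **dA, **dB, **dC;
  float **dD;
  cudaMalloc(&dA, 2 * sizeof(float const*));
  cudaMalloc(&dB, 2 * sizeof(float const*));
  cudaMalloc(&dC, 2 * sizeof(float const*));
  cudaMalloc(&dD, 2 * sizeof(float*));
  cudaMemcpy(dA, ptr_A.data(), 2 * sizeof(float const*), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, ptr_B.data(), 2 * sizeof(float const*), cudaMemcpyHostToDevice);
  cudaMemcpy(dC, ptr_C.data(), 2 * sizeof(float const*), cudaMemcpyHostToDevice);
  cudaMemcpy(dD, ptr_D.data(), 2 * sizeof(float*),       cudaMemcpyHostToDevice);

  GemmArray gemm_op;
  return gemm_op({
      {2, 1, 2},     // problem size: M = 2, N = 1, K = 2
      dA, 2,         // A pointers, lda
      dB, 1,         // B pointers, ldb
      dC, 1,         // C pointers, ldc
      dD, 1,         // D pointers, ldd
      {1.0f, 1.0f},  // epilogue: alpha = 1, beta = 1
      2              // batch count
  });
}
```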

However, in practice, the desired result is never produced by GemmArray: instead, O contains the result of whichever threadblock happened to write last. Something like this (varying randomly from run to run): $O = \big[ 2, 2 \big]^T$, $O = \big[ -2, -2 \big]^T$, $O = \big[ 2, -2 \big]^T$, $O = \big[ -2, 2 \big]^T$.

The cuBLAS documentation explicitly states that this problem statement is invalid: "O[i] matrices must not overlap, i.e. the individual gemm operations must be computable independently; otherwise, undefined behavior is expected."

There is probably a similar warning somewhere in the CUTLASS documentation.

The question: is there a way to get the desired output from GemmArray when the output chunks overlap, hopefully without a dramatic loss of performance?

Thanks in advance!

cydoroga · Jul 18 '22 21:07

As with cuBLAS, your case is not supported by CUTLASS without code changes.

However, you can use a semaphore to control the order of the stores to global memory. This method is used by serial split-K. Check:

https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/gemm_universal.h#L607-L618

https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/gemm_universal.h#L657-L666

https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/gemm_universal.h#L680-L695
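For intuition, here is a small standalone CUDA sketch of the same idea (my own illustration, not the CUTLASS implementation): a per-output lock forces the threadblocks that share an output to perform their read-modify-write stores one after another in a fixed order, which is what the fetch/wait/release calls in the linked serial split-K code achieve.

```cuda
// Standalone CUDA sketch of ordered accumulation into a shared output.
// Illustrates the semaphore idea only; it is not CUTLASS code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void ordered_accumulate(const float* partials, float* out, int n,
                                   int* ticket) {
  int writer = blockIdx.x;  // this block's position in the write order

  // Spin until the previous writer has released the lock. Like serial
  // split-K, this relies on blocks being dispatched in increasing index
  // order, so the chain of waits always bottoms out at writer 0.
  if (threadIdx.x == 0) {
    while (atomicAdd(ticket, 0) != writer) { }
  }
  __syncthreads();

  // Accumulate this block's partial result (the beta = 1 behaviour).
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    out[i] += partials[writer * n + i];
  }
  __syncthreads();

  // Make the stores visible, then pass the lock to the next writer.
  if (threadIdx.x == 0) {
    __threadfence();
    atomicAdd(ticket, 1);
  }
}

int main() {
  // Two writers, as in the example from the question: W_0*X_0 and W_1*X_1.
  float h_partials[4] = {2.f, 2.f, -2.f, -2.f};
  float h_out[2] = {0.f, 0.f};
  float *d_partials, *d_out;
  int* d_ticket;
  cudaMalloc(&d_partials, sizeof(h_partials));
  cudaMalloc(&d_out, sizeof(h_out));
  cudaMalloc(&d_ticket, sizeof(int));
  cudaMemcpy(d_partials, h_partials, sizeof(h_partials), cudaMemcpyHostToDevice);
  cudaMemcpy(d_out, h_out, sizeof(h_out), cudaMemcpyHostToDevice);
  cudaMemset(d_ticket, 0, sizeof(int));

  ordered_accumulate<<<2, 32>>>(d_partials, d_out, 2, d_ticket);
  cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
  printf("O = [%g, %g]\n", h_out[0], h_out[1]);  // expected: O = [0, 0]
  return 0;
}
```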

hwu36 · Jul 19 '22 02:07

Thank you a lot! Am I right that the only reason I get a wrong answer is that threadblocks use memory that should have been locked? If I introduce proper ordering between the threadblocks that write to the same output chunk, will they give me the sum in the result? Or do I additionally need to modify the output_op or something like that?

cydoroga · Jul 19 '22 12:07

> Am I right that the only reason I get a wrong answer is that threadblocks use memory that should have been locked? If I introduce proper ordering between the threadblocks that write to the same output chunk, will they give me the sum in the result?

correct.

> Or do I additionally need to modify the output_op or something like that?

You need to set beta of the 2nd one to 1 like this: https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/epilogue/thread/linear_combination.h#L166
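To illustrate the arithmetic with the numbers from the original example: the epilogue computes D = alpha * accum + beta * C, and the `set_k_partition()` linked above switches beta to 1 for every writer after the first, so later writers accumulate into the output instead of overwriting it. A tiny host-side sketch of that arithmetic (my own illustration):

```cpp
// Why the second writer needs beta == 1: with alpha = 1 and the second
// writer keeping beta = 0, it would overwrite W_0 * X_0 instead of adding.
#include <cstdio>

int main() {
  float O[2] = {0.f, 0.f};
  float partial0[2] = { 2.f,  2.f};   // W_0 * X_0
  float partial1[2] = {-2.f, -2.f};   // W_1 * X_1
  float alpha = 1.f;

  // First writer: beta = 0 (or beta = 1 with O zero-initialized).
  for (int i = 0; i < 2; ++i) O[i] = alpha * partial0[i] + 0.f * O[i];

  // Second writer: beta forced to 1, which is what set_k_partition does for
  // every partition after the first in linear_combination.h.
  for (int i = 0; i < 2; ++i) O[i] = alpha * partial1[i] + 1.f * O[i];

  printf("O = [%g, %g]\n", O[0], O[1]);   // prints O = [0, 0]
  return 0;
}
```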

hwu36 · Jul 19 '22 13:07

Hi! I've eventually implemented a version of GemmArray that can handle overlapping outputs.

It required adding a semaphore, a workspace for the semaphore, and three new parameters:

  • overlap_multiplicity - the number of batches pointing to the same output (for now, this works only if it is equal for all batches)
  • output_queue - an array of integers that specifies the order in which blocks get access to the output. All blocks pointing to the same output should have distinct consecutive integers so that the semaphore works properly
  • to_which_output_batch - an array that maps each batch to its output batch. For example, if A and B have 2K batches but C has only K batches with an overlap_multiplicity of 2, then to_which_output_batch has length 2K with values from {0, ..., K-1} (see the sketch after this list)
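For illustration, here is a hypothetical host-side construction of the latter two arrays for the 2K-batch example, assuming consecutive pairs of input batches share one output batch. The parameter names come from the description above; the actual patch may arrange things differently.

```cpp
#include <vector>

// Hypothetical setup for K output batches with overlap_multiplicity = 2,
// assuming input batches (2k, 2k + 1) both write output batch k.
void make_overlap_params(int K,
                         std::vector<int>& output_queue,
                         std::vector<int>& to_which_output_batch) {
  const int overlap_multiplicity = 2;
  const int batch_count = overlap_multiplicity * K;  // batches in A and B

  output_queue.resize(batch_count);
  to_which_output_batch.resize(batch_count);

  for (int b = 0; b < batch_count; ++b) {
    // Which output batch this input batch contributes to: 0, 0, 1, 1, ...
    to_which_output_batch[b] = b / overlap_multiplicity;
    // Its turn in the write order for that output: 0, 1, 0, 1, ...
    output_queue[b] = b % overlap_multiplicity;
  }
}
```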

Currently it looks somewhat complicated, so if you have any recommendations on how to improve the code, I can apply them and open a PR, if you are interested, of course.

Further, I'm planning to implement a BlockSparse matmul built on top of the GemmArray with overlapping outputs (in order to implement a Pixelated Butterfly layer).

If you are interested in both of them (GemmArrayOverlapping and PixelatedButterfly) or just one, I can open PRs.

Here are the files changed:

  • gemm/device/gemm_array.h
  • gemm/kernel/gemm_array.h

cydoroga · Jul 23 '22 16:07

Congratulations!

If possible, maybe you can use blockIdx.z to decide the write order.

When you finish your code, you can first make your repository public. We can help guide people to your repository when they have the same need.

hwu36 · Jul 24 '22 01:07

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] · Sep 09 '22 10:09

Hi @cydoroga. I'm working on implementing a similar shared-output GemmArray and I found this issue. Could you consider making the patch public? It would be really helpful for people like me who have similar needs. Thank you!

getianao · Dec 27 '23 03:12