[QST] Fused 2-GEMM result error
Also for this case: I tried some other parameters to verify the result, such as

```cpp
cutlass::gemm::GemmCoord gemm_f16_sm75_problem_size_0(10, 64, 576);
cutlass::gemm::GemmCoord gemm_f16_sm75_problem_size_1(10, 128, 64);
```

which runs OK and passes the test.

However, with this configuration

```cpp
cutlass::gemm::GemmCoord gemm_f16_sm75_problem_size_0(10, 256, 768);
cutlass::gemm::GemmCoord gemm_f16_sm75_problem_size_1(10, 128, 256);
```

it runs, but the result check fails. I want to know how to debug this configuration, and whether there are any implicit parameters that have to be tuned for each case.
@jwang323
@lw921014 could you please post the threadblock and warp tile sizes? In case you haven't tried it, please sanity-check whether the following requirement is met: problem_N = threadblock_N = warp_N
Thanks a lot. I know the requirement problem_N = threadblock_N = warp_N. My question is whether both problem 0 and problem 1 should meet it at the same time, i.e. problem0_N = threadblock0_N = warp0_N and problem1_N = threadblock1_N = warp1_N?
Correct.
> Correct.

Do we need problem0_N = problem1_N?
> problem0_N = threadblock0_N = warp0_N
> problem1_N = threadblock1_N = warp1_N

problem0_N doesn't have to be the same as problem1_N. It is only required that problem0_N = problem1_K.
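For illustration, the size constraints discussed in this thread can be collected into a small sanity check. This is my own sketch, not a CUTLASS API; check_b2b_problem_sizes is a hypothetical helper name:

```cpp
#include <cassert>
#include "cutlass/gemm/gemm.h"

// Sketch: the chained GEMMs share M, and the output of GEMM0 (M x N0)
// is the A operand of GEMM1 (M x K1), so N0 must equal K1.
void check_b2b_problem_sizes(cutlass::gemm::GemmCoord p0,
                             cutlass::gemm::GemmCoord p1) {
  assert(p0.m() == p1.m()); // same M for both GEMMs
  assert(p0.n() == p1.k()); // problem0_N = problem1_K
}
```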
OK. For this parameter group we satisfied problem0_N = threadblock0_N = warp0_N and problem1_N = threadblock1_N = warp1_N, but it still failed.
You'll need the same number of warps for each GEMM.
In your example above, you use 4 warps for the 1st GEMM, but use 2 warps for the 2nd GEMM.
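For reference, the warp count of a threadblock follows directly from the tile shapes. A minimal sketch of the arithmetic (my own helper, not a CUTLASS API):

```cpp
#include "cutlass/gemm/gemm.h"

// Warps per threadblock = product of the threadblock/warp tile ratios.
template <typename ThreadblockShape, typename WarpShape>
constexpr int warp_count() {
  return (ThreadblockShape::kM / WarpShape::kM) *
         (ThreadblockShape::kN / WarpShape::kN) *
         (ThreadblockShape::kK / WarpShape::kK);
}

// E.g. warp_count<cutlass::gemm::GemmShape<64, 256, 32>,
//                 cutlass::gemm::GemmShape<16, 256, 32>>() == 4.
```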
I changed the parameters to make sure each GEMM uses the same 4 warps; however, it still failed.

Your code snapshot shows the function run_nonfused_gemm_f16(). Did you also change the tile sizes for run_fused_gemm_f16()?
Here is a combination that works for me:
```cpp
using ThreadblockShape0 = cutlass::gemm::GemmShape<64, 256, 32>;
using WarpShape0        = cutlass::gemm::GemmShape<16, 256, 32>;
using ThreadblockShape1 = cutlass::gemm::GemmShape<64, 128, 32>;
using WarpShape1        = cutlass::gemm::GemmShape<16, 128, 32>;
using InstructionShape  = cutlass::gemm::GemmShape<16, 8, 8>;
```
Note that this may not be performant compared with the non-fused case due to the small warp_M = 16. Also, the large warp_N = 256 causes register-file (RF) spills.
Sorry, it was my mistake. After changing it as above, it runs OK.
> Note that this may not be performant compared with the non-fused case due to the small warp_M = 16. Also, the large warp_N = 256 causes register-file (RF) spills.
Since warp_N must be the same as problem_N: for a problem size with a large N, is non-fused always better than fused? If not, can you give some advice on tuning the performance?
> warp_N must be the same as problem_N
It is true for this example, since we need the input A matrix of the 2nd GEMM to be RF-resident, and each warp computes one warp tile against the entire input B matrix. You could reduce warp1_N, e.g. warp1_N = 1/2 problem_N, but you would need to change the warp tile mapping so that each warp computes two warp tiles along the N dimension.
The other solution is to use shared-memory staging for the output of the 1st GEMM so that the warp_N restriction can be relaxed. We will add an example in the future.
Along the same lines, you may hit the shared memory size limit for large threadblock_N. On Turing, shared memory size is 64KB. Using threadblock shape (64, 256, 32) as an example, you need (64×32 + 256×32) × (2 stages) × (2 bytes for fp16) = 40KB. If you increase threadblock_N, you will have to reduce threadblock_M or threadblock_K to cut shared memory usage, which can lead to performance inefficiency.
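For reference, a minimal sketch of this shared memory estimate, assuming double-buffered fp16 A and B tiles as in the calculation above (smem_bytes is a hypothetical helper, not a CUTLASS API):

```cpp
#include <cstdio>

// Per-stage tiles: A is tb_m x tb_k, B is tb_n x tb_k.
constexpr int smem_bytes(int tb_m, int tb_n, int tb_k,
                         int stages = 2, int elem_bytes = 2 /* fp16 */) {
  return (tb_m * tb_k + tb_n * tb_k) * stages * elem_bytes;
}

int main() {
  // Threadblock shape (64, 256, 32): 40960 bytes = 40KB,
  // under Turing's 64KB shared memory limit.
  std::printf("%d bytes\n", smem_bytes(64, 256, 32));
}
```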
ok. Thanks a lot.
> The other solution is to use shared-memory staging for the output of the 1st GEMM so that the warp_N restriction can be relaxed. We will add an example in the future.
You mean in 2.9?
@jwang323 I found another question about the result. For problem size 0 (M = 1, N = 256, K = 64) and problem size 1 (M = 1, N = 128, K = 256), I ran this example without changing any parameters, and the result was correct.

However, with the same problem sizes, when I fill A0, B0, C0, B1, and C1 with all 0.1 and set alpha0, alpha1, beta0, and beta1 all to 1, the result check fails. The data is shown in [error_B2bGemm_device_fused.txt]. Is this a bug?

I also find that in this line, EpilogueOutputOp0 is set to cutlass::epilogue::thread::ScaleType::OnlyAlphaScaling; does this ignore beta? And do we have some method to support the case where C is a vector?
> I also find that in this line, EpilogueOutputOp0 is set to cutlass::epilogue::thread::ScaleType::OnlyAlphaScaling; does this ignore beta?
Yes, cutlass::epilogue::thread::ScaleType::OnlyAlphaScaling only applies to an epilogue with beta = 0. If you have both alpha and beta, you'll need to set the scale type to cutlass::epilogue::thread::ScaleType::Default, or to cutlass::epilogue::thread::ScaleType::NoBetaScaling since your beta = 1.
See https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/epilogue/thread/scale_type.h#L44
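For illustration, here is a minimal sketch of selecting the scale type when defining the epilogue. The element types and vector width are placeholder assumptions, not values taken from the example:

```cpp
#include "cutlass/numeric_types.h"
#include "cutlass/epilogue/thread/linear_combination.h"

// ScaleType is the 5th template parameter of LinearCombination.
// NoBetaScaling computes D = alpha * accumulator + C, i.e. beta fixed at 1.
using EpilogueOutputOp0 = cutlass::epilogue::thread::LinearCombination<
    cutlass::half_t,                                     // output element type
    128 / cutlass::sizeof_bits<cutlass::half_t>::value,  // vector width per access
    float,                                               // accumulator type
    float,                                               // alpha/beta compute type
    cutlass::epilogue::thread::ScaleType::NoBetaScaling>;
```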
> And do we have some method to support the case where C is a vector?
Yes, C can be treated as a vector if its stride is set to 0. You can check this example: https://github.com/NVIDIA/cutlass/blob/master/examples/17_fprop_per_channel_bias/fprop_per_channel_bias.cu#L188
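For illustration, a minimal sketch of the stride-0 trick, assuming a row-major C and a hypothetical device pointer d_bias to an N-element fp16 vector:

```cpp
#include "cutlass/numeric_types.h"
#include "cutlass/tensor_ref.h"
#include "cutlass/layout/matrix.h"

// With leading dimension 0, every row of C maps to the same N elements,
// so the N-element bias vector is broadcast across all M rows.
cutlass::half_t *d_bias = nullptr; // assume: device pointer to N fp16 values

cutlass::TensorRef<cutlass::half_t, cutlass::layout::RowMajor> ref_C(
    d_bias,
    cutlass::layout::RowMajor(0)); // stride (ldm) = 0
```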
> Yes, cutlass::epilogue::thread::ScaleType::OnlyAlphaScaling only applies to an epilogue with beta = 0. If you have both alpha and beta, you'll need to set the scale type to cutlass::epilogue::thread::ScaleType::Default, or to cutlass::epilogue::thread::ScaleType::NoBetaScaling.
When I set cutlass::epilogue::thread::ScaleType::Default as follows, it triggers this error. I traced it to the following code. Does it mean beta must be 0?
beta0 has to be 0 in the current implementation.
OK. Is beta0 = 0 a hard requirement? I mean, is it a fundamental requirement for fusion, or can beta be set to any value in a future release (2.9)?
It is not going to change in 2.9
Could you please explain the reason?
We haven't implemented it.
Is it possible in principle? Or is it only because it may sacrifice performance in the current RF-resident mode?
It is doable; we just haven't done it yet. Do you want to add an m×n matrix or a per-channel bias vector?
> It is doable; we just haven't done it yet. Do you want to add an m×n matrix or a per-channel bias vector?
Perfect, we need both modes. In fact, from the view of the upper application, the m×n-matrix mode can cover the bias-vector case, because you can broadcast the bias vector along the channel axis to obtain an m×n matrix. So if we support adding an m×n matrix, we naturally support adding a vector. Of course, supporting both modes would be best.
> We haven't implemented it.
To be honest, we hope you can support this soon, because we cannot ignore this bias (the C matrix) in our real use cases.