[QST] Strided dgrad conv epilogue does not use fast divmod
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator_strided_dgrad.h#L315-L318
This piece of code can be replaced by using fast divmod. The same can be applied to the store function below.
@hwu36
@manishucsd
example code is https://github.com/NVIDIA/cutlass/pull/1437/files#diff-3e442e06009337880fcf6b4f9286783994269e20d10392afa4a1fa981626376dR388-R390
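For illustration, a minimal standalone sketch of the suggested replacement, assuming cutlass::FastDivmod from include/cutlass/fast_math.h and its operator()(quotient, remainder, dividend) overload; the P/Q values and variable names below are just examples, not the iterator's actual members:

```cpp
// Standalone host-side sketch comparing raw div/mod with cutlass::FastDivmod.
// The multiplier/shift are computed once in the FastDivmod constructor; the
// per-element divmod then avoids the hardware integer divide.
#include <cstdio>
#include "cutlass/fast_math.h"

int main() {
  int const P = 28, Q = 28;          // example output height/width
  int const npq_offset = 12345;      // example linear offset

  // Reference: what predicated_tile_iterator_strided_dgrad.h does today.
  int n_ref = npq_offset / (P * Q);
  int residual_ref = npq_offset % (P * Q);
  int p_ref = residual_ref / Q;
  int q_ref = residual_ref % Q;

  // FastDivmod replacement (names divmod_pq / divmod_q are illustrative).
  cutlass::FastDivmod divmod_pq(P * Q);
  cutlass::FastDivmod divmod_q(Q);

  int n, residual, p, q;
  divmod_pq(n, residual, npq_offset);   // n = npq_offset / (P*Q), residual = remainder
  divmod_q(p, q, residual);             // p = residual / Q, q = residual % Q

  printf("ref: n=%d p=%d q=%d  fast: n=%d p=%d q=%d\n",
         n_ref, p_ref, q_ref, n, p, q);
  return 0;
}
```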
https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md https://github.com/NVIDIA/cutlass/blob/main/media/docs/implicit_gemm_convolution.md
Also, our GTC CUTLASS talks covered the basic concepts, conv, and strided dgrad in detail.
Yes, you can replace these lines with fast_divmod. I remember making this change and, surprisingly, the performance dropped by a couple of TFLOP/s. Feel free to make the change and report before/after performance numbers.
Direct link to strided dgrad GTC talk.
> Yes, you can replace these lines with fast_divmod.
Hi, the p_ and q_ are computed from start_r and start_s (start_r -> start_h_ -> p_), while start_r and start_s are decided by the thread block index, which is dynamic.
Do you mean constructing the fast_divmod object directly on the device? If so, I think we may have to do the raw divmod here. An int32 divmod only takes about 20 instructions, which is much cheaper than calculating the multiplier and shift of fast_divmod.
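For what it's worth, even if the divisor only becomes known on the device (since p_ and q_ depend on start_r/start_s), the FastDivmod could be constructed once in the iterator's constructor and then reused for every row the iterator touches, so the multiplier/shift setup is amortized while the raw divmod in the load/store path is paid on every access. A rough sketch with hypothetical names, not the actual iterator code:

```cpp
// Hypothetical sketch: build the divmod objects once per iterator
// (i.e. once per threadblock), then reuse them in every load/store iteration.
#include "cutlass/fast_math.h"

struct TileIteratorSketch {
  cutlass::FastDivmod divmod_pq;  // divides by p_ * q_
  cutlass::FastDivmod divmod_q;   // divides by q_

  CUTLASS_DEVICE
  TileIteratorSketch(int p, int q)
    : divmod_pq(p * q), divmod_q(q) { }  // setup cost paid once here

  CUTLASS_DEVICE
  void decompose(int npq_offset, int &n, int &p, int &q) const {
    int residual;
    divmod_pq(n, residual, npq_offset);  // replaces npq_offset / (p_*q_) and %
    divmod_q(p, q, residual);            // replaces residual / q_ and %
  }
};
```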
On my 3080
BEFORE
line:
int n = npq_offset / (p_ * q_);
translates to
before_first_line_sass.txt
line:
int residual = npq_offset % (p_ * q_);
translates to
before_second_line_sass.txt
(I'll omit the assembly for the other two lines for brevity.) AFTER, this code:
params_.divmod(n, residual, npq_offset);
params_.divmod_two(p, q, residual);
leads to
2651 0000000f 00c699a0 ISETP.NE.AND P4, PT, R9, 0x1, PT 133 0 0
2720 0000000f 00c69df0 ISETP.NE.AND P0, PT, R42, 0x1, PT 149 0 0
2721 0000000f 00c69e00 IMAD.MOV.U32 R40, RZ, RZ, R11 150 0 0
2722 0000000f 00c69e10 @P0 IMAD.HI.U32 R2, R40, R2, RZ 149 0 0
2723 0000000f 00c69e20 MOV R11, R7 150 0 0
2724 0000000f 00c69e30 IMAD.MOV.U32 R7, RZ, RZ, R0 150 0 0
2725 0000000f 00c69e40 IMAD.MOV.U32 R0, RZ, RZ, R40 150 0 0
2726 0000000f 00c69e50 @P0 SHF.R.U32.HI R0, RZ, R43, R2 150 0 0
assembly
The last 3 columns are: Live Registers, Warp Stall Sampling, Instructions Executed.
The FastDivmod objects were formed like this:
params_.divmod = FastDivmod(p_*q_);
params_.divmod_two = FastDivmod(params_.problem_size.Q);
All tests pass. @hwu36
Tangential: Does anyone know of a more convenient way to extract the relevant assembly in ncu-ui? I like how it correlates your source code to the assembly, but it doesn't give you an option to extract only those lines... Feels like an easy feature to add.
As @manishucsd stated, I am consistently seeing these changes perform worse.
Edit: If I refactor store_with_byte_offset to also use FastDivmod, the performance improves (GFLOPS):
Original | Load only | Load & store
-- | -- | --
42438.4 | 42347.2 | 44057.6
45593.2 | 40910 | 47485.4
46520.2 | 35192.5 | 45204.9
51508.7 | 50302.6 | 51564
50227 | 49132.7 | 50302.6
46430.3 | 45555.7 | 46096.4
50418 | 47765.6 | 49985.4
37307.6 | 34911.3 | 37978.4
3561.23 | 3234.04 | 3644.71
47676.7 | 47130.3 | 49490
29084.6 | 28995.8 | 29655.5
3403.02 | 3120.96 | 3496.45
52486.1 | 49904.3 | 51803.7
50597.1 | 48242 | 50220.4
36889 | 38074.3 | 39168.1
53879.7 | 56330.6 | 56537.7
49410.5 | 51800.2 | 52172.7
44888.3 | 40922.8 | 46548.4
48512.8 | 47529.4 | 50144.7
43376.9 | 41792.6 | 43077.8
34642.4 | 33419.9 | 34349.9
50375.1 | 48202.7 | 49962.7
41098 | 41014.7 | 42323.9
33170.5 | 33063.6 | 33800.4
44329.3 | 42541.7 | 43949.3
41890.5 | 38951.8 | 41537.5
30802.9 | 29376.8 | 30572.7
45720.9 | 43954.4 | 47196.9
43627.7 | 42055.4 | 45114.7
39782 | 38482.1 | 39485.8
35453.7 | 33411.2 | 37222.9
33715.9 | 30371.1 | 33430.1
26631.1 | 24955.4 | 27061.2
47365.5 | 47304.3 | 49073.2
40151.7 | 36906.7 | 41618.4
6243.1 | 6177.41 | 6145.52
44085.4 | 42379.9 | 45434.6
40475 | 39074.5 | 41674.8
8948.19 | 8967.81 | 8928.76
47597.1 | 44222.3 | 47315.9
43240.1 | 41711 | 44627.8
37502.5 | 36286.8 | 38464.7
37327.7 | 34334.6 | 37970.9
36970.6 | 33005.4 | 35697.2
26558.4 | 25751.8 | 27999.2
From running
./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_f16_s16816dgrad_optimized_f16_* --n=34 --h=28 --w=28 --c=256 --k=256 --r=3 --s=3 --pad_h=1 --pad_w=1 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1
with this CMake configuration (no debug flag):
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS="cutlass_tensorop_f16_s16816dgrad_optimized_f16_*" -DCMAKE_CUDA_FLAGS="-lineinfo"
Thank you @ZelboK.
> If I refactor store_with_byte_offset to also use FastDivmod, the performance improves
We need it for both store and load; store is actually more important.
> On my 3080
FP32 accumulation is throttled, so let us just use FP16 accumulation. The kernel name is cutlass_tensorop_h16816dgrad_optimized_*.
> cmake .. -DCUTLASS_NVCC_ARCHS=80
IIRC, 3080 can use 86 to compile.
What problem size or kernel does each row in your performance table correspond to?
Also, could you please run this problem size:
--n=34 --h=28 --w=28 --c=512 --k=1024 --r=1 --s=1 --pad_h=0 --pad_w=0 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1
Is it also worth having store-only numbers in the above table?
The follow-up needs some exploration on your part. You may not need to do anything, or you may need to make some small changes like the first one. I have not looked into it very deeply myself.
Dgrad can be used as deconv (or transposed conv). @masahi contributed deconv in CUTLASS 2.9 (https://github.com/NVIDIA/cutlass/tree/main/examples/34_transposed_conv2d). We want the output of deconv to not be packed. For example, the output problem size is 34x28x28x256, but the output tensor specified by the user can be 34x32x32x512. The user wants to have some bubbles (e.g. 0s) in the output data. This may already be supported, or maybe not. That is the first thing that needs to be figured out.
We have a similar requirement for the regular fprop conv. https://github.com/NVIDIA/cutlass/pull/1437 was written to meet this request. When we use packed output, fprop conv can reuse the same epilogue as GEMM, which is https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator.h (ignore the scatter and permute code in this file). The main difference between the packed and non-packed code is what you are already familiar with:
Stride tensor_coord = CoordinateDecompositionLittleEndian<kStrideRank>(row_offset + thread_start_row_, params_.divmod);
LongIndex tensor_offset = dot(tensor_coord, params_.tensor_stride);
It decomposes the row number into n, p, q and uses the tensor strides to compute the new row offset.
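A rough, self-contained sketch of that idea (decompose the row index with FastDivmod, then dot with user-supplied row strides); this is illustrative only, not the actual CoordinateDecompositionLittleEndian helper from PR #1437:

```cpp
// Illustrative sketch: decompose a packed row index into (n, p, q) and dot it
// with user-provided (possibly non-packed) strides. Assumes cutlass::FastDivmod;
// function and parameter names are hypothetical.
#include <cstdint>
#include "cutlass/fast_math.h"

// q is the fastest-varying coordinate (little-endian decomposition).
CUTLASS_HOST_DEVICE
void decompose_npq(int row,
                   cutlass::FastDivmod const &divmod_pq,  // divides by P * Q
                   cutlass::FastDivmod const &divmod_q,   // divides by Q
                   int &n, int &p, int &q) {
  int residual;
  divmod_pq(n, residual, row);   // n = row / (P*Q), residual = row % (P*Q)
  divmod_q(p, q, residual);      // p = residual / Q, q = residual % Q
}

// Row offset in the (possibly larger, non-packed) user tensor.
CUTLASS_HOST_DEVICE
int64_t non_packed_row_offset(int n, int p, int q,
                              int64_t stride_n, int64_t stride_p, int64_t stride_q) {
  return n * stride_n + p * stride_p + q * stride_q;
}
```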