[QST] Strided dgrad conv epilogue does not use fast divmod
https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator_strided_dgrad.h#L315-L318
This piece of code can be replaced by using fast divmod. The same can be applied to the store function below.
@hwu36
@manishucsd
example code is https://github.com/NVIDIA/cutlass/pull/1437/files#diff-3e442e06009337880fcf6b4f9286783994269e20d10392afa4a1fa981626376dR388-R390
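For illustration, a minimal standalone sketch of the suggested replacement, assuming cutlass::FastDivmod from include/cutlass/fast_math.h and its operator()(quotient, remainder, dividend) overload; the P/Q values and variable names below are just examples, not the iterator's actual members:

```cpp
// Standalone host-side sketch comparing raw div/mod with cutlass::FastDivmod.
// The multiplier/shift are computed once in the FastDivmod constructor; the
// per-element divmod then avoids the hardware integer divide.
#include <cstdio>
#include "cutlass/fast_math.h"

int main() {
  int const P = 28, Q = 28;          // example output height/width
  int const npq_offset = 12345;      // example linear offset

  // Reference: what predicated_tile_iterator_strided_dgrad.h does today.
  int n_ref = npq_offset / (P * Q);
  int residual_ref = npq_offset % (P * Q);
  int p_ref = residual_ref / Q;
  int q_ref = residual_ref % Q;

  // FastDivmod replacement (names divmod_pq / divmod_q are illustrative).
  cutlass::FastDivmod divmod_pq(P * Q);
  cutlass::FastDivmod divmod_q(Q);

  int n, residual, p, q;
  divmod_pq(n, residual, npq_offset);   // n = npq_offset / (P*Q), residual = remainder
  divmod_q(p, q, residual);             // p = residual / Q, q = residual % Q

  printf("ref: n=%d p=%d q=%d  fast: n=%d p=%d q=%d\n",
         n_ref, p_ref, q_ref, n, p, q);
  return 0;
}
```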
https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md https://github.com/NVIDIA/cutlass/blob/main/media/docs/implicit_gemm_convolution.md
Also, our GTC CUTLASS talks covered the basic concepts, conv, and strided dgrad in detail.
Yes, you can replace these lines with fast_divmod. I remember making this change and, surprisingly, the performance dropped by a couple of TFLOP/s. Feel free to make the change and report before/after performance numbers.
Direct link to strided dgrad GTC talk.
> Yes, you can replace these lines with fast_divmod.
Hi, the p_ and q_ are computed from start_r and start_s (start_r -> start_h_ -> p_), while start_r and start_s are decided by the thread block index, which is dynamic.
Do you mean constructing the fast_divmod object directly on the device? If so, I think we may have to do the raw divmod here. An int32 divmod only takes about 20 instructions, which is much cheaper than calculating the multiplier and shift of fast_divmod.
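For what it's worth, even if the divisor only becomes known on the device (since p_ and q_ depend on start_r/start_s), the FastDivmod could be constructed once in the iterator's constructor and then reused for every row the iterator touches, so the multiplier/shift setup is amortized while the raw divmod in the load/store path is paid on every access. A rough sketch with hypothetical names, not the actual iterator code:

```cpp
// Hypothetical sketch: build the divmod objects once per iterator
// (i.e. once per threadblock), then reuse them in every load/store iteration.
#include "cutlass/fast_math.h"

struct TileIteratorSketch {
  cutlass::FastDivmod divmod_pq;  // divides by p_ * q_
  cutlass::FastDivmod divmod_q;   // divides by q_

  CUTLASS_DEVICE
  TileIteratorSketch(int p, int q)
    : divmod_pq(p * q), divmod_q(q) { }  // setup cost paid once here

  CUTLASS_DEVICE
  void decompose(int npq_offset, int &n, int &p, int &q) const {
    int residual;
    divmod_pq(n, residual, npq_offset);  // replaces npq_offset / (p_*q_) and %
    divmod_q(p, q, residual);            // replaces residual / q_ and %
  }
};
```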
On my 3080
BEFORE
line:
int n = npq_offset / (p_ * q_);
translates to
before_first_line_sass.txt
line:
int residual = npq_offset % (p_ * q_);
translates to
before_second_line_sass.txt
(I'll omit the assembly for the other two lines for brevity.) AFTER, this code:
params_.divmod(n, residual, npq_offset);
params_.divmod_two(p, q, residual);
leads to
2651 0000000f 00c699a0 ISETP.NE.AND P4, PT, R9, 0x1, PT 133 0 0
2720 0000000f 00c69df0 ISETP.NE.AND P0, PT, R42, 0x1, PT 149 0 0
2721 0000000f 00c69e00 IMAD.MOV.U32 R40, RZ, RZ, R11 150 0 0
2722 0000000f 00c69e10 @P0 IMAD.HI.U32 R2, R40, R2, RZ 149 0 0
2723 0000000f 00c69e20 MOV R11, R7 150 0 0
2724 0000000f 00c69e30 IMAD.MOV.U32 R7, RZ, RZ, R0 150 0 0
2725 0000000f 00c69e40 IMAD.MOV.U32 R0, RZ, RZ, R40 150 0 0
2726 0000000f 00c69e50 @P0 SHF.R.U32.HI R0, RZ, R43, R2 150 0 0
assembly
The last 3 columns are: Live Registers, Warp Stall Sampling, Instructions Executed.
The FastDivmod objects were formed like this:
params_.divmod = FastDivmod(p_*q_);
params_.divmod_two = FastDivmod(params_.problem_size.Q);
All tests pass. @hwu36
Tangential: Does anyone know of a more convenient way to extract the relevant assembly in ncu-ui? I like how it correlates your source code to the assembly, but it doesn't give you an option to extract only those lines... Feels like an easy feature to add.
As @manishucsd stated, I am consistently seeing these changes perform worse.
Edit: If I refactor store_with_byte_offset to also use FastDivmod, the performance improves (GFLOPS):
Original | Load only | Load & store
-- | -- | --
42438.4 | 42347.2 | 44057.6
45593.2 | 40910 | 47485.4
46520.2 | 35192.5 | 45204.9
51508.7 | 50302.6 | 51564
50227 | 49132.7 | 50302.6
46430.3 | 45555.7 | 46096.4
50418 | 47765.6 | 49985.4
37307.6 | 34911.3 | 37978.4
3561.23 | 3234.04 | 3644.71
47676.7 | 47130.3 | 49490
29084.6 | 28995.8 | 29655.5
3403.02 | 3120.96 | 3496.45
52486.1 | 49904.3 | 51803.7
50597.1 | 48242 | 50220.4
36889 | 38074.3 | 39168.1
53879.7 | 56330.6 | 56537.7
49410.5 | 51800.2 | 52172.7
44888.3 | 40922.8 | 46548.4
48512.8 | 47529.4 | 50144.7
43376.9 | 41792.6 | 43077.8
34642.4 | 33419.9 | 34349.9
50375.1 | 48202.7 | 49962.7
41098 | 41014.7 | 42323.9
33170.5 | 33063.6 | 33800.4
44329.3 | 42541.7 | 43949.3
41890.5 | 38951.8 | 41537.5
30802.9 | 29376.8 | 30572.7
45720.9 | 43954.4 | 47196.9
43627.7 | 42055.4 | 45114.7
39782 | 38482.1 | 39485.8
35453.7 | 33411.2 | 37222.9
33715.9 | 30371.1 | 33430.1
26631.1 | 24955.4 | 27061.2
47365.5 | 47304.3 | 49073.2
40151.7 | 36906.7 | 41618.4
6243.1 | 6177.41 | 6145.52
44085.4 | 42379.9 | 45434.6
40475 | 39074.5 | 41674.8
8948.19 | 8967.81 | 8928.76
47597.1 | 44222.3 | 47315.9
43240.1 | 41711 | 44627.8
37502.5 | 36286.8 | 38464.7
37327.7 | 34334.6 | 37970.9
36970.6 | 33005.4 | 35697.2
26558.4 | 25751.8 | 27999.2
From running
./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_f16_s16816dgrad_optimized_f16_* --n=34 --h=28 --w=28 --c=256 --k=256 --r=3 --s=3 --pad_h=1 --pad_w=1 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1
with this CMake configuration (no debug flag):
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS="cutlass_tensorop_f16_s16816dgrad_optimized_f16_*" -DCMAKE_CUDA_FLAGS="-lineinfo"
Thank you @ZelboK.
> If I refactor store_with_byte_offset to also use FastDivmod, the performance improves
We need it for both store and load; store is actually more important.
> On my 3080
FP32 accumulation is throttled, so let us just use FP16 accumulation. The kernel name is cutlass_tensorop_h16816dgrad_optimized_*.
> cmake .. -DCUTLASS_NVCC_ARCHS=80
IIRC, 3080 can use 86 to compile.
What problem size or kernel does each row in your performance table correspond to?
Also, could you please run this problem size:
--n=34 --h=28 --w=28 --c=512 --k=1024 --r=1 --s=1 --pad_h=0 --pad_w=0 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1
Is it also worth having store-only numbers in the above table?
The follow-up needs some exploration on your part. You may not need to do anything, or you may need to make some small changes like the first one. I have not looked into it very deeply myself.
Dgrad can be used as deconv (or transposed conv). @masahi contributed deconv in CUTLASS 2.9 (https://github.com/NVIDIA/cutlass/tree/main/examples/34_transposed_conv2d). We want the output of deconv to not be packed. For example, the output problem size is 34x28x28x256, but the output tensor specified by the user can be 34x32x32x512. The user wants to have some bubbles (e.g. 0s) in the output data. This may already be supported, or maybe not. That is the first thing that needs to be figured out.
We have a similar requirement for the regular fprop conv. https://github.com/NVIDIA/cutlass/pull/1437 was written to meet this request. When we use packed output, fprop conv can reuse the same epilogue as GEMM, which is https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator.h (ignore the scatter and permute code in this file). The main difference between the packed and non-packed code is what you are already familiar with:
Stride tensor_coord = CoordinateDecompositionLittleEndian<kStrideRank>(row_offset + thread_start_row_, params_.divmod);
LongIndex tensor_offset = dot(tensor_coord, params_.tensor_stride);
It decomposes the row number into n, p, q and uses the tensor strides to compute the new row offset.
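A rough, self-contained sketch of that idea (decompose the row index with FastDivmod, then dot with user-supplied row strides); this is illustrative only, not the actual CoordinateDecompositionLittleEndian helper from PR #1437:

```cpp
// Illustrative sketch: decompose a packed row index into (n, p, q) and dot it
// with user-provided (possibly non-packed) strides. Assumes cutlass::FastDivmod;
// function and parameter names are hypothetical.
#include <cstdint>
#include "cutlass/fast_math.h"

// q is the fastest-varying coordinate (little-endian decomposition).
CUTLASS_HOST_DEVICE
void decompose_npq(int row,
                   cutlass::FastDivmod const &divmod_pq,  // divides by P * Q
                   cutlass::FastDivmod const &divmod_q,   // divides by Q
                   int &n, int &p, int &q) {
  int residual;
  divmod_pq(n, residual, row);   // n = row / (P*Q), residual = row % (P*Q)
  divmod_q(p, q, residual);      // p = residual / Q, q = residual % Q
}

// Row offset in the (possibly larger, non-packed) user tensor.
CUTLASS_HOST_DEVICE
int64_t non_packed_row_offset(int n, int p, int q,
                              int64_t stride_n, int64_t stride_p, int64_t stride_q) {
  return n * stride_n + p * stride_p + q * stride_q;
}
```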