Refactor to use FastDivmod for predicated strided dgrad iterators.
On my RTX 3080:

BEFORE

The line:

```cpp
int n = npq_offset / (p_ * q_);
```

translates to the SASS in before_first_line_sass.txt.

The line:

```cpp
int residual = npq_offset % (p_ * q_);
```

translates to the SASS in before_second_line_sass.txt.

(I'll omit the assembly for the other two lines for brevity, for now.)
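(For reference, the full index decomposition presumably has this shape; the last two lines are inferred from the `divmod_two(p, q, residual)` call below and are not quoted from the source.)

```cpp
// BEFORE (sketch): four integer divide/modulo ops per offset, each on a
// divisor known only at runtime, so the compiler must emit full divisions.
int n        = npq_offset / (p_ * q_);  // which image in the batch
int residual = npq_offset % (p_ * q_);  // offset within the image
int p        = residual / q_;           // row    -- inferred, see note above
int q        = residual % q_;           // column -- inferred, see note above
```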
AFTER

This code:

```cpp
params_.divmod(n, residual, npq_offset);
params_.divmod_two(p, q, residual);
```

leads to:

```
2651 0000000f 00c699a0 ISETP.NE.AND P4, PT, R9, 0x1, PT 133 0 0
2720 0000000f 00c69df0 ISETP.NE.AND P0, PT, R42, 0x1, PT 149 0 0
2721 0000000f 00c69e00 IMAD.MOV.U32 R40, RZ, RZ, R11 150 0 0
2722 0000000f 00c69e10 @P0 IMAD.HI.U32 R2, R40, R2, RZ 149 0 0
2723 0000000f 00c69e20 MOV R11, R7 150 0 0
2724 0000000f 00c69e30 IMAD.MOV.U32 R7, RZ, RZ, R0 150 0 0
2725 0000000f 00c69e40 IMAD.MOV.U32 R0, RZ, RZ, R40 150 0 0
2726 0000000f 00c69e50 @P0 SHF.R.U32.HI R0, RZ, R43, R2 150 0 0
```

(The full SASS listing is in the attached assembly file.)
The last three columns are Live Registers, Warp Stall Sampling, and Instructions Executed.
The FastDivmod objects were constructed like this:

```cpp
params_.divmod = FastDivmod(p_ * q_);
params_.divmod_two = FastDivmod(params_.problem_size.Q);
```
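For readers unfamiliar with the trick: FastDivmod precomputes a magic multiplier and shift on the host, so each device-side division becomes a high 32-bit multiply plus a shift (the `IMAD.HI.U32` + `SHF.R.U32.HI` pair in the SASS above), and the modulo becomes a single fused multiply-add. Below is a minimal self-contained sketch of the idea; the names and structure are illustrative and do not match CUTLASS's exact implementation in `cutlass/fast_math.h`:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Sketch of magic-number division. For a fixed divisor d >= 2, precompute
// m = ceil(2^p / d) with p = 31 + ceil(log2(d)), so that for any
// 0 <= n < 2^31:  n / d == ((uint64_t)n * m) >> p.
struct FastDivmodSketch {
  uint32_t divisor;
  uint32_t multiplier;
  uint32_t shift_right;

  explicit FastDivmodSketch(uint32_t d)
      : divisor(d), multiplier(0), shift_right(0) {
    assert(d != 0);
    if (d != 1) {
      uint32_t log2_ceil = 0;
      while ((uint64_t(1) << log2_ceil) < d) {
        ++log2_ceil;
      }
      uint32_t p = 31 + log2_ceil;
      multiplier = uint32_t(((uint64_t(1) << p) + d - 1) / d);  // ceil(2^p / d)
      shift_right = p - 32;  // low 32 bits of the shift come from taking the high word
    }
  }

  // Quotient and remainder in one call, mirroring params_.divmod(n, r, x).
  void operator()(uint32_t &quotient, uint32_t &remainder,
                  uint32_t dividend) const {
    // On the GPU the high half of the 32x32 product comes from __umulhi,
    // i.e. the predicated IMAD.HI.U32 in the SASS above.
    quotient = (divisor == 1)
                   ? dividend
                   : uint32_t((uint64_t(dividend) * multiplier) >> 32) >> shift_right;
    remainder = dividend - quotient * divisor;  // one IMAD, no modulo
  }
};

int main() {
  // Spot-check against plain / and % for the p_ * q_ divisor used above.
  FastDivmodSketch divmod(28 * 28);
  for (uint32_t n = 0; n < (1u << 22); ++n) {
    uint32_t q, r;
    divmod(q, r, n);
    assert(q == n / (28u * 28u) && r == n % (28u * 28u));
  }
  std::printf("ok\n");
  return 0;
}
```

The divisors (`p_ * q_` and `Q`) are fixed for the whole problem, which is what makes hoisting this precomputation into `Params` profitable: the expensive part runs once on the host instead of per thread per iteration.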
All tests pass. @hwu36
Here are the benchmarks from cutlass_profiler, from running:

```
./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_f16_s16816dgrad_optimized_f16_* --n=34 --h=28 --w=28 --c=512 --k=1024 --r=1 --s=1 --pad_h=0 --pad_w=0 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1 --output=load_store_k1024.csv
```
GFLOPS (higher is better; normal = baseline, Load_Store = FastDivmod in both the load and store iterators, Load / Store = FastDivmod in that iterator only):

Operation,normal,Load_Store,Load,Store
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x128_32x3_nhwc_align8,36093.7,35460.4,34779.9,35521.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x128_32x3_nhwc_align4,38092.7,35749.1,33229.5,33216.6
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x128_32x3_nhwc_align2,37974.5,28202.3,26924.2,38842.2
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x256_32x3_nhwc_align8,46247.2,45844.8,46530.2,46529.9
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x256_32x3_nhwc_align4,45534,44948.4,45967.9,46057.5
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x256_32x3_nhwc_align2,42966.6,41820.4,41601.9,43779.3
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x64_32x3_nhwc_align8,32767,27551.1,31058.6,31742.2
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x64_32x3_nhwc_align4,27299.7,20540.9,24321.3,26288.8
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x64_32x3_nhwc_align2,3102.06,3124.94,3107.72,2785.73
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x64_32x4_nhwc_align8,32597,26568.1,29991.2,30956.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x64_32x4_nhwc_align4,21633.2,18499.1,21158.8,21662
cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x64_32x4_nhwc_align2,3086.17,3099.21,3102.85,2779.92
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x256_32x4_nhwc_align8,46073,44842.5,44539.3,44795.6
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x256_32x4_nhwc_align4,44521.9,43695.6,43267.6,43731.8
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x256_32x4_nhwc_align2,35434.7,33203.7,33205.3,35853.8
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x3_nhwc_align8,43928.1,43993,43341.7,44614.7
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x3_nhwc_align4,40625,39896.3,39928.8,40649.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x3_nhwc_align2,37330.9,29725.7,30495.9,29953.2
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x4_nhwc_align8,40452.3,44186,40779.8,44001.5
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x4_nhwc_align4,36466.5,38333.6,37859,37978.7
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x4_nhwc_align2,28703.2,23201.9,28286,23159.5
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x5_nhwc_align8,44606,43841,43317.1,43667.7
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x5_nhwc_align4,38175.5,37856.9,37391.8,38196.8
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_32x5_nhwc_align2,25967.9,23727.5,26534.2,27765.5
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x64_32x6_nhwc_align8,32081.1,30121.8,29639.6,28661.2
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x64_32x6_nhwc_align4,30862.7,28736.9,27380.5,28620.5
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x64_32x6_nhwc_align2,25429.1,26662.2,23412,26904.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x128_32x6_nhwc_align8,43407.6,38710.5,37886.3,37235.5
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x128_32x6_nhwc_align4,40653.9,36616.6,35717.4,36235
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x128_32x6_nhwc_align2,38451.9,34824.1,33954.2,34735.8
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x64_32x10_nhwc_align8,27577.8,23630.1,23171.3,23665.9
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x64_32x10_nhwc_align4,25456.5,22367.1,21874.3,21462.4
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x64_32x10_nhwc_align2,23168.2,20486.2,20086.8,19702
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_64x3_nhwc_align8,44608.9,39704.4,43358.2,43835.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_64x3_nhwc_align4,37134.6,25871.8,33237.6,32667
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x128_64x3_nhwc_align2,5909.46,5473.34,5598.99,5362.67
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x64_64x3_nhwc_align8,32847.1,30383.6,29649.8,30323.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x64_64x3_nhwc_align4,30918.4,28999.7,27692.6,28686.4
cutlass_tensorop_f16_s16816dgrad_optimized_f16_128x64_64x3_nhwc_align2,7214.04,6697.29,7109.96,6556.47
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x128_64x3_nhwc_align8,43945,39028,38636.2,39030.1
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x128_64x3_nhwc_align4,40536,36048.2,34889.7,37002.2
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x128_64x3_nhwc_align2,33131.5,32488.8,28507.3,32475
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x64_64x5_nhwc_align8,28371.5,22888.7,23848.9,23171.3
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x64_64x5_nhwc_align4,27334.6,23543.2,23209.8,23260.3
cutlass_tensorop_f16_s16816dgrad_optimized_f16_64x64_64x5_nhwc_align2,22061.7,21158.2,19387,21181.7
Attached CSVs: load_store_k1024.conv2d.csv, loadk1024.conv2d.csv, normal_k1024.conv2d.csv, store_k1024.conv2d.csv, the_four.csv
@manishucsd @hwu36
Not seeing benefits from this one either. Ran:

```
./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_h16816dgrad_optimized_* --n=34 --h=28 --w=28 --c=512 --k=1024 --r=1 --s=1 --pad_h=0 --pad_w=0 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1
```
@ZelboK, you can compile and run only the align8 kernels for this shape. Use the string "cutlass_tensorop_h16816dgrad_optimized*align8" for cmake and when running the cutlass_profiler.
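A sketch of that workflow, assuming an Ampere (SM86) build; `CUTLASS_LIBRARY_KERNELS` is the standard CMake option for filtering which kernels get instantiated:

```
cmake .. -DCUTLASS_NVCC_ARCHS=86 \
         -DCUTLASS_LIBRARY_KERNELS="cutlass_tensorop_h16816dgrad_optimized*align8"
make cutlass_profiler -j
./tools/profiler/cutlass_profiler \
    --kernels="cutlass_tensorop_h16816dgrad_optimized*align8" \
    --n=34 --h=28 --w=28 --c=512 --k=1024 --r=1 --s=1 \
    --pad_h=0 --pad_w=0 --stride_h=2 --stride_w=2 --dilation_h=1 --dilation_w=1
```

Restricting the build this way also cuts compile time considerably, since only the matching kernels are generated.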
Are the results in comparison_hgrad.csv with fast_divmod in both loads and stores?
@manishucsd Sorry, that file isn't complete; please ignore it. I'll paste the complete one here (also run with align8 only). This one has the load, store, load+store, and normal GFLOPS benchmarks. I'm using a 3080. Could we test this on an A100? I don't have access to one; I'm hoping the pipeline does? hgrad.csv
Thanks @ZelboK for the work on this and the analysis. The hgrad.csv presents one problem size run with different tile configurations. Looking at the data in hgrad.csv, the FastDivMod refactoring in both load and store gives a significant speedup for the fastest tile.

@hwu36, are you profiling this further with more problem sizes on A100 and potentially merging it?
This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.
This PR has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.
I tried on an A100 and observed a small regression in perf.