Micro optimization for `softmax_forward_kernel5`
This branch includes a micro-optimization for `softmax_forward_kernel5`.
Summary
- ~~use `warpReduceMax` in `attention_forward.cu` to use `__shfl_down_sync` to be consistent with the other kernels (reduce to all threads in a warp)~~
- micro-optimization for `softmax_forward_kernel5`
- Result from `ncu ./profile_gpt2cu`: the modified kernel improves on the original code in all three metrics below (left: original code, right: modified code):
  - Duration: 1.47 ms -> 1.38 ms (about 6% shorter)
  - Compute (SM) [%]: 77.11% -> 78.68%
  - DRAM Throughput [%]: 45.03% -> 47.91%
- Tests done:
  - `./profile_gpt2cu`
  - `./attention_forward 4`
  - `./attention_forward 5`
Output from modified code
- NCU log using A30
make profile_gpt2cu NO_MULTI_GPU=1
ncu ./profile_gpt2cu
softmax_forward_kernel5(__nv_bfloat16 *, float, const __nv_bfloat16 *, int, int), 2024-Sep-20 01:45:01, Context 1, Stream 16
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/nsecond 1.21
SM Frequency cycle/usecond 929.76
Elapsed Cycles cycle 1283575
Memory [%] % 54.15
DRAM Throughput % 47.91
Duration msecond 1.38
L1/TEX Cache Throughput % 54.50
L2 Cache Throughput % 51.48
SM Active Cycles cycle 1275362.68
Compute (SM) [%] % 78.68
---------------------------------------------------------------------- --------------- ------------------------------
Output from the original code
- NCU log using A30
make profile_gpt2cu NO_MULTI_GPU=1
ncu ./profile_gpt2cu
softmax_forward_kernel5(__nv_bfloat16 *, float, const __nv_bfloat16 *, int, int), 2024-Sep-20 01:49:03, Context 1, Stream 16
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/nsecond 1.21
SM Frequency cycle/usecond 928.26
Elapsed Cycles cycle 1366538
Memory [%] % 45.03
DRAM Throughput % 45.03
Duration msecond 1.47
L1/TEX Cache Throughput % 33.10
L2 Cache Throughput % 48.18
SM Active Cycles cycle 1358789.59
Compute (SM) [%] % 77.11
---------------------------------------------------------------------- --------------- ------------------------------
Output from `./attention_forward`
nvcc -O3 --use_fast_math -lcublas -lcublasLt attention_forward.cu -o attention_forward
- testing softmax_forward_kernel4
# ./attention_forward 4
enable_tf32: 1
Using kernel 4
Checking block size 32.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 64.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 128.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 256.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
Checking block size 512.
-0.529510 -0.529297
0.889394 0.889160
0.881674 0.881836
0.651789 0.651855
-0.483486 -0.483398
1.000000 1.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
0.000000 0.000000
All results match. Starting benchmarks.
block_size 32 | time 2.794404 ms
block_size 64 | time 2.136679 ms
block_size 128 | time 2.125906 ms
block_size 256 | time 2.128598 ms
block_size 512 | time 2.151445 ms
- testing softmax_forward_kernel5
# ./attention_forward 5
enable_tf32: 1
Using kernel 5
Checking block size 32.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 64.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 128.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 256.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
Checking block size 512.
-0.529510 -0.531250
0.889394 0.890625
0.881674 0.882812
0.651789 0.652344
-0.483486 -0.484375
All results match. Starting benchmarks.
block_size 32 | time 2.016379 ms
block_size 64 | time 1.455155 ms
block_size 128 | time 1.452482 ms
block_size 256 | time 1.450271 ms
block_size 512 | time 1.454224 ms