llm.c
Include a factor in the add-bias kernel of matmul for perf tuning
A larger thread_reuse_factor reduces the number of threads launched while increasing the amount of work each thread does. It is a tunable parameter: the best value depends on B * T * OC and on the GPU card.

TL;DR: for the grid-stride loop we do not necessarily need to launch with int grid_size = ceil_div(OC * B * T, block_size): a smaller grid, with fewer threads overall, can improve performance.
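
For reference, a minimal sketch of the idea, assuming a grid-stride add-bias kernel in the style of kernel 2 and a hypothetical thread_reuse_factor launch parameter (names are illustrative, not the exact llm.c code):

```cuda
#include <cuda_runtime.h>

// Grid-stride add-bias: out is (B*T, OC) row-major, bias is (OC,).
// Correctness does not depend on the grid size, only on covering all
// B*T*OC elements, so the grid can be shrunk freely for tuning.
__global__ void add_bias_kernel(float* out, const float* bias, int B, int T, int OC) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < B * T * OC; i += stride) {
        out[i] += bias[i % OC];
    }
}

inline int ceil_div(int a, int b) { return (a + b - 1) / b; }

void add_bias(float* out, const float* bias, int B, int T, int OC,
              int block_size, int thread_reuse_factor) {
    // baseline would be ceil_div(B * T * OC, block_size); dividing by the
    // (hypothetical) reuse factor launches fewer threads, each handling
    // roughly thread_reuse_factor elements
    int grid_size = ceil_div(B * T * OC, block_size * thread_reuse_factor);
    if (grid_size < 1) { grid_size = 1; }
    add_bias_kernel<<<grid_size, block_size>>>(out, bias, B, T, OC);
}
```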
RTX 3070 results:
- Kernel 2, thread_reuse_factor = 1 (baseline)

  sqrt_block_size 4  | time 4.6463 ms | tflops 8.32
  sqrt_block_size 8  | time 3.0348 ms | tflops 12.74
  sqrt_block_size 16 | time 3.0716 ms | tflops 12.58
  sqrt_block_size 32 | time 3.1900 ms | tflops 12.12

- Kernel 2, thread_reuse_factor = 16

  sqrt_block_size 4  | time 3.5142 ms | tflops 11.00  ---> non-trivial improvement
  sqrt_block_size 8  | time 2.9833 ms | tflops 12.96  ---> marginal improvement
  sqrt_block_size 16 | time 3.0132 ms | tflops 12.83  ---> marginal improvement
  sqrt_block_size 32 | time 3.0475 ms | tflops 12.68  ---> marginal improvement

- Kernel 2, thread_reuse_factor = 32

  sqrt_block_size 4  | time 3.5902 ms | tflops 10.77
  sqrt_block_size 8  | time 3.0112 ms | tflops 12.84
  sqrt_block_size 16 | time 3.0086 ms | tflops 12.85
  sqrt_block_size 32 | time 3.0682 ms | tflops 12.60

- Kernel 2, thread_reuse_factor = 256

  sqrt_block_size 4  | time 3.5121 ms | tflops 11.01
  sqrt_block_size 8  | time 3.0439 ms | tflops 12.70
  sqrt_block_size 16 | time 3.0268 ms | tflops 12.77
  sqrt_block_size 32 | time 3.0893 ms | tflops 12.51

- Kernel 3 is still faster:

  sqrt_block_size 4  | time 3.3575 ms | tflops 11.51
  sqrt_block_size 8  | time 2.5413 ms | tflops 15.21
  sqrt_block_size 16 | time 2.5077 ms | tflops 15.41
  sqrt_block_size 32 | time 2.5044 ms | tflops 15.43