llm.c

Include a factor in add bias kernel of matmul for perf tuning

Open lancerts opened this issue 10 months ago • 0 comments

A larger thread_reuse_factor launches fewer threads and gives each thread more work. The best value depends on B * T * OC and on the GPU, so it is a tunable parameter.

TL;DR: for the grid-stride loop, we do not necessarily need to use int grid_size = ceil_div(OC * B * T, block_size); a smaller grid, with fewer threads overall, may improve performance.
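To make the idea concrete, here is a minimal sketch in plain C that emulates the grid-stride loop on the CPU. The issue does not show the modified launch code, so the way thread_reuse_factor enters the grid size is an assumption, and the helper names grid_size_for and add_bias_emulated are hypothetical. The point it illustrates is that the loop stays correct for any grid size, which is what makes the factor safe to tune.

```c
#include <stddef.h>

// Integer ceiling division, playing the same role as ceil_div in the text above.
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Hypothetical helper (not from the issue): pick a grid so that each thread
// handles roughly thread_reuse_factor elements instead of one.
// With thread_reuse_factor == 1 this reduces to ceil_div(n, block_size).
static int grid_size_for(int n, int block_size, int thread_reuse_factor) {
    return ceil_div(n, block_size * thread_reuse_factor);
}

// CPU emulation of a grid-stride add-bias loop (assumed kernel shape).
// out holds n = B * T * OC floats in row-major order, bias holds OC floats.
// The two outer loops stand in for blockIdx.x and threadIdx.x.
static void add_bias_emulated(float *out, const float *bias, int n, int OC,
                              int grid_size, int block_size) {
    int stride = grid_size * block_size;  // total number of "threads" launched
    for (int block = 0; block < grid_size; block++) {
        for (int thread = 0; thread < block_size; thread++) {
            // Each thread starts at its global index and strides over the
            // array, so every element is visited exactly once for any grid size.
            for (int i = block * block_size + thread; i < n; i += stride) {
                out[i] += bias[i % OC];
            }
        }
    }
}
```

For example, with n = 1024 and block_size = 256, grid_size_for returns 4 blocks at thread_reuse_factor = 1 and a single block at thread_reuse_factor = 16. In both cases the stride loop covers all 1024 elements.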

3070 results:

  • Kernel 2, thread_reuse_factor = 1 (baseline)
      sqrt_block_size 4 | time 4.6463 ms | tflops 8.32
      sqrt_block_size 8 | time 3.0348 ms | tflops 12.74
      sqrt_block_size 16 | time 3.0716 ms | tflops 12.58
      sqrt_block_size 32 | time 3.1900 ms | tflops 12.12

  • Kernel 2, thread_reuse_factor = 16
      sqrt_block_size 4 | time 3.5142 ms | tflops 11.00 ---> non-trivial improvement
      sqrt_block_size 8 | time 2.9833 ms | tflops 12.96 ---> marginal improvement
      sqrt_block_size 16 | time 3.0132 ms | tflops 12.83 ---> marginal improvement
      sqrt_block_size 32 | time 3.0475 ms | tflops 12.68 ---> marginal improvement

  • Kernel 2, thread_reuse_factor = 32
      sqrt_block_size 4 | time 3.5902 ms | tflops 10.77
      sqrt_block_size 8 | time 3.0112 ms | tflops 12.84
      sqrt_block_size 16 | time 3.0086 ms | tflops 12.85
      sqrt_block_size 32 | time 3.0682 ms | tflops 12.60

  • Kernel 2, thread_reuse_factor = 256
      sqrt_block_size 4 | time 3.5121 ms | tflops 11.01
      sqrt_block_size 8 | time 3.0439 ms | tflops 12.70
      sqrt_block_size 16 | time 3.0268 ms | tflops 12.77
      sqrt_block_size 32 | time 3.0893 ms | tflops 12.51

  • Kernel 3 is still faster.
      sqrt_block_size 4 | time 3.3575 ms | tflops 11.51
      sqrt_block_size 8 | time 2.5413 ms | tflops 15.21
      sqrt_block_size 16 | time 2.5077 ms | tflops 15.41
      sqrt_block_size 32 | time 2.5044 ms | tflops 15.43

lancerts • Apr 15 '24 14:04