llm.c
[dev/cuda] Include a matmul_backward_bias kernel based on the PMPP CoarsenedSumReduction kernel in 10.15
This kernel demonstrates how the PMPP (Programming Massively Parallel Processors) material can be applied in practice.
Modification of
The performance of the kernel depends on two constants:
const int coarse_factor
const int block_size_y
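To make the role of these two constants concrete, here is a minimal sketch of a coarsened column-sum reduction for the bias gradient. It is not the kernel from this PR: the names `bias_backward_coarsened`, `max_abs_error`, `COARSE_FACTOR`, `BLOCK_SIZE_X`, and `BLOCK_SIZE_Y`, the chosen values, and the layout (dout stored row-major as (B*T, OC), dbias zeroed before launch) are all assumptions made for illustration. Each thread first sums `COARSE_FACTOR` rows serially, then the `BLOCK_SIZE_Y` threads sharing a column combine their partial sums with a shared-memory tree reduction, in the spirit of the PMPP coarsened reduction.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical constants for this sketch; the PR's values may differ.
constexpr int COARSE_FACTOR = 8;  // rows summed serially by each thread
constexpr int BLOCK_SIZE_X = 32;  // one warp wide, so each row read is coalesced
constexpr int BLOCK_SIZE_Y = 8;   // threads per column; must be a power of two

// dout: (BT, OC) row-major. dbias: (OC), accumulated into with atomicAdd.
__global__ void bias_backward_coarsened(float* dbias, const float* dout, int BT, int OC) {
    __shared__ float partial[BLOCK_SIZE_Y][BLOCK_SIZE_X];
    int col = blockIdx.x * BLOCK_SIZE_X + threadIdx.x;
    int row0 = blockIdx.y * COARSE_FACTOR * BLOCK_SIZE_Y + threadIdx.y;

    // Coarsening step: each thread serially accumulates COARSE_FACTOR rows.
    float sum = 0.0f;
    if (col < OC) {
        for (int i = 0; i < COARSE_FACTOR; i++) {
            int row = row0 + i * BLOCK_SIZE_Y;
            if (row < BT) sum += dout[(size_t)row * OC + col];
        }
    }
    partial[threadIdx.y][threadIdx.x] = sum;

    // Tree reduction across threadIdx.y in shared memory.
    for (int stride = BLOCK_SIZE_Y / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (threadIdx.y < stride)
            partial[threadIdx.y][threadIdx.x] += partial[threadIdx.y + stride][threadIdx.x];
    }
    // One atomic per column per block combines the results of different blocks.
    if (threadIdx.y == 0 && col < OC) atomicAdd(&dbias[col], partial[0][threadIdx.x]);
}

// Runs the kernel on random data and returns the max abs difference to a CPU reference.
float max_abs_error(int BT, int OC) {
    std::vector<float> h_dout((size_t)BT * OC), h_dbias(OC);
    std::vector<double> ref(OC, 0.0);
    srand(0);
    for (size_t i = 0; i < h_dout.size(); i++) h_dout[i] = 2.0f * rand() / RAND_MAX - 1.0f;
    for (int r = 0; r < BT; r++)
        for (int c = 0; c < OC; c++) ref[c] += h_dout[(size_t)r * OC + c];

    float *d_dout, *d_dbias;
    cudaMalloc(&d_dout, h_dout.size() * sizeof(float));
    cudaMalloc(&d_dbias, OC * sizeof(float));
    cudaMemcpy(d_dout, h_dout.data(), h_dout.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_dbias, 0, OC * sizeof(float));

    dim3 block(BLOCK_SIZE_X, BLOCK_SIZE_Y);
    int rows_per_block = COARSE_FACTOR * BLOCK_SIZE_Y;
    dim3 grid((OC + BLOCK_SIZE_X - 1) / BLOCK_SIZE_X, (BT + rows_per_block - 1) / rows_per_block);
    bias_backward_coarsened<<<grid, block>>>(d_dbias, d_dout, BT, OC);
    cudaError_t err = cudaMemcpy(h_dbias.data(), d_dbias, OC * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_dout);
    cudaFree(d_dbias);
    if (err != cudaSuccess) return INFINITY;

    float max_err = 0.0f;
    for (int c = 0; c < OC; c++) max_err = fmaxf(max_err, fabsf(h_dbias[c] - (float)ref[c]));
    return max_err;
}

int main() {
    float err = max_abs_error(1024, 768);  // arbitrary test sizes, not the benchmark's
    printf("max abs error vs CPU reference: %g\n", err);
    return err < 1e-3f ? 0 : 1;
}
```

In this sketch a larger `COARSE_FACTOR` means fewer blocks and fewer atomics but more serial work per thread, while a larger `BLOCK_SIZE_Y` adds more reduction steps and synchronization per block. That trade-off is one plausible reason both constants affect performance and why they would need tuning per GPU.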
With the settings in this PR (not extensively tuned), the performance of kernel 10 on the 3070 is comparable to kernels 7, 8, and the newly added 9 (rebased to master), except at block_size 1024, where kernel 10 is noticeably slower (0.1935 ms):
Using kernel 7
block_size 32 time 0.1416 ms
block_size 64 time 0.1405 ms
block_size 128 time 0.1422 ms
block_size 256 time 0.1405 ms
block_size 512 time 0.1401 ms
block_size 768 time 0.1408 ms
block_size 1024 time 0.1430 ms
Using kernel 8
block_size 32 time 0.1622 ms
block_size 64 time 0.1521 ms
block_size 128 time 0.1539 ms
block_size 256 time 0.1440 ms
block_size 512 time 0.1463 ms
block_size 768 time 0.1318 ms
block_size 1024 time 0.1408 ms
Using kernel 9
block_size 32 time 0.1582 ms
block_size 64 time 0.1453 ms
block_size 128 time 0.1462 ms
block_size 256 time 0.1345 ms
block_size 512 time 0.1344 ms
block_size 768 time 0.1314 ms
block_size 1024 time 0.1414 ms
Using kernel 10
block_size 32 time 0.1386 ms
block_size 64 time 0.1413 ms
block_size 128 time 0.1438 ms
block_size 256 time 0.1399 ms
block_size 512 time 0.1412 ms
block_size 768 time 0.1420 ms
block_size 1024 time 0.1935 ms