llm.c

[dev/cuda] Include a matmul_backward_bias kernel based on PMPP CoarsenedSumReduction kernel in 10.15

Open lancerts opened this issue 9 months ago • 5 comments

This kernel could serve as a demonstration of applying the PMPP material in practice. It is a modification of the CoarsenedSumReduction kernel in PMPP 10.15.

The performance of the kernel depends on two parameters:

    const int coarse_factor 
    const int block_size_y 
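
As a rough illustration of how these two parameters enter such a kernel, here is a minimal CUDA sketch in the spirit of a coarsened sum reduction. It is not the code from the PR: the names `bias_backward_coarsened`, `bias_backward_gpu`, `BLOCK_SIZE_X`, and the values 8 and 4 are assumptions chosen for illustration. It also assumes `dout` is an `(N, OC)` row-major matrix with `N = B*T`, and it overwrites `dbias` instead of accumulating into an existing gradient.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cmath>
#include <vector>

constexpr int BLOCK_SIZE_X = 32;  // assumption: one warp wide, so a warp reads 32 adjacent columns

// dbias[col] += sum over rows of dout[row, col], for an (N, OC) row-major dout.
// Each block covers BLOCK_SIZE_X columns and BLOCK_SIZE_Y * COARSE_FACTOR rows.
template <int BLOCK_SIZE_Y, int COARSE_FACTOR>
__global__ void bias_backward_coarsened(float* dbias, const float* dout, int N, int OC) {
    static_assert((BLOCK_SIZE_Y & (BLOCK_SIZE_Y - 1)) == 0, "BLOCK_SIZE_Y must be a power of two");
    __shared__ float smem[BLOCK_SIZE_Y][BLOCK_SIZE_X];

    const int col = blockIdx.x * BLOCK_SIZE_X + threadIdx.x;
    const int row0 = blockIdx.y * BLOCK_SIZE_Y * COARSE_FACTOR + threadIdx.y;

    // Coarsening: each thread serially sums up to COARSE_FACTOR rows of its column.
    float sum = 0.0f;
    if (col < OC) {
        for (int i = 0; i < COARSE_FACTOR; i++) {
            const int row = row0 + i * BLOCK_SIZE_Y;
            if (row < N) sum += dout[(size_t)row * OC + col];
        }
    }
    smem[threadIdx.y][threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction across threadIdx.y in shared memory.
    for (int stride = BLOCK_SIZE_Y / 2; stride > 0; stride /= 2) {
        if (threadIdx.y < stride) {
            smem[threadIdx.y][threadIdx.x] += smem[threadIdx.y + stride][threadIdx.x];
        }
        __syncthreads();
    }

    // One atomic per column per block combines the partial sums of all row segments.
    if (threadIdx.y == 0 && col < OC) atomicAdd(&dbias[col], smem[0][threadIdx.x]);
}

// Hypothetical host helper: copies dout to the device, runs the kernel, returns dbias.
std::vector<float> bias_backward_gpu(const std::vector<float>& dout, int N, int OC) {
    constexpr int block_size_y = 8;   // assumption, not the value used in the PR
    constexpr int coarse_factor = 4;  // assumption, not the value used in the PR
    float *d_dbias = nullptr, *d_dout = nullptr;
    cudaMalloc(&d_dbias, OC * sizeof(float));
    cudaMalloc(&d_dout, (size_t)N * OC * sizeof(float));
    cudaMemcpy(d_dout, dout.data(), (size_t)N * OC * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_dbias, 0, OC * sizeof(float));  // the kernel accumulates with atomicAdd

    const int rows_per_block = block_size_y * coarse_factor;
    dim3 block(BLOCK_SIZE_X, block_size_y);
    dim3 grid((OC + BLOCK_SIZE_X - 1) / BLOCK_SIZE_X, (N + rows_per_block - 1) / rows_per_block);
    bias_backward_coarsened<block_size_y, coarse_factor><<<grid, block>>>(d_dbias, d_dout, N, OC);

    std::vector<float> dbias(OC);
    cudaMemcpy(dbias.data(), d_dbias, OC * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_dbias);
    cudaFree(d_dout);
    return dbias;
}
```

In this sketch a larger `coarse_factor` means more serial work per thread and fewer blocks issuing atomics, while `block_size_y` sets the depth of the shared-memory tree reduction, which is one plausible reason the two need to be tuned together.
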

With the settings in the PR (no extensive tuning was performed), on the 3070, kernel 10 performs comparably to kernels 7, 8, and the newly added kernel 9 (rebased to master).


Using kernel 7

block_size 32 time 0.1416 ms
block_size 64 time 0.1405 ms
block_size 128 time 0.1422 ms
block_size 256 time 0.1405 ms
block_size 512 time 0.1401 ms
block_size 768 time 0.1408 ms
block_size 1024 time 0.1430 ms


Using kernel 8

block_size 32 time 0.1622 ms
block_size 64 time 0.1521 ms
block_size 128 time 0.1539 ms
block_size 256 time 0.1440 ms
block_size 512 time 0.1463 ms
block_size 768 time 0.1318 ms
block_size 1024 time 0.1408 ms

Using kernel 9

block_size 32 time 0.1582 ms
block_size 64 time 0.1453 ms
block_size 128 time 0.1462 ms
block_size 256 time 0.1345 ms
block_size 512 time 0.1344 ms
block_size 768 time 0.1314 ms
block_size 1024 time 0.1414 ms

Using kernel 10

block_size 32 time 0.1386 ms
block_size 64 time 0.1413 ms
block_size 128 time 0.1438 ms
block_size 256 time 0.1399 ms
block_size 512 time 0.1412 ms
block_size 768 time 0.1420 ms
block_size 1024 time 0.1935 ms

lancerts, May 16 '24 05:05