llm.c
Fix the unsupported block_size in matmul_backward_bias kernel 1
Because of the reduction at https://github.com/karpathy/llm.c/blob/master/dev/cuda/matmul_backward_bias.cu#L67, the block size must be a power of 2 for kernel 1. Otherwise the GPU result is wrong:
Using kernel 1
Checking correctness...
1.125032 1.054688
26.328930 26.375000
11.523789 11.437500
17.523972 17.500000
16.781761 16.750000
All results match for block_size=32.
Checking correctness...
1.125032 1.054688
26.328930 26.375000
11.523789 11.437500
17.523972 17.500000
16.781761 16.750000
All results match for block_size=64.
Checking correctness...
1.125032 1.054688
26.328930 26.375000
11.523789 11.437500
17.523972 17.500000
16.781761 16.750000
All results match for block_size=128.
Checking correctness...
1.125032 1.054688
26.328930 26.375000
11.523789 11.437500
17.523972 17.500000
16.781761 16.750000
All results match for block_size=256.
Checking correctness...
1.125032 1.054688
26.328930 26.375000
11.523789 11.437500
17.523972 17.500000
16.781761 16.750000
All results match for block_size=512.
Checking correctness...
1.125032 3.562500
Mismatch of dbias at 0: CPU_ref: 1.125032 vs GPU: 3.562500
26.328930 19.500000
Mismatch of dbias at 1: CPU_ref: 26.328930 vs GPU: 19.500000
11.523789 2.765625
Mismatch of dbias at 2: CPU_ref: 11.523789 vs GPU: 2.765625
17.523972 8.125000
Mismatch of dbias at 3: CPU_ref: 17.523972 vs GPU: 8.125000
16.781761 -1.429688
Mismatch of dbias at 4: CPU_ref: 16.781761 vs GPU: -1.429688
Mismatch of dbias at 5: CPU_ref: -6.718957 vs GPU: 10.437500
Mismatch of dbias at 6: CPU_ref: 10.953457 vs GPU: -3.593750
Mismatch of dbias at 7: CPU_ref: 14.391066 vs GPU: -98.500000
Mismatch of dbias at 9: CPU_ref: -34.071350 vs GPU: -23.625000
Mismatch of dbias at 10: CPU_ref: -0.218756 vs GPU: -22.000000
[A reduction that handles non-power-of-2 block sizes is more complicated and may not be suitable for kernel 1, which is meant as a demonstration.]
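To illustrate why the block size matters, here is a minimal host-side sketch in plain C. It assumes the reduction in kernel 1 follows the common halving-stride shared-memory pattern (the stride starts at half the block size and is halved each step); it is not copied from the repository. The names `halving_reduce_sim`, `is_power_of_two`, and `MAX_BLOCK_SIZE` are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed upper bound for this sketch: 1024 threads per block. */
#define MAX_BLOCK_SIZE 1024

/* True when n is a power of 2 (positive, with exactly one bit set). */
bool is_power_of_two(int n) {
    return n > 0 && (n & (n - 1)) == 0;
}

/*
 * Serial simulation of a halving-stride tree reduction over one block.
 * Assumption: this mirrors the shape of the kernel's shared-memory loop;
 * the inner loop plays the role of the threads with tid < stride.
 */
float halving_reduce_sim(const float *values, int block_size) {
    assert(block_size > 0 && block_size <= MAX_BLOCK_SIZE);
    float shared[MAX_BLOCK_SIZE];
    memcpy(shared, values, (size_t)block_size * sizeof(float));

    for (int stride = block_size / 2; stride >= 1; stride /= 2) {
        for (int tid = 0; tid < stride; tid++) {
            shared[tid] += shared[tid + stride];
        }
        /* If the current width is odd, integer division of the stride
           leaves one element that no later step ever adds in. */
    }
    return shared[0];
}
```

With an all-ones input, `halving_reduce_sim` returns 512 for a block size of 512, as expected. For a block size of 768 it also returns 512 instead of 768: the width shrinks 768, 384, ..., 6, 3, and at width 3 the next stride is 1, so `shared[2]` is never accumulated. A guard such as `is_power_of_two(block_size)` before launching is one way to reject unsupported sizes.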