llm.c
Fix the bug that causes a CPU/GPU result mismatch in crossentropy_softmax_backward.cu
On the master branch:
❯ nvcc -O3 --use_fast_math crossentropy_softmax_backward.cu -o crossentropy_softmax_backward
❯ ./crossentropy_softmax_backward 1
Using kernel 1
Checking block size 32.
-0.438539 -0.438539
0.136152 0.136152
-0.364946 -0.364946
-0.384722 -0.384722
-0.530659 -0.530659
Checking block size 64.
-0.438539 -0.877079
Mismatch of dlogits at 0: CPU_ref: -0.438539 vs GPU: -0.877079
0.136152 0.272304
Mismatch of dlogits at 1: CPU_ref: 0.136152 vs GPU: 0.272304
-0.364946 -0.729892
Mismatch of dlogits at 2: CPU_ref: -0.364946 vs GPU: -0.729892
-0.384722 -0.769444
Mismatch of dlogits at 3: CPU_ref: -0.384722 vs GPU: -0.769444
-0.530659 -1.061317
Mismatch of dlogits at 4: CPU_ref: -0.530659 vs GPU: -1.061317
Mismatch of dlogits at 5: CPU_ref: 0.389889 vs GPU: 0.779779
Mismatch of dlogits at 6: CPU_ref: 0.212416 vs GPU: 0.424832
Mismatch of dlogits at 7: CPU_ref: -0.345777 vs GPU: -0.691555
Mismatch of dlogits at 8: CPU_ref: 0.286473 vs GPU: 0.572946
Mismatch of dlogits at 9: CPU_ref: -0.069573 vs GPU: -0.139146
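The cause is the accumulation dlogits_bt[v] += (p - indicator) * dloss; in the GPU kernel. Because the kernel adds to dlogits rather than overwriting it, the buffer needs to be reset before every run. Otherwise each later block-size check starts from the previous run's result, which is why the GPU values at block size 64 are exactly twice the CPU reference.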
we can't just malloc on repeat, without free. maybe memset to zero if needed?
@karpathy good point... I always forget we are dealing with raw pointers instead of smart pointers :P. It now uses memset to zero the buffer, and the script passes all checks.
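For reference, the reset-before-each-run pattern discussed above could look roughly like the sketch below. This is not the actual code from crossentropy_softmax_backward.cu: accumulate_kernel, the buffer names, and the sizes are hypothetical stand-ins, chosen only to show a kernel that uses += on its output and a cudaMemset that clears the buffer before every launch. Error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: like dlogits_bt[v] += ...,
// it accumulates into its output instead of overwriting it.
__global__ void accumulate_kernel(float* dlogits, const float* grad, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dlogits[i] += grad[i];
    }
}

int main() {
    const int n = 8;
    float h_grad[n], h_out[n];
    for (int i = 0; i < n; i++) h_grad[i] = 0.1f * (i + 1);

    // Allocate once, outside the loop, so nothing is leaked per iteration.
    float *d_grad, *d_dlogits;
    cudaMalloc(&d_grad, n * sizeof(float));
    cudaMalloc(&d_dlogits, n * sizeof(float));
    cudaMemcpy(d_grad, h_grad, n * sizeof(float), cudaMemcpyHostToDevice);

    int block_sizes[] = {32, 64, 128};
    int ok = 1;
    for (int b = 0; b < 3; b++) {
        int block_size = block_sizes[b];
        // Reset the accumulation buffer before every run. Without this line,
        // the second iteration would see values doubled, the third tripled.
        cudaMemset(d_dlogits, 0, n * sizeof(float));

        int grid = (n + block_size - 1) / block_size;
        accumulate_kernel<<<grid, block_size>>>(d_dlogits, d_grad, n);
        cudaMemcpy(h_out, d_dlogits, n * sizeof(float), cudaMemcpyDeviceToHost);

        // Compare against the expected single-run result.
        for (int i = 0; i < n; i++) {
            if (fabsf(h_out[i] - h_grad[i]) > 1e-6f) {
                printf("Mismatch at %d with block size %d\n", i, block_size);
                ok = 0;
            }
        }
    }

    cudaFree(d_grad);
    cudaFree(d_dlogits);
    return ok ? 0 : 1;
}
```

Zeroing with cudaMemset works here because an all-zero byte pattern is 0.0f for IEEE floats. Reusing a single allocation and clearing it also avoids the repeated malloc without free raised above.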