llm.c: Fix the bug that yields cpu, gpu results mismatch in crossentropy_softmax_backward.cu

Open lancerts opened this issue 10 months ago • 2 comments

On the master branch:

❯ nvcc -O3 --use_fast_math crossentropy_softmax_backward.cu -o crossentropy_softmax_backward
❯ ./crossentropy_softmax_backward 1
Using kernel 1
Checking block size 32.
-0.438539 -0.438539
0.136152 0.136152
-0.364946 -0.364946
-0.384722 -0.384722
-0.530659 -0.530659
Checking block size 64.
-0.438539 -0.877079
Mismatch of dlogits at 0: CPU_ref: -0.438539 vs GPU: -0.877079
0.136152 0.272304
Mismatch of dlogits at 1: CPU_ref: 0.136152 vs GPU: 0.272304
-0.364946 -0.729892
Mismatch of dlogits at 2: CPU_ref: -0.364946 vs GPU: -0.729892
-0.384722 -0.769444
Mismatch of dlogits at 3: CPU_ref: -0.384722 vs GPU: -0.769444
-0.530659 -1.061317
Mismatch of dlogits at 4: CPU_ref: -0.530659 vs GPU: -1.061317
Mismatch of dlogits at 5: CPU_ref: 0.389889 vs GPU: 0.779779
Mismatch of dlogits at 6: CPU_ref: 0.212416 vs GPU: 0.424832
Mismatch of dlogits at 7: CPU_ref: -0.345777 vs GPU: -0.691555
Mismatch of dlogits at 8: CPU_ref: 0.286473 vs GPU: 0.572946
Mismatch of dlogits at 9: CPU_ref: -0.069573 vs GPU: -0.139146

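The cause is the line dlogits_bt[v] += (p - indicator) * dloss; in the GPU kernel: it accumulates into dlogits rather than overwriting it, so dlogits needs to be reset before every run. Otherwise the second block size adds its gradients on top of the first run's result, which is why the GPU values at block size 64 are twice the CPU reference.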
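To make the doubling concrete, here is a minimal CPU-only sketch in C. It is not the kernel from crossentropy_softmax_backward.cu: the function name, the flattened index N (standing in for B*T), and the argument layout are assumptions for illustration. Only the accumulating += line is taken from the issue.

```c
#include <stddef.h>

// Hypothetical CPU stand-in for the accumulating GPU kernel.
// Like the kernel line quoted above, it uses += on dlogits instead of =.
void softmax_backward_accumulate(float* dlogits, const float* probs,
                                 const int* targets, const float* dlosses,
                                 int N, int V) {
    for (int i = 0; i < N; i++) {  // i is a flattened (b, t) position
        float* dlogits_bt = dlogits + (size_t)i * V;
        const float* probs_bt = probs + (size_t)i * V;
        float dloss = dlosses[i];
        for (int v = 0; v < V; v++) {
            float p = probs_bt[v];
            float indicator = (v == targets[i]) ? 1.0f : 0.0f;
            // Accumulates: a second call without a reset doubles the result.
            dlogits_bt[v] += (p - indicator) * dloss;
        }
    }
}
```

Calling this twice on the same dlogits buffer without zeroing it in between leaves every entry at twice the single-run value, which matches the pattern in the block size 64 output.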

Apr 18 '24 00:04 lancerts

we can't just malloc on repeat, without free. maybe memset to zero if needed?

Apr 18 '24 03:04 karpathy

we can't just malloc on repeat, without free. maybe memset to zero if needed?

@karpathy good point... I always forget we are dealing with raw pointers instead of smart pointers :P. I switched to a memset to 0, and the script now passes all checks.
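The actual patch is not shown in this thread, so the following is only a host-side sketch in C of the reset-before-run pattern. The names run_with_reset, launch_fn, and toy_accumulate are hypothetical; toy_accumulate is a stand-in for any kernel that accumulates with +=.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical launcher type: stands in for "run the kernel with this block size".
typedef void (*launch_fn)(float* dlogits, size_t n, int block_size);

// Zero the gradient buffer before each run so an accumulating (+=) kernel
// starts from a clean slate. The buffer is reused, so nothing is reallocated.
void run_with_reset(launch_fn launch, float* dlogits, size_t n, int block_size) {
    memset(dlogits, 0, n * sizeof(float));  // all-zero bytes are 0.0f in IEEE 754
    launch(dlogits, n, block_size);
}

// Toy accumulating "kernel" for illustration only: adds 1.0f to every element.
void toy_accumulate(float* dlogits, size_t n, int block_size) {
    (void)block_size;  // unused in this toy version
    for (size_t i = 0; i < n; i++) {
        dlogits[i] += 1.0f;
    }
}
```

For a device buffer the analogous call would be cudaMemset(d_dlogits, 0, n * sizeof(float)) before each kernel launch, where d_dlogits is an assumed name for the device pointer. Resetting the existing allocation avoids the repeated malloc without free raised above.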

Apr 18 '24 04:04 lancerts
