Add optimized GPU kernels for encoder_backward using shared memory
This commit introduces a new optimized kernel for the positional encoder backward pass:
Kernel Version 3: Uses shared memory to reduce global memory accesses and improve performance. Each thread block loads data into shared memory and performs atomic additions to dwte and dwpe.
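A minimal sketch of the shared-memory approach described above. The kernel name, signature, and launch layout are assumptions for illustration (following llm.c naming conventions), not the exact code in this PR:

```cuda
// Hypothetical sketch: one block per (b, t) position stages its gradient
// slice in shared memory, then scatters it into dwte and dwpe with atomics.
// Launch with grid = B*T blocks and shared memory size C * sizeof(float).
__global__ void encoder_backward3(float* dwte, float* dwpe,
                                  const float* dout, const int* inp,
                                  int B, int T, int C) {
    extern __shared__ float smem[];       // C floats per block (assumed)
    int b = blockIdx.x / T;
    int t = blockIdx.x % T;
    int ix = inp[b * T + t];              // token id at this position
    const float* dout_bt = dout + (b * T + t) * C;

    // stage the incoming gradient slice in shared memory
    for (int c = threadIdx.x; c < C; c += blockDim.x) {
        smem[c] = dout_bt[c];
    }
    __syncthreads();

    // scatter into the token and positional embedding gradients;
    // atomics are required because many (b, t) positions can share
    // the same token id ix, and dwpe[t] is shared across the batch
    for (int c = threadIdx.x; c < C; c += blockDim.x) {
        float d = smem[c];
        atomicAdd(&dwte[ix * C + c], d);
        atomicAdd(&dwpe[t * C + c], d);
    }
}
```

The staging step trades one extra shared-memory round trip for coalesced global reads of `dout`; the atomics remain the dominant cost, which is consistent with the small speedup reported below.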
Slightly improves performance, as shown by this basic timing result:
Kernel 2 (pre-existing):
New Kernel 3:
More detailed profiling (Nsight Systems on the Modal benchmark script: NVIDIA A100, 80 GB memory):
Kernel 2:
Kernel 3:
I don't think we can merge this because we need determinism, and this kernel uses atomicAdd: floating-point addition is not associative, so the order in which the atomics land changes the result between runs. Summoning @ademeure for comment too
It's fine for /dev/cuda/, which is the only thing this PR touches. I don't mind including this new kernel as an extra comparison point, which can be useful! It shows that using shared memory helps, but only slightly. I'm not sure it's worth it, though, since we can't actually use it for the training code.
The bigger issue is that /dev/cuda/encoder_backward.cu is out of date and doesn't include the kernels we actually use (which avoid the atomics). There are a few other /dev/cuda/ files that are out of date, unfortunately.
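For reference, the dwpe half of the backward pass can be made deterministic without atomics by giving each (t, c) entry to exactly one thread, which sums over the batch in a fixed order. This is a hedged sketch of that general idea, not the actual kernel used in the training code (the dwte half is harder, since positions sharing a token id must also be grouped, e.g. by sorting the indices):

```cuda
// Hypothetical atomic-free backward for dwpe only: each thread owns one
// (t, c) entry and accumulates over the batch dimension in a fixed order,
// so the result is bit-for-bit reproducible across runs.
// Launch with T*C total threads, e.g. <<<ceil_div(T*C, 256), 256>>>.
__global__ void wpe_backward_deterministic(float* dwpe, const float* dout,
                                           int B, int T, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= T * C) return;
    int t = idx / C;
    int c = idx % C;
    float sum = 0.0f;
    for (int b = 0; b < B; b++) {  // fixed summation order -> deterministic
        sum += dout[(b * T + t) * C + c];
    }
    dwpe[t * C + c] += sum;
}
```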