llm.c
coalesced memory reads for faster backward pass in attention
Uses one warp (instead of one thread) for each result to be computed. The lanes of a warp then read consecutive addresses in the inner loop, so the accesses are coalesced, which translates to a tremendous speedup.
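To illustrate the idea in isolation, here is a minimal sketch on a simpler problem (one dot product per matrix row), not the actual attention backward kernel from this PR. The kernel names `row_dot_thread` and `row_dot_warp` and the helper `max_abs_error_vs_cpu` are hypothetical and exist only for this example. In the thread-per-result version, neighbouring threads read addresses `cols` floats apart; in the warp-per-result version, the 32 lanes read neighbouring floats and then combine their partial sums with a shuffle reduction.

```cuda
#include <assert.h>
#include <cuda_runtime.h>
#include <math.h>
#include <stdlib.h>

#define WARP_SIZE 32

// Baseline: one thread per output. At loop step j, thread `row` reads
// A[row * cols + j], so adjacent threads are `cols` floats apart (uncoalesced).
__global__ void row_dot_thread(const float* A, const float* x, float* out,
                               int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float sum = 0.0f;
    for (int j = 0; j < cols; j++) sum += A[row * cols + j] * x[j];
    out[row] = sum;
}

// One warp per output. At each loop step the 32 lanes read 32 consecutive
// floats of the same row, so the loads coalesce.
// Assumes blockDim.x is a multiple of WARP_SIZE, so a warp never spans rows.
__global__ void row_dot_warp(const float* A, const float* x, float* out,
                             int rows, int cols) {
    int row = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
    int lane = threadIdx.x % WARP_SIZE;
    if (row >= rows) return;  // uniform across the warp: all lanes exit together
    float sum = 0.0f;
    for (int j = lane; j < cols; j += WARP_SIZE) sum += A[row * cols + j] * x[j];
    // Butterfly reduction: afterwards every lane holds the full row sum.
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
        sum += __shfl_xor_sync(0xffffffffu, sum, offset);
    if (lane == 0) out[row] = sum;
}

// Hypothetical test helper: runs both kernels and returns the largest
// absolute difference from a CPU reference. Error checking of the CUDA
// calls is omitted for brevity.
float max_abs_error_vs_cpu(int rows, int cols) {
    size_t n = (size_t)rows * cols;
    float* hA = (float*)malloc(n * sizeof(float));
    float* hx = (float*)malloc(cols * sizeof(float));
    float* hout = (float*)malloc(rows * sizeof(float));
    for (size_t i = 0; i < n; i++) hA[i] = (float)(i % 7) * 0.25f;
    for (int j = 0; j < cols; j++) hx[j] = (float)(j % 5) * 0.5f;

    float *dA, *dx, *dout;
    cudaMalloc((void**)&dA, n * sizeof(float));
    cudaMalloc((void**)&dx, cols * sizeof(float));
    cudaMalloc((void**)&dout, rows * sizeof(float));
    cudaMemcpy(dA, hA, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, cols * sizeof(float), cudaMemcpyHostToDevice);

    const int block = 256;
    assert(block % WARP_SIZE == 0);
    float max_err = 0.0f;
    for (int variant = 0; variant < 2; variant++) {
        cudaMemset(dout, 0, rows * sizeof(float));
        if (variant == 0) {
            int grid = (rows + block - 1) / block;              // one thread per row
            row_dot_thread<<<grid, block>>>(dA, dx, dout, rows, cols);
        } else {
            int grid = (rows * WARP_SIZE + block - 1) / block;  // one warp per row
            row_dot_warp<<<grid, block>>>(dA, dx, dout, rows, cols);
        }
        cudaMemcpy(hout, dout, rows * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < rows; i++) {
            double ref = 0.0;
            for (int j = 0; j < cols; j++) ref += (double)hA[(size_t)i * cols + j] * hx[j];
            max_err = fmaxf(max_err, fabsf(hout[i] - (float)ref));
        }
    }
    cudaFree(dA); cudaFree(dx); cudaFree(dout);
    free(hA); free(hx); free(hout);
    return max_err;
}
```

Both variants compute the same values; only the mapping of threads to memory addresses changes. The warp version launches 32 times as many threads, which is the price paid for the coalesced loads, so whether it wins in a given kernel depends on how much of the runtime the inner-loop reads account for.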