llm.c

coalesced memory reads for faster backward pass in attention

Open ngc92 opened this issue 2 years ago • 0 comments

Uses one warp (instead of one thread) for each result to be computed. The lanes of a warp then read consecutive addresses in the inner loop, so memory access is coalesced, which translates to a tremendous speedup.
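To illustrate the idea outside of the attention kernel, here is a minimal sketch that contrasts the two mappings on a simpler problem, a row-wise dot product. This is not the code from this change: the kernel names `row_dot_thread` and `row_dot_warp`, the problem sizes, and the use of managed memory are all assumptions chosen for brevity.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical baseline: one thread per output row. In every loop iteration,
// neighboring threads read addresses that are n floats apart (uncoalesced).
__global__ void row_dot_thread(const float* a, const float* b, float* out,
                               int rows, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float sum = 0.0f;
    for (int j = 0; j < n; j++) {
        sum += a[row * n + j] * b[row * n + j];
    }
    out[row] = sum;
}

// Hypothetical warp-per-row variant. Lane k reads elements k, k+32, k+64, ...
// so the 32 lanes of a warp touch consecutive addresses in each iteration.
// Assumes blockDim.x is a multiple of 32, so a warp never spans two rows.
__global__ void row_dot_warp(const float* a, const float* b, float* out,
                             int rows, int n) {
    int lane = threadIdx.x % 32;
    int row = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (row >= rows) return;  // row is uniform per warp, so the warp exits together
    float sum = 0.0f;
    for (int j = lane; j < n; j += 32) {
        sum += a[row * n + j] * b[row * n + j];
    }
    // Combine the 32 partial sums with a shuffle-based tree reduction.
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffffu, sum, offset);
    }
    if (lane == 0) out[row] = sum;  // lane 0 holds the full sum
}

int main() {
    const int rows = 1024, n = 768, block = 256;
    float *a, *b, *out_thread, *out_warp;
    cudaMallocManaged(&a, rows * n * sizeof(float));
    cudaMallocManaged(&b, rows * n * sizeof(float));
    cudaMallocManaged(&out_thread, rows * sizeof(float));
    cudaMallocManaged(&out_warp, rows * sizeof(float));
    for (int i = 0; i < rows * n; i++) {
        a[i] = (i % 7) * 0.25f;
        b[i] = (i % 5) * 0.5f;
    }

    // One thread per row vs. 32 threads (one warp) per row.
    row_dot_thread<<<(rows + block - 1) / block, block>>>(a, b, out_thread, rows, n);
    row_dot_warp<<<(rows * 32 + block - 1) / block, block>>>(a, b, out_warp, rows, n);
    cudaDeviceSynchronize();

    // The two kernels sum in a different order, so compare with a tolerance in mind.
    float max_diff = 0.0f;
    for (int r = 0; r < rows; r++) {
        max_diff = fmaxf(max_diff, fabsf(out_thread[r] - out_warp[r]));
    }
    printf("max abs difference: %g\n", max_diff);

    cudaFree(a); cudaFree(b); cudaFree(out_thread); cudaFree(out_warp);
    return 0;
}
```

The warp variant launches 32 times as many threads for the same number of results and pays for a short shuffle reduction at the end. In exchange, each iteration of the inner loop issues loads that the hardware can serve with far fewer memory transactions. How large the gain is depends on the row length and the GPU, so it is worth timing both versions on the target device.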

ngc92, Apr 17 '24 22:04