coalesced memory reads for faster backward pass in attention

Open ngc92 opened this issue 2 years ago • 0 comments

Uses one warp (instead of one thread) for each result to be computed. The lanes of a warp then read adjacent memory locations in the inner loop, so the accesses are coalesced, which translates to a tremendous speedup.
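To illustrate the idea, here is a minimal sketch of the warp-per-result pattern. It is not the kernel from this change: it uses a simplified row-wise dot product rather than the attention backward pass, and the names `row_dot_thread`, `row_dot_warp`, and `run_row_dot_demo` are hypothetical. It assumes a warp size of 32 and a block size that is a multiple of 32.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical baseline: one thread per output row. Neighbouring threads read
// addresses `cols` floats apart, so the loads of a warp are not coalesced.
__global__ void row_dot_thread(const float* a, const float* b, float* out,
                               int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float sum = 0.0f;
    for (int i = 0; i < cols; ++i) sum += a[row * cols + i] * b[row * cols + i];
    out[row] = sum;
}

// Sketch of the warp-per-result pattern: the 32 lanes of a warp share one row.
// In each iteration lane k reads element (i + k), so the warp touches
// consecutive addresses and the loads coalesce.
__global__ void row_dot_warp(const float* a, const float* b, float* out,
                             int rows, int cols) {
    int lane = threadIdx.x % 32;
    int row = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (row >= rows) return;  // uniform per warp, since blockDim.x % 32 == 0
    float sum = 0.0f;
    for (int i = lane; i < cols; i += 32) sum += a[row * cols + i] * b[row * cols + i];
    // Combine the 32 partial sums with a shuffle-based tree reduction.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);
    if (lane == 0) out[row] = sum;
}

// Runs both kernels on small test data and returns the largest absolute
// difference from a CPU reference (or -1 on a CUDA error).
float run_row_dot_demo() {
    const int rows = 64, cols = 100;  // cols deliberately not a multiple of 32
    const size_t n = (size_t)rows * cols;
    std::vector<float> a(n), b(n), ref(rows, 0.0f), got(rows);
    for (size_t i = 0; i < n; ++i) { a[i] = 0.01f * (i % 7); b[i] = 0.02f * (i % 5); }
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) ref[r] += a[r * cols + c] * b[r * cols + c];

    float *da, *db, *dout;
    if (cudaMalloc(&da, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&db, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dout, rows * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemcpy(da, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int block = 128;  // 4 warps per block
    float max_err = 0.0f;
    for (int variant = 0; variant < 2; ++variant) {
        if (variant == 0) {
            row_dot_thread<<<(rows + block - 1) / block, block>>>(da, db, dout, rows, cols);
        } else {
            row_dot_warp<<<(rows * 32 + block - 1) / block, block>>>(da, db, dout, rows, cols);
        }
        if (cudaDeviceSynchronize() != cudaSuccess) return -1.0f;
        cudaMemcpy(got.data(), dout, rows * sizeof(float), cudaMemcpyDeviceToHost);
        for (int r = 0; r < rows; ++r) max_err = fmaxf(max_err, fabsf(got[r] - ref[r]));
    }
    cudaFree(da); cudaFree(db); cudaFree(dout);
    return max_err;
}

int main() {
    printf("max abs error vs CPU reference: %g\n", run_row_dot_demo());
    return 0;
}
```

Both kernels compute the same values; only the mapping of threads to data differs. The warp version launches 32 times as many threads and pays for a short shuffle reduction, which is typically a good trade when the inner loop is bound by memory bandwidth.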

Apr 17 '24 22:04 ngc92