llm.c
Second matmul for fully custom attention
So far this is only in the /dev files, because for the main script we would also need to touch the backward pass. For some reason, I see a considerable speed-up in the benchmarks here, but in my attempts to use this in the main model, that hasn't really translated.
What is the speed of matmul_tri compared with cuBLAS?
On my A4000, cuBLAS (no tensor cores) is reported at 52% of FP32 capacity, whereas this kernel gets 33%. So it is slower per FLOP, but since it only computes half of the matrix (the causal lower triangle), it still wins out overall. That changes with tensor cores, though.
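For context, a minimal sketch of the "only half" idea. This is not the actual matmul_tri in /dev (which is tiled and much more optimized); the names matmul_tri_naive, q, k, out, T, HS are purely illustrative. The point is that the score matrix Q K^T is only needed on the causal lower triangle, so threads mapped to the upper triangle return immediately, skipping roughly half of the dot products a full cuBLAS GEMM would compute:

    __global__ void matmul_tri_naive(float* out, const float* q, const float* k,
                                     int T, int HS) {
        // one thread per element of the (T, T) attention score matrix
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // query position
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // key position
        if (row >= T || col >= T) return;
        if (col > row) return;  // causal mask: the upper triangle is never computed
        float acc = 0.0f;
        for (int d = 0; d < HS; d++) {
            acc += q[row * HS + d] * k[col * HS + d];
        }
        out[row * T + col] = acc;
    }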
I think it's the writing back of results that is still quite bad here.
Some more optimizations, and now it's slightly faster than the tensor-core counterparts. Together with getting rid of the permutes, this yields a substantial net speedup for the attention kernel. Unfortunately, we cannot yet use this in the main model, because the backward pass still assumes the permutations.
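For reference, a hedged sketch of what "getting rid of the permutes" means, assuming the packed (B, T, 3, NH, HS) QKV layout that the attention forward otherwise permutes into (B, NH, T, HS) for the cuBLAS batched matmuls; the helper names below (load_q, load_k, qkv) are purely illustrative. A fully custom kernel can index the packed tensor directly and skip the permute/unpermute kernels entirely:

    __device__ __forceinline__ float load_q(const float* qkv, int b, int h, int t, int d,
                                             int T, int NH, int HS) {
        // element (b, t, 0 /* Q */, h, d) of the packed (B, T, 3, NH, HS) tensor
        return qkv[(((size_t)b * T + t) * 3 + 0) * NH * HS + (size_t)h * HS + d];
    }

    __device__ __forceinline__ float load_k(const float* qkv, int b, int h, int t, int d,
                                             int T, int NH, int HS) {
        // same, but selecting the K slice (index 1 along the size-3 dimension)
        return qkv[(((size_t)b * T + t) * 3 + 1) * NH * HS + (size_t)h * HS + d];
    }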