llm.c
Second matmul for fully custom attention
So far this is only in the /dev files, because for the main script we would also need to touch the backward pass. For some reason, I see a considerable speed-up in the benchmarks here, but in my attempts to use this in the main model, that hasn't really translated.
What is the speed of matmul_tri compared with cuBLAS?
On my A4000, cuBLAS (no tensor cores) is reported at 52% of FP32 capacity, whereas this kernel gets 33%. So it is slower per FLOP, but since it only computes half of the matrix (the causal lower triangle), it still wins out overall. That changes with tensor cores, though.
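For context, a minimal sketch of the "only half" idea. This is not the actual matmul_tri in /dev (which is tiled and much more optimized); the names matmul_tri_naive, q, k, out, T, HS are purely illustrative. The point is that the score matrix Q K^T is only needed on the causal lower triangle, so threads mapped to the upper triangle return immediately, skipping roughly half of the dot products a full cuBLAS GEMM would compute:

    __global__ void matmul_tri_naive(float* out, const float* q, const float* k,
                                     int T, int HS) {
        // one thread per element of the (T, T) attention score matrix
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // query position
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // key position
        if (row >= T || col >= T) return;
        if (col > row) return;  // causal mask: the upper triangle is never computed
        float acc = 0.0f;
        for (int d = 0; d < HS; d++) {
            acc += q[row * HS + d] * k[col * HS + d];
        }
        out[row * T + col] = acc;
    }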
I think it's the writing back of results that is still quite bad here.
Some more optimizations, and now it's slightly faster than the tensor-core counterparts. Together with getting rid of the permutes, this yields a substantial net speedup for the attention kernel. Unfortunately, we cannot yet use this in the main model, because the backward pass still assumes the permutations.
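For reference, a hedged sketch of what "getting rid of the permutes" means, assuming the packed (B, T, 3, NH, HS) QKV layout that the attention forward otherwise permutes into (B, NH, T, HS) for the cuBLAS batched matmuls; the helper names below (load_q, load_k, qkv) are purely illustrative. A fully custom kernel can index the packed tensor directly and skip the permute/unpermute kernels entirely:

    __device__ __forceinline__ float load_q(const float* qkv, int b, int h, int t, int d,
                                             int T, int NH, int HS) {
        // element (b, t, 0 /* Q */, h, d) of the packed (B, T, 3, NH, HS) tensor
        return qkv[(((size_t)b * T + t) * 3 + 0) * NH * HS + (size_t)h * HS + d];
    }

    __device__ __forceinline__ float load_k(const float* qkv, int b, int h, int t, int d,
                                             int T, int NH, int HS) {
        // same, but selecting the K slice (index 1 along the size-3 dimension)
        return qkv[(((size_t)b * T + t) * 3 + 1) * NH * HS + (size_t)h * HS + d];
    }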