flash-attention
Turing support?
A while ago it was said that Turing support is coming. What is preventing compilation on Turing now? Is it the lack of BF16? Can that just be disabled for Turing, or disabled overall with a compile flag? Is there anything more to it? There are now some 22 GB 2080 Ti cards, so it would be cool to have them work alongside Ampere cards.
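For reference on the BF16 point: Turing is compute capability 7.5 and has no hardware BF16 support, which is presumably why any sm75 path would need FP16-only kernels. A minimal check (illustration only, not the project's actual dispatch logic):

```python
import torch

# Turing is sm_75 and lacks hardware BF16, so a hypothetical sm75 build
# would have to skip or compile out the BF16 kernel instantiations.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")

if (major, minor) < (8, 0):
    print("Pre-Ampere card: BF16 kernels would need to be disabled.")
```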
I'm +1'ing this for use in the Google Colab T4 environment.
Unfortunately I haven't had much bandwidth.
I was able to get everything to compile, but in some cases I get an "invalid argument" CUDA assertion. Also not sure about the performance.
Turing cards have less shared memory (64KB instead of 99KB or 163KB on Ampere) so that might require adjusting the block sizes currently used.
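To make that concrete, here's a rough back-of-the-envelope sketch (not the actual flash-attention tiling, just an illustration of why a tile that fits Ampere's shared memory can overflow Turing's 64 KB). The `tile_smem_bytes` helper and the limit table are assumptions for the sake of the example:

```python
import torch

# Very rough per-tile shared-memory estimate: a Q tile (Br x d) plus
# K and V tiles (Bc x d each) in fp16 (2 bytes). The real kernels keep
# more state than this, so treat these numbers as a lower bound.
def tile_smem_bytes(block_m: int, block_n: int, head_dim: int, elem_bytes: int = 2) -> int:
    return (block_m * head_dim + 2 * block_n * head_dim) * elem_bytes

# Approximate opt-in shared memory per block by compute capability.
SMEM_LIMIT = {(7, 5): 64 * 1024, (8, 6): 99 * 1024, (8, 0): 163 * 1024}

cc = torch.cuda.get_device_capability(0)
limit = SMEM_LIMIT.get(cc, 48 * 1024)

for block_m, block_n in [(128, 128), (128, 64), (64, 64), (64, 32)]:
    need = tile_smem_bytes(block_m, block_n, head_dim=128)
    verdict = "fits" if need <= limit else "too big"
    print(f"Br={block_m:3d} Bc={block_n:3d}: ~{need // 1024:3d} KB -> {verdict} (limit {limit // 1024} KB)")
```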
It worked on small outputs but then failed on large ones; I still have to figure out where. I didn't edit the block-size section, which is mainly written for sm8x/90, and that's a likely culprit. I'm really only using the forward pass, for exllamav2.
edit: I tried limiting the kernel to only blocks of 32, but I'm still getting the invalid argument error. I spent all day trying to figure out how to get a line number out of PyTorch, but it seems I need a build with DSA (device-side assertions) enabled (or compile PyTorch myself, ouch) or some other method of debugging.
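For what it's worth, a common way to localize an asynchronous CUDA error like this without rebuilding PyTorch is to force synchronous launches with `CUDA_LAUNCH_BLOCKING=1` (device-side assertions via `TORCH_USE_CUDA_DSA` do indeed need a build compiled with them, as you say; running the script under NVIDIA's `compute-sanitizer` is another option). A minimal sketch, with the import of the Turing fork assumed:

```python
import os

# Must be set before CUDA is initialized, so kernel launch errors are
# reported at the call site instead of at a later synchronization point.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
from flash_attn import flash_attn_func  # the Turing fork being tested (module name assumed)

q = torch.randn(1, 4096, 32, 128, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

try:
    out = flash_attn_func(q, k, v, causal=True)
    torch.cuda.synchronize()
except RuntimeError as e:
    # With blocking launches the traceback points at the failing launch.
    print("CUDA error at launch site:", e)
```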
Interested in this too.
Also interested in this to run in AWS Graviton servers.
Interested in this too.
Interested in this as well.
I am now working on a pet project combining the approach from "The Era of 1-bit LLMs" paper (which efficiently shows that ~1.58 bits per parameter is enough) with the ReLoRA paper. Unlike the original 1-bit LLM authors, who only showed that this information capacity is sufficient, I actually use models quantized with that approach, so I need a method that does not pass gradients through the main model weights.
However, even so, I still need to produce quite large intermediate matrices during the forward and backward passes, which FlashAttention computes on the fly, as far as I understand?
So it would be quite nice to see how all three techniques work in combination (minimal model weights + no big gradient/optimizer state thanks to ReLoRA + fewer intermediate matrices thanks to FlashAttention).
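Roughly, yes: fused attention kernels never materialize the full (seqlen x seqlen) score matrix, which is where the memory savings come from. A small PyTorch comparison (using `scaled_dot_product_attention` as a stand-in for FlashAttention; exact numbers depend on backend and shapes) illustrates the difference:

```python
import torch
import torch.nn.functional as F

b, h, s, d = 1, 16, 4096, 64
q = torch.randn(b, h, s, d, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

def naive_attention(q, k, v):
    # Materializes an (s x s) score matrix per head: b * h * s * s fp16 elements.
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v

for name, fn in [("naive", naive_attention),
                 ("fused", lambda q, k, v: F.scaled_dot_product_attention(q, k, v))]:
    torch.cuda.reset_peak_memory_stats()
    fn(q, k, v)
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"{name:5s} peak memory: {peak:8.1f} MiB")
```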
Turing cards have less shared memory (64KB instead of 99KB or 163KB on Ampere) so that might require adjusting the block sizes currently used.
Does this mean completing support would take a lot of effort?
That is editable in one file as far as I saw.
OpenAI's Triton implementation of flash attention works on Turing GPUs (just tested this myself):
https://github.com/openai/triton/blob/main/python/tutorials/06-fused-attention.py
Llama.cpp also has an implementation. The only problem with the vllm/oai implementation is that it isn't drop-in like the one here.
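The main friction is layout and signature: `flash_attn_func` takes (batch, seqlen, nheads, headdim) tensors, while the Triton tutorial kernel works on (batch, nheads, seqlen, headdim) and is exposed as a bare autograd function. A thin wrapper along these lines can get closer to drop-in. This is only a sketch: it assumes the tutorial file has been copied locally as `triton_fused_attention.py` and that its `attention(q, k, v, causal, sm_scale)` entry point matches your Triton version (the argument order differs between releases):

```python
import torch

# Assumes openai/triton's 06-fused-attention.py has been saved locally as
# triton_fused_attention.py; its `attention` entry point varies between
# Triton releases, so adjust to match your copy.
from triton_fused_attention import attention as triton_attention

def flash_attn_func_triton(q, k, v, causal=False, softmax_scale=None):
    """Mimic flash_attn_func's (batch, seqlen, nheads, headdim) interface."""
    if softmax_scale is None:
        softmax_scale = q.shape[-1] ** -0.5
    # The Triton tutorial kernel expects (batch, nheads, seqlen, headdim).
    q, k, v = (t.transpose(1, 2).contiguous() for t in (q, k, v))
    out = triton_attention(q, k, v, causal, softmax_scale)
    return out.transpose(1, 2).contiguous()

q = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = flash_attn_func_triton(q, k, v, causal=True)
print(out.shape)  # (2, 1024, 16, 64)
```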
+1
@rationalism how do you install that version you mentioned? I need to use it with InternVL mentioned here. Thanks.
I know that version. It's slow and it is missing a lot of features. I think vLLM supports it. Anyway, I made a version that "works", based off, IIRC, 2.5.x. Unfortunately it cranks for a while and then fails with "invalid parameter". I printed all the tensors and they appear normal until I hit the error. I even shrank the shared memory on some of the kernels so they would fit. I should post it up at some point; maybe someone better at this than me will be able to figure it out. Basically it dies halfway into an output. I only care about the forward pass, not training.
https://github.com/Ph0rk0z/flash-attn-turning/commit/b3dc600f3916528105462e918434921f1dc65459