
Turing architecture support

Open SimJeg opened this issue 2 years ago • 14 comments

Hello, just reopening this issue as I would love to use FA2 on T4 GPUs ^^

SimJeg avatar Sep 14 '23 15:09 SimJeg

I haven't had much bandwidth to work on Turing.

tridao avatar Sep 14 '23 17:09 tridao

@tridao for more context, I recently published a post on the current Kaggle LLM Science Exam competition (here) showing that it's possible to run a 70B model on a single T4 GPU. However, I am still limited by VRAM OOMs when using long contexts. There is already code for Llama 2 + HF (here), but it requires FA2 and thus does not work on T4 GPUs.

SimJeg avatar Sep 18 '23 14:09 SimJeg
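For readers hitting the same limitation: recent `transformers` releases choose the attention backend at load time via the `attn_implementation` argument, so a script written for FA2 can fall back to PyTorch SDPA on Turing cards. A minimal sketch, assuming a recent `transformers` version; the model id is only a placeholder, not the competition's 70B setup, and the exact exception raised on unsupported hardware may vary between versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model id

try:
    # Requires the flash-attn package and an Ampere-or-newer GPU;
    # fails on T4 (Turing, sm_75).
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",
    )
except (ImportError, ValueError):
    # Fall back to PyTorch's scaled_dot_product_attention,
    # which still runs on Turing GPUs.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        attn_implementation="sdpa",
    )

tokenizer = AutoTokenizer.from_pretrained(model_id)
```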

@tridao I could take a look at implementing this as also keen for T4 support, any obvious caveats that spring to mind?

david-macleod avatar Sep 29 '23 18:09 david-macleod

I concur with SimJeg: enabling FA2 for Turing would give massive exposure within the Kaggle community.

jfpuget avatar Oct 04 '23 19:10 jfpuget

I see. I'll try to find some time this weekend for this. Is the usage on T4 just inference (forward pass only)?

tridao avatar Oct 04 '23 19:10 tridao

Awesome. Yes, just inference in that competition.

jfpuget avatar Oct 04 '23 20:10 jfpuget

Hi, has there been any update on this?

sumanthnallamotu avatar Feb 09 '24 19:02 sumanthnallamotu

> Hi, has there been any update on this?

No, I haven't had much time.

tridao avatar Feb 09 '24 20:02 tridao

> I concur with SimJeg: enabling FA2 for Turing would give massive exposure within the Kaggle community.

I'll have to bring this up again: there is a new $1M Kaggle competition that really needs the performance boost from flash attention: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize

suicao avatar Apr 03 '24 16:04 suicao

Many of our partner schools and individual researchers are using 2080 Ti GPUs (Turing architecture). Please add flash_attn support for this architecture.

chuanzhubin avatar Apr 06 '24 09:04 chuanzhubin

Still no news?

Dampfinchen avatar Apr 30 '24 17:04 Dampfinchen

Nope, I've had no bandwidth.

tridao avatar Apr 30 '24 18:04 tridao

OpenAI's Triton implementation of flash attention works on Turing GPUs (just tested this myself):

https://github.com/openai/triton/blob/main/python/tutorials/06-fused-attention.py

rationalism avatar May 01 '24 01:05 rationalism
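For anyone who wants to try that route, the sketch below shows how the tutorial kernel might be dropped in after copying the linked file locally (saved here as `fused_attention.py`, a name chosen for illustration). It assumes the file exposes the `attention = _attention.apply` helper with a `(q, k, v, causal, sm_scale)` signature; that interface has shifted across Triton releases, so check the version you copy:

```python
import math
import torch

# Copied from the Triton tutorial linked above (06-fused-attention.py),
# saved locally as fused_attention.py; assumed to expose `attention`.
from fused_attention import attention

batch, heads, seq_len, head_dim = 2, 16, 2048, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

sm_scale = 1.0 / math.sqrt(head_dim)
# Causal self-attention computed by the fused Triton kernel,
# which does not depend on the sm80+ kernels that FlashAttention-2 uses.
out = attention(q, k, v, True, sm_scale)
print(out.shape)  # (batch, heads, seq_len, head_dim)
```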

So does this mean the 2070 Super supports at least FlashAttention 1? Is that the same as SDPA? I was under the impression that my GPU had no luck with any kind of flash attention, yet the kohya_ss trainer keeps saying "Torch was not compiled with flash attention" even though I enabled SDPA and it is indeed faster. I wonder if I should even bother looking into that.

Seedmanc avatar Sep 14 '24 18:09 Seedmanc
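One way to answer that empirically is to restrict PyTorch's `scaled_dot_product_attention` to a single backend and see which ones actually run on the card. A small probe script, assuming PyTorch 2.x (newer releases deprecate `torch.backends.cuda.sdp_kernel` in favor of `torch.nn.attention.sdpa_kernel`, but the call below still works):

```python
import torch
import torch.nn.functional as F

device = "cuda"
q = torch.randn(1, 8, 512, 64, device=device, dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

print("GPU:", torch.cuda.get_device_name(),
      "capability:", torch.cuda.get_device_capability())

# Try each SDPA backend in isolation and report whether it runs on this GPU.
backends = {
    "flash":         dict(enable_flash=True,  enable_mem_efficient=False, enable_math=False),
    "mem_efficient": dict(enable_flash=False, enable_mem_efficient=True,  enable_math=False),
    "math":          dict(enable_flash=False, enable_mem_efficient=False, enable_math=True),
}
for name, flags in backends.items():
    try:
        with torch.backends.cuda.sdp_kernel(**flags):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        print(f"{name}: OK")
    except RuntimeError as err:
        print(f"{name}: unavailable ({err})")
```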