
Turing architecture support

Open SimJeg opened this issue 2 years ago • 14 comments

Hello, just reopening this issue as I would love to use FA2 on T4 GPUs ^^

SimJeg avatar Sep 14 '23 15:09 SimJeg

I haven't had much bandwidth to work on Turing.

tridao avatar Sep 14 '23 17:09 tridao

@tridao for more context, I recently published a post on the current Kaggle LLM Science Exam competition (here) showing that it's possible to run a 70B model on a single T4 GPU. However, I am still limited by VRAM OOMs when using long contexts. There is already code for Llama 2 + HF (here), but it requires FA2 and thus does not work on T4 GPUs.

SimJeg avatar Sep 18 '23 14:09 SimJeg
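For readers hitting the same limitation: recent `transformers` releases choose the attention backend at load time via the `attn_implementation` argument, so a script written for FA2 can fall back to PyTorch SDPA on Turing cards. A minimal sketch, assuming a recent `transformers` version; the model id is only a placeholder, not the competition's 70B setup, and the exact exception raised on unsupported hardware may vary between versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model id

try:
    # Requires the flash-attn package and an Ampere-or-newer GPU;
    # fails on T4 (Turing, sm_75).
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",
    )
except (ImportError, ValueError):
    # Fall back to PyTorch's scaled_dot_product_attention,
    # which still runs on Turing GPUs.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        attn_implementation="sdpa",
    )

tokenizer = AutoTokenizer.from_pretrained(model_id)
```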

@tridao I could take a look at implementing this as also keen for T4 support, any obvious caveats that spring to mind?

david-macleod avatar Sep 29 '23 18:09 david-macleod

I concur with SimJeg: enabling FA2 for Turing would give massive exposure within the Kaggle community.

jfpuget avatar Oct 04 '23 19:10 jfpuget

I see. I'll try to find some time this weekend for this. Is the usage on T4 just inference (forward pass only)?

tridao avatar Oct 04 '23 19:10 tridao

Awesome. Yes, just inference in that competition.

jfpuget avatar Oct 04 '23 20:10 jfpuget

Hi, has there been any update on this?

sumanthnallamotu avatar Feb 09 '24 19:02 sumanthnallamotu

> Hi, has there been any update on this?

No, I haven't had much time.

tridao avatar Feb 09 '24 20:02 tridao

> I concur with SimJeg: enabling FA2 for Turing would give massive exposure within the Kaggle community.

I'll have to bring this up again: there is a new $1M Kaggle competition that really needs the performance boost from flash attention: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize

suicao avatar Apr 03 '24 16:04 suicao

Many of our partner schools and individual researchers are using 2080 Ti GPUs (Turing architecture). Please add flash_attn support for this architecture.

chuanzhubin avatar Apr 06 '24 09:04 chuanzhubin

Still no news?

Dampfinchen avatar Apr 30 '24 17:04 Dampfinchen

Nope, I've had no bandwidth.

tridao avatar Apr 30 '24 18:04 tridao

OpenAI's Triton implementation of flash attention works on Turing GPUs (just tested this myself):

https://github.com/openai/triton/blob/main/python/tutorials/06-fused-attention.py

rationalism avatar May 01 '24 01:05 rationalism
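For anyone who wants to try that route, the sketch below shows how the tutorial kernel might be dropped in after copying the linked file locally (saved here as `fused_attention.py`, a name chosen for illustration). It assumes the file exposes the `attention = _attention.apply` helper with a `(q, k, v, causal, sm_scale)` signature; that interface has shifted across Triton releases, so check the version you copy:

```python
import math
import torch

# Copied from the Triton tutorial linked above (06-fused-attention.py),
# saved locally as fused_attention.py; assumed to expose `attention`.
from fused_attention import attention

batch, heads, seq_len, head_dim = 2, 16, 2048, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

sm_scale = 1.0 / math.sqrt(head_dim)
# Causal self-attention computed by the fused Triton kernel,
# which does not depend on the sm80+ kernels that FlashAttention-2 uses.
out = attention(q, k, v, True, sm_scale)
print(out.shape)  # (batch, heads, seq_len, head_dim)
```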

So does this mean the 2070 Super supports at least FlashAttention 1? Is that the same as SDPA? I was under the impression that my GPU had no luck with any kind of flash attention, yet the kohya_ss trainer keeps saying "Torch was not compiled with flash attention" even though I enabled SDPA and it is indeed faster. I wonder if I should even bother looking into that.

Seedmanc avatar Sep 14 '24 18:09 Seedmanc
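One way to answer that empirically is to restrict PyTorch's `scaled_dot_product_attention` to a single backend and see which ones actually run on the card. A small probe script, assuming PyTorch 2.x (newer releases deprecate `torch.backends.cuda.sdp_kernel` in favor of `torch.nn.attention.sdpa_kernel`, but the call below still works):

```python
import torch
import torch.nn.functional as F

device = "cuda"
q = torch.randn(1, 8, 512, 64, device=device, dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

print("GPU:", torch.cuda.get_device_name(),
      "capability:", torch.cuda.get_device_capability())

# Try each SDPA backend in isolation and report whether it runs on this GPU.
backends = {
    "flash":         dict(enable_flash=True,  enable_mem_efficient=False, enable_math=False),
    "mem_efficient": dict(enable_flash=False, enable_mem_efficient=True,  enable_math=False),
    "math":          dict(enable_flash=False, enable_mem_efficient=False, enable_math=True),
}
for name, flags in backends.items():
    try:
        with torch.backends.cuda.sdp_kernel(**flags):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        print(f"{name}: OK")
    except RuntimeError as err:
        print(f"{name}: unavailable ({err})")
```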