When can we get flash-attention 2.x for Turing GPUs?

Open eileen2003-w opened this issue 1 year ago • 6 comments

I have already installed flash-attention 1.x (specifically flash-attn 1.0.8) because I only have a GPU with the Turing architecture (TITAN RTX). But the demo of a multimodal LLM I need to run requires flash-attn 2.x, and here is the corresponding code:

    from flash_attn import flash_attn_func as _flash_attn_func, flash_attn_varlen_func as _flash_attn_varlen_func
    from flash_attn.bert_padding import pad_input as _pad_input, index_first_axis as _index_first_axis, unpad_input as _unpad_input

    flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
    pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input

Every time it runs, an error is thrown; it seems the code cannot work with the 1.x API. I am very troubled by this issue, and it would be great if flash-attention 2.x were available for GPUs with the Turing architecture.
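
In the meantime I added a small guard like the sketch below (it assumes the packaging package is installed), just so the failure is a readable message instead of an obscure import error:

    # Sketch: fail early with a clear message when only flash-attn 1.x is installed.
    from importlib.metadata import version
    from packaging.version import Version  # assumption: packaging is available

    if Version(version("flash-attn")) < Version("2.0.0"):
        raise RuntimeError(
            "flash-attn 2.x is required for flash_attn_func / flash_attn_varlen_func, "
            "but only 1.x is installed (and 2.x has no Turing support yet)."
        )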

eileen2003-w avatar Sep 17 '24 02:09 eileen2003-w

I wish for the same. I don't get why there's a Flash Attention 3 for Hopper GPUs that no ordinary consumer can get (yes, I know it's technically still in beta), but still no Turing support for Flash Attention 2.

Carnyzzle avatar Sep 20 '24 21:09 Carnyzzle

It's because there are people willing to put in the work to make it work for Hopper. There have yet to be people contributing to make it work for Turing.

tridao avatar Sep 20 '24 21:09 tridao

> It's because there are people willing to put in the work to make it work for Hopper. There have yet to be people contributing to make it work for Turing.

Just integrate a fallback to Flash Attention 1 into Flash Attention 2. It's so, so frustrating. When I pip install torch, my training program always falls back to CPU because "torch was not compiled with flash attention". When I uninstall that and get the new build with flash attention compiled, it says it's not supported on my platform (Turing and Windows).

What the? Why? Just why? At least let me use Flash Attention 1.
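
Something like this rough sketch is all I'm asking for: try flash-attn 2, and fall back to PyTorch's built-in scaled_dot_product_attention (PyTorch 2.0+) instead of silently dropping to CPU. This is my own workaround, not something either library provides:

    # Sketch of a manual fallback; not part of flash-attn itself.
    import torch
    import torch.nn.functional as F

    try:
        from flash_attn import flash_attn_func  # flash-attn 2.x API
        HAVE_FLASH2 = True
    except ImportError:
        HAVE_FLASH2 = False

    def attention(q, k, v, causal=False):
        # q, k, v: (batch, seqlen, nheads, headdim), the layout flash_attn_func expects
        if HAVE_FLASH2:
            return flash_attn_func(q, k, v, causal=causal)
        # PyTorch's SDPA wants (batch, nheads, seqlen, headdim), so transpose around the call
        out = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=causal
        )
        return out.transpose(1, 2)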

Dampfinchen avatar Oct 09 '24 17:10 Dampfinchen

I have the same concern, and I really hope the team can come up with a Flash Attention 2 that supports the Turing architecture!

wangsen-zy avatar Mar 05 '25 05:03 wangsen-zy

It depends on folks contributing to make it work for Turing.

tridao avatar Mar 05 '25 06:03 tridao

I made a package that should be a drop-in replacement for Flash Attention 2 on Turing GPUs (with significant limitations):

https://github.com/rationalism/flash-attn-triton

pip install flash-attn-triton

This is a wrapper around a modified version of OpenAI's Triton Flash Attention, adapted to be compatible with the flash_attn API.

Note that I just published this and it is alpha-stage, so please expect bugs.
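
As a rough sketch of the intended usage (see the repo README for the definitive import path; the one shown here is an assumption), an existing flash_attn 2.x call like this should run unchanged on a Turing card:

    # Sketch of a standard flash_attn 2.x call the wrapper is meant to accept.
    import torch
    from flash_attn import flash_attn_func  # check the project README for the exact import path

    q = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
    k = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
    v = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")

    out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, nheads, headdim)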

rationalism avatar Mar 26 '25 21:03 rationalism

Could you please confirm whether the flash_attn_varlen_func interface is implemented in this package? Thank you! @rationalism
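
For context, this is the kind of upstream flash_attn 2.x varlen call I need to work (a sketch with made-up shapes):

    # Sketch: the upstream flash_attn 2.x varlen interface I'm asking about.
    import torch
    from flash_attn import flash_attn_varlen_func

    # Two packed sequences of lengths 3 and 5, concatenated along the token axis.
    cu_seqlens = torch.tensor([0, 3, 8], dtype=torch.int32, device="cuda")
    q = torch.randn(8, 16, 64, dtype=torch.float16, device="cuda")  # (total_tokens, nheads, headdim)
    k = torch.randn(8, 16, 64, dtype=torch.float16, device="cuda")
    v = torch.randn(8, 16, 64, dtype=torch.float16, device="cuda")

    out = flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=5, max_seqlen_k=5,
        causal=True,
    )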

hustlmh avatar May 28 '25 10:05 hustlmh