
Great work! When will flash attention 4 be released?

Open moveforever opened this issue 2 months ago • 13 comments

moveforever avatar Aug 27 '25 02:08 moveforever

Soon, 3-4 weeks

tridao avatar Aug 27 '25 03:08 tridao

https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/cute/flash_fwd_sm100.py Isn't this fa4?

jc19chaoj avatar Aug 28 '25 02:08 jc19chaoj

It's still a work in progress and not complete yet

tridao avatar Aug 28 '25 05:08 tridao

Soon, 3-4 weeks

Great!! Will backward kernel also be released?

retonym avatar Sep 01 '25 12:09 retonym

Yes

tridao avatar Sep 01 '25 14:09 tridao

Soon, 3-4 weeks

I believe now we can say 1-2 weeks? ^^

WingsOfPanda avatar Sep 09 '25 08:09 WingsOfPanda

At such low precision (fp4 on Blackwell), will it still be exact, or will it have some info loss, like in the SageAttention variants?

kabachuha avatar Sep 09 '25 14:09 kabachuha

Soon, 3-4 weeks

time is up😁

joy-seu avatar Sep 23 '25 23:09 joy-seu

Time is up!

BNAadministrator3 avatar Oct 23 '25 09:10 BNAadministrator3

Time is up!


Minwellcym avatar Oct 24 '25 04:10 Minwellcym

🥹

puppetm4st3r avatar Oct 28 '25 18:10 puppetm4st3r

Does flash attention 4 support sm120? @tridao, thanks for your great contributions.

moveforever avatar Oct 30 '25 11:10 moveforever

Does flash attention 4 support sm120? @tridao, thanks for your great contributions.

Sm120 is for the RTX 50 series GPUs, which have exactly the same architecture as the previous RTX 30 and RTX 40 series except for fp4 support and TMA. So Flash Attention 2 works well (maybe with a bit of room for improvement thanks to TMA), while Flash Attention 3 and 4 DO NOT WORK ON SM120.

melonedo avatar Nov 01 '25 12:11 melonedo
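
A minimal sketch of the dispatch this implies, assuming the stock `flash_attn` 2.x package plus, optionally, the FA3 build from the repo's hopper/ directory. The `flash_attn_interface` module name and its return convention are treated as assumptions here, not a documented contract, and the capability check simply encodes the claim above rather than an official support matrix:

```python
# Illustrative dispatch only: FA2 covers sm80-sm120, while the FA3 build
# (the repo's hopper/ directory) targets sm90.
import torch
from flash_attn import flash_attn_func as flash_attn_2  # FlashAttention-2 API

try:
    # FA3 interface from the hopper/ build; optional, may not be installed.
    from flash_attn_interface import flash_attn_func as flash_attn_3
except ImportError:
    flash_attn_3 = None


def attention(q, k, v, causal=False):
    """q, k, v: (batch, seqlen, nheads, headdim), fp16/bf16, on the same CUDA device."""
    major, minor = torch.cuda.get_device_capability(q.device)
    if major == 9 and flash_attn_3 is not None:
        # Hopper (sm90): use the FA3 kernels.
        out = flash_attn_3(q, k, v, causal=causal)
        # Depending on the installed build this may return `out` or `(out, lse)`.
        return out[0] if isinstance(out, tuple) else out
    # Ampere/Ada and, per the comment above, sm120 (RTX 50 series): FA2 kernels.
    return flash_attn_2(q, k, v, causal=causal)
```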

@melonedo sm120 is pretty similar to sm90 in that regard, just with significantly smaller SMEM. It might be possible to play with the shapes used in FA3 to support it on sm120, since a lot of what FA3 did was introduce TMA support. Otherwise, outside of L1/SMEM capacity and FP4 support, unless I'm missing something else, sm120 is nearly identical to sm90.

Edit: oh, and wgmma; the mma instructions issued will have to change.

wrmedford avatar Nov 12 '25 19:11 wrmedford
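
To make the "play with the shapes" idea concrete, here is a back-of-the-envelope shared-memory budget in the spirit of that suggestion. The per-SM SMEM limits and the tile shapes are assumed round numbers for illustration, not values read out of the FA3 source:

```python
# Rough SMEM budget for an FA3-style pipeline: one Q tile resident plus a few
# pipelined K/V stages. All figures below are illustrative assumptions.
def tile_smem_bytes(block_m, block_n, head_dim, kv_stages=2, bytes_per_elem=2):
    q_tile = block_m * head_dim * bytes_per_elem
    kv_tile = 2 * block_n * head_dim * bytes_per_elem  # one K tile + one V tile
    return q_tile + kv_stages * kv_tile


HOPPER_SMEM = 228 * 1024  # sm90: ~228 KB shared memory per SM (assumed)
SM120_SMEM = 100 * 1024   # sm120: ~100 KB usable per block, as on Ada (assumed)

for name, (bm, bn) in {
    "FA3-like tile":    (128, 128),
    "shrunk for sm120": (128, 64),
}.items():
    need = tile_smem_bytes(bm, bn, head_dim=128)
    print(f"{name:17s} blockM={bm} blockN={bn} -> {need // 1024:3d} KB "
          f"(fits sm90: {need <= HOPPER_SMEM}, fits sm120: {need <= SM120_SMEM})")
```

With these assumed numbers the FA3-like tile needs roughly 160 KB and only fits Hopper, while halving blockN brings it under the smaller sm120 budget, which is the kind of shape change the comment is pointing at.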

The biggest problem with sm120 is that it has the same mma performance as previous graphics cards: 1/4 of Hopper, and 1/8 if using an fp32 accumulator. As a result, sm120 has the same compute-memory-control ratio as previous graphics cards, so whatever works on a 3090 also works on a 5090. TMA may help, but it is not as important as it is on Hopper.


melonedo avatar Nov 13 '25 01:11 melonedo
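
The ratio argument can be made concrete with a quick roofline-style calculation of machine balance (tensor-core FLOPs per byte of DRAM bandwidth). The figures below are rough ballpark placeholders, not official specs; only the relative ratios matter:

```python
# Rough machine-balance comparison. Spec numbers are ballpark placeholders for
# illustration; swap in measured values before drawing real conclusions.
gpus = {
    # name: (peak dense fp16 tensor TFLOPS w/ fp32 accumulate, DRAM bandwidth TB/s)
    "RTX 4090 (sm89)":  (165.0, 1.01),
    "RTX 5090 (sm120)": (210.0, 1.79),
    "H100 SXM (sm90)":  (990.0, 3.35),
}

for name, (tflops, tbps) in gpus.items():
    balance = tflops / tbps  # FLOPs available per byte moved to/from DRAM
    print(f"{name:17s} ~{balance:4.0f} FLOP/byte of machine balance")

# The consumer parts land in the same general range as the previous generation,
# while Hopper sits much higher -- the sense in which "whatever works on a 3090
# also works on a 5090" above.
```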

So, when will FA4 be released? Can't wait!

WingsOfPanda avatar Nov 13 '25 08:11 WingsOfPanda

The biggest problem with sm120 is that it has the same mma performance as previous graphics cards: 1/4 of Hopper, and 1/8 if using an fp32 accumulator. As a result, sm120 has the same compute-memory-control ratio as previous graphics cards, so whatever works on a 3090 also works on a 5090. TMA may help, but it is not as important as it is on Hopper.

Block scaled data types on 5090s seem to be significantly faster than expected based on some results I've seen in perf groups I'm in (mxfp8 specifically), so I wouldn't write this direction off out of hand without empirical testing.

wrmedford avatar Nov 13 '25 15:11 wrmedford

The biggest problem with sm120 is that it has the same mma performance as previous graphics cards: 1/4 of Hopper, and 1/8 if using an fp32 accumulator. As a result, sm120 has the same compute-memory-control ratio as previous graphics cards, so whatever works on a 3090 also works on a 5090. TMA may help, but it is not as important as it is on Hopper.

Block scaled data types on 5090s seem to be significantly faster than expected based on some results I've seen in perf groups I'm in (mxfp8 specifically), so I wouldn't write this direction off out of hand without empirical testing.

Oh, that would be exciting! In that case it would certainly be interesting to see how it maximizes the compute on 5090s. A lot of people consider sm120 to be close to sm100, so I made the previous statement in the hope of getting some clarification. I apologize if it is incorrect.

melonedo avatar Nov 13 '25 15:11 melonedo
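
For context on the exchange above: "block-scaled" refers to formats in the OCP MX family such as mxfp8, where a small group of elements shares one power-of-two scale. A toy sketch of the idea follows; it is purely illustrative, unrelated to any kernel in this repo, and assumes a PyTorch build with float8 dtypes:

```python
# Toy mxfp8-style block scaling: every 32 values share one power-of-two scale,
# and each value is stored as fp8 e4m3 relative to that scale.
import torch

BLOCK = 32          # elements per shared scale in the OCP MX formats
E4M3_MAX = 448.0    # largest finite fp8 e4m3 value


def mx_quantize(x: torch.Tensor):
    """x: 1-D float tensor whose length is a multiple of BLOCK."""
    blocks = x.view(-1, BLOCK)
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    # One power-of-two scale per block, rounded up so the block max stays in range.
    scale = torch.exp2(torch.ceil(torch.log2(amax / E4M3_MAX)))
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q, scale


def mx_dequantize(q, scale):
    return (q.to(torch.float32) * scale).view(-1)


x = torch.randn(1024) * 3.0
q, s = mx_quantize(x)
err = (mx_dequantize(q, s) - x).abs().max().item()
print(f"max abs round-trip error: {err:.4f}")
```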

Hi Team,

Thank you for your great work on FlashAttention.

I am interested in deploying vLLM and SGLang on Thor, and I would like to ask about the timeline of FlashAttention for supporting sm_120 or sm_121. Also, is it feasible to deploy FlashAttention on Thor?

Thank you for your help!

cwh83118 avatar Nov 18 '25 08:11 cwh83118