
Great work! When will flash attention 4 be released?

Open moveforever opened this issue 2 months ago • 13 comments

moveforever avatar Aug 27 '25 02:08 moveforever

Soon, 3-4 weeks

tridao avatar Aug 27 '25 03:08 tridao

https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/cute/flash_fwd_sm100.py Isn't this fa4?

jc19chaoj avatar Aug 28 '25 02:08 jc19chaoj

It's still a work in progress and not complete yet

tridao avatar Aug 28 '25 05:08 tridao

Soon, 3-4 weeks

Great!! Will backward kernel also be released?

retonym avatar Sep 01 '25 12:09 retonym

Yes

tridao avatar Sep 01 '25 14:09 tridao

Soon, 3-4 weeks

I believe now we can say 1-2 weeks? ^^

WingsOfPanda avatar Sep 09 '25 08:09 WingsOfPanda

At such low precision (fp4 on Blackwell), will it still be exact, or will it have some info loss, like in the SageAttention variants?

kabachuha avatar Sep 09 '25 14:09 kabachuha

Soon, 3-4 weeks

time is up😁

joy-seu avatar Sep 23 '25 23:09 joy-seu

Time is up!

BNAadministrator3 avatar Oct 23 '25 09:10 BNAadministrator3

Time is up!


Minwellcym avatar Oct 24 '25 04:10 Minwellcym

🥹

puppetm4st3r avatar Oct 28 '25 18:10 puppetm4st3r

Does flash attention 4 support sm120? @tridao, thanks for your great contributions.

moveforever avatar Oct 30 '25 11:10 moveforever

Does flash attention 4 support sm120? @tridao, thanks for your great contributions.

Sm120 is for the RTX 50 series GPUs, which have exactly the same architecture as the previous RTX 30 and RTX 40 series except for fp4 support and TMA. So Flash Attention 2 works well (maybe with a bit of room for improvement thanks to TMA), while Flash Attention 3 and 4 DO NOT WORK ON SM120.

melonedo avatar Nov 01 '25 12:11 melonedo
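
A minimal sketch of the dispatch this implies, assuming the stock `flash_attn` 2.x package plus, optionally, the FA3 build from the repo's hopper/ directory. The `flash_attn_interface` module name and its return convention are treated as assumptions here, not a documented contract, and the capability check simply encodes the claim above rather than an official support matrix:

```python
# Illustrative dispatch only: FA2 covers sm80-sm120, while the FA3 build
# (the repo's hopper/ directory) targets sm90.
import torch
from flash_attn import flash_attn_func as flash_attn_2  # FlashAttention-2 API

try:
    # FA3 interface from the hopper/ build; optional, may not be installed.
    from flash_attn_interface import flash_attn_func as flash_attn_3
except ImportError:
    flash_attn_3 = None


def attention(q, k, v, causal=False):
    """q, k, v: (batch, seqlen, nheads, headdim), fp16/bf16, on the same CUDA device."""
    major, minor = torch.cuda.get_device_capability(q.device)
    if major == 9 and flash_attn_3 is not None:
        # Hopper (sm90): use the FA3 kernels.
        out = flash_attn_3(q, k, v, causal=causal)
        # Depending on the installed build this may return `out` or `(out, lse)`.
        return out[0] if isinstance(out, tuple) else out
    # Ampere/Ada and, per the comment above, sm120 (RTX 50 series): FA2 kernels.
    return flash_attn_2(q, k, v, causal=causal)
```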

@melonedo sm120 is pretty similar to sm90 in that regard, just with significantly smaller SMEM. It might be possible to play with the shapes used in FA3 to support it on sm120, since a lot of what FA3 did was introduce TMA support. Otherwise, outside of L1/SMEM capacity and FP4 support, unless I'm missing something else, sm120 is nearly identical to sm90.

Edit: oh, and wgmma; the mma instructions issued will have to change.

wrmedford avatar Nov 12 '25 19:11 wrmedford
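
To make the "play with the shapes" idea concrete, here is a back-of-the-envelope shared-memory budget in the spirit of that suggestion. The per-SM SMEM limits and the tile shapes are assumed round numbers for illustration, not values read out of the FA3 source:

```python
# Rough SMEM budget for an FA3-style pipeline: one Q tile resident plus a few
# pipelined K/V stages. All figures below are illustrative assumptions.
def tile_smem_bytes(block_m, block_n, head_dim, kv_stages=2, bytes_per_elem=2):
    q_tile = block_m * head_dim * bytes_per_elem
    kv_tile = 2 * block_n * head_dim * bytes_per_elem  # one K tile + one V tile
    return q_tile + kv_stages * kv_tile


HOPPER_SMEM = 228 * 1024  # sm90: ~228 KB shared memory per SM (assumed)
SM120_SMEM = 100 * 1024   # sm120: ~100 KB usable per block, as on Ada (assumed)

for name, (bm, bn) in {
    "FA3-like tile":    (128, 128),
    "shrunk for sm120": (128, 64),
}.items():
    need = tile_smem_bytes(bm, bn, head_dim=128)
    print(f"{name:17s} blockM={bm} blockN={bn} -> {need // 1024:3d} KB "
          f"(fits sm90: {need <= HOPPER_SMEM}, fits sm120: {need <= SM120_SMEM})")
```

With these assumed numbers the FA3-like tile needs roughly 160 KB and only fits Hopper, while halving blockN brings it under the smaller sm120 budget, which is the kind of shape change the comment is pointing at.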

The biggest problem with sm120 is that it has the same mma performance as previous graphics cards: 1/4 of Hopper, and 1/8 if using an fp32 accumulator. As a result, sm120 has the same compute-memory-control ratio as previous graphics cards, so whatever works on a 3090 also works on a 5090. TMA may help, but it is not as important as it is on Hopper.


melonedo avatar Nov 13 '25 01:11 melonedo
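
The ratio argument can be made concrete with a quick roofline-style calculation of machine balance (tensor-core FLOPs per byte of DRAM bandwidth). The figures below are rough ballpark placeholders, not official specs; only the relative ratios matter:

```python
# Rough machine-balance comparison. Spec numbers are ballpark placeholders for
# illustration; swap in measured values before drawing real conclusions.
gpus = {
    # name: (peak dense fp16 tensor TFLOPS w/ fp32 accumulate, DRAM bandwidth TB/s)
    "RTX 4090 (sm89)":  (165.0, 1.01),
    "RTX 5090 (sm120)": (210.0, 1.79),
    "H100 SXM (sm90)":  (990.0, 3.35),
}

for name, (tflops, tbps) in gpus.items():
    balance = tflops / tbps  # FLOPs available per byte moved to/from DRAM
    print(f"{name:17s} ~{balance:4.0f} FLOP/byte of machine balance")

# The consumer parts land in the same general range as the previous generation,
# while Hopper sits much higher -- the sense in which "whatever works on a 3090
# also works on a 5090" above.
```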

So, when will FA4 be released? Can't wait!

WingsOfPanda avatar Nov 13 '25 08:11 WingsOfPanda

The biggest problem with sm120 is that it has the same mma performance as previous graphics cards: 1/4 of Hopper, and 1/8 if using an fp32 accumulator. As a result, sm120 has the same compute-memory-control ratio as previous graphics cards, so whatever works on a 3090 also works on a 5090. TMA may help, but it is not as important as it is on Hopper.

Block scaled data types on 5090s seem to be significantly faster than expected based on some results I've seen in perf groups I'm in (mxfp8 specifically), so I wouldn't write this direction off out of hand without empirical testing.

wrmedford avatar Nov 13 '25 15:11 wrmedford

The biggest problem with sm120 is that it has the same mma performance as previous graphics cards: 1/4 of Hopper, and 1/8 if using an fp32 accumulator. As a result, sm120 has the same compute-memory-control ratio as previous graphics cards, so whatever works on a 3090 also works on a 5090. TMA may help, but it is not as important as it is on Hopper.

Block scaled data types on 5090s seem to be significantly faster than expected based on some results I've seen in perf groups I'm in (mxfp8 specifically), so I wouldn't write this direction off out of hand without empirical testing.

Oh, that would be exciting! In that case it would certainly be interesting to see how it maximizes the compute on 5090s. A lot of people consider sm120 to be close to sm100, so I made the previous statement in the hope of getting some clarification. I apologize if it is incorrect.

melonedo avatar Nov 13 '25 15:11 melonedo
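
For context on the exchange above: "block-scaled" refers to formats in the OCP MX family such as mxfp8, where a small group of elements shares one power-of-two scale. A toy sketch of the idea follows; it is purely illustrative, unrelated to any kernel in this repo, and assumes a PyTorch build with float8 dtypes:

```python
# Toy mxfp8-style block scaling: every 32 values share one power-of-two scale,
# and each value is stored as fp8 e4m3 relative to that scale.
import torch

BLOCK = 32          # elements per shared scale in the OCP MX formats
E4M3_MAX = 448.0    # largest finite fp8 e4m3 value


def mx_quantize(x: torch.Tensor):
    """x: 1-D float tensor whose length is a multiple of BLOCK."""
    blocks = x.view(-1, BLOCK)
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    # One power-of-two scale per block, rounded up so the block max stays in range.
    scale = torch.exp2(torch.ceil(torch.log2(amax / E4M3_MAX)))
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q, scale


def mx_dequantize(q, scale):
    return (q.to(torch.float32) * scale).view(-1)


x = torch.randn(1024) * 3.0
q, s = mx_quantize(x)
err = (mx_dequantize(q, s) - x).abs().max().item()
print(f"max abs round-trip error: {err:.4f}")
```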

Hi Team,

Thank you for your great work on FlashAttention.

I am interested in deploying vLLM and SGLang on Thor, and I would like to ask about the timeline of FlashAttention for supporting sm_120 or sm_121. Also, is it feasible to deploy FlashAttention on Thor?

Thank you for your help!

cwh83118 avatar Nov 18 '25 08:11 cwh83118