flash-attention
Great work! When will Flash Attention 4 be released?
Soon, 3-4 weeks
https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/cute/flash_fwd_sm100.py Isn't this fa4?
It's still a work in progress and not complete yet
Soon, 3-4 weeks
Great!! Will backward kernel also be released?
Yes
Soon, 3-4 weeks
I believe now we can say 1-2 weeks? ^^
At such low precision (fp4 on Blackwell), will it still be exact, or will it have some information loss, like the SageAttention variants?
Soon, 3-4 weeks
time is up😁
Time is up!
🥹
Does Flash Attention 4 support sm120? @tridao Thanks for your great contributions.
Sm120 is for RTX 50 series GPUs, which have essentially the same architecture as the previous RTX 30 and RTX 40 series, except for support for fp4 and TMA. So Flash Attention 2 works well (maybe with a bit of room for enhancement thanks to TMA), but Flash Attention 3 and 4 DO NOT WORK ON SM120.
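For anyone unsure what their card will actually run, here is a minimal dispatch sketch (assuming the flash-attn 2.x Python package and its public flash_attn_func interface; the capability check is an illustration, not the library's internal logic):

```python
import torch
from flash_attn import flash_attn_func  # FA2 public interface

def attention(q, k, v, causal=False):
    """Pick a kernel based on compute capability. On sm120 (RTX 50 series)
    only the FA2 path is assumed to be available; FA3/FA4 target sm90/sm100."""
    major, minor = torch.cuda.get_device_capability(q.device)
    if (major, minor) >= (12, 0):
        return flash_attn_func(q, k, v, causal=causal)  # FA2 fallback on sm120
    # On sm90/sm100, a newer kernel could be dispatched here instead.
    return flash_attn_func(q, k, v, causal=causal)

# q, k, v are (batch, seqlen, nheads, headdim) in fp16/bf16.
q = k = v = torch.randn(1, 1024, 8, 128, dtype=torch.bfloat16, device="cuda")
out = attention(q, k, v, causal=True)
```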
@melonedo sm120 is pretty similar to sm90 in that regard, just with significantly smaller SMEM. It might be possible to play with the shapes used in FA3 to support it on sm120 since a lot of what FA3 did was introduce TMA support. Otherwise outside of L1/SMEM capacity and FP4 support, unless I'm missing something else, sm120 is nearly identical to sm90.
Edit: oh, and wgmma; the mma instructions issued will have to change.
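To make the shape-tuning idea concrete, here is a back-of-the-envelope sketch (the tile sizes, staging, and SMEM budgets below are illustrative assumptions, not FA3's real configuration; check the CUDA docs for the exact per-SM limits):

```python
def tile_smem_bytes(block_m, block_n, head_dim, bytes_per_elem=2, kv_stages=2):
    """Rough shared-memory footprint of one attention tile:
    a Q block plus pipelined K and V blocks (fp16/bf16 = 2 bytes/elem)."""
    q_tile = block_m * head_dim
    kv_tiles = kv_stages * 2 * block_n * head_dim  # K and V, multi-buffered
    return (q_tile + kv_tiles) * bytes_per_elem

# Illustrative budgets only.
SM90_SMEM = 228 * 1024    # Hopper, roughly
SM120_SMEM = 100 * 1024   # consumer Blackwell, assumed to be much smaller

for block_m, block_n in [(128, 128), (128, 64), (64, 64)]:
    need = tile_smem_bytes(block_m, block_n, head_dim=128)
    fits = "fits" if need <= SM120_SMEM else "too big"
    print(f"{block_m}x{block_n}: {need // 1024} KiB -> {fits} on sm120")
```

With these assumed numbers, a Hopper-sized 128x128 tile would overflow the smaller budget, while 128x64 or 64x64 tiles would fit, which is the kind of shape adjustment meant above.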
The biggest problem of sm120 is that it has the same mma performance as previous graphics cards: 1/4 of Hopper, and 1/8 if using an fp32 accumulator. As a result, sm120 has the same compute-memory-control ratio as previous graphics cards, so whatever works on a 3090 also works on a 5090, and TMA may help, but it is not as important as it is on Hopper.
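As a back-of-the-envelope way to compare that ratio across cards, a trivial sketch (the numbers below are placeholders for illustration, not measured or official specs; plug in real datasheet values):

```python
def flops_per_byte(peak_tflops, mem_bandwidth_gbs):
    """Peak tensor-core throughput divided by DRAM bandwidth; kernels tuned
    for one ratio tend to carry over to cards with a similar ratio."""
    return (peak_tflops * 1e12) / (mem_bandwidth_gbs * 1e9)

# Placeholder numbers, for illustration only.
print(flops_per_byte(peak_tflops=150, mem_bandwidth_gbs=900))    # older consumer card
print(flops_per_byte(peak_tflops=400, mem_bandwidth_gbs=1800))   # newer consumer card
print(flops_per_byte(peak_tflops=1000, mem_bandwidth_gbs=3300))  # data-center card
```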
So, when will FA4 be released? Can't wait!
Block scaled data types on 5090s seem to be significantly faster than expected based on some results I've seen in perf groups I'm in (mxfp8 specifically), so I wouldn't write this direction off out of hand without empirical testing.
Oh that would be exciting! In that case it would certainly be interesting to see how it maximizes the compute on 5090s. A lot of people consider sm120 to be close to sm100, so I made the previous statement in the hope of clarification. I apologize if it is incorrect.
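For the empirical side, a quick timing sketch (assumes a flash-attn 2.x build that runs on the card; the shapes, dtype, and FLOP count are just example choices):

```python
import torch
from flash_attn import flash_attn_func

def bench(seqlen=4096, nheads=16, headdim=128, iters=50, dtype=torch.bfloat16):
    q, k, v = (torch.randn(1, seqlen, nheads, headdim, device="cuda", dtype=dtype)
               for _ in range(3))
    for _ in range(5):                      # warm-up
        flash_attn_func(q, k, v, causal=True)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        flash_attn_func(q, k, v, causal=True)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    # Causal attention forward FLOPs: ~4 * seqlen^2 * headdim * nheads / 2
    flops = 4 * seqlen * seqlen * headdim * nheads / 2
    print(f"{ms:.3f} ms/iter, ~{flops / (ms * 1e-3) / 1e12:.1f} TFLOPS")

if __name__ == "__main__":
    bench()
```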
Hi Team,
Thank you for your great work on FlashAttention.
I am interested in deploying vLLM and SGLang on Thor, and I would like to ask about the timeline for FlashAttention to support sm_120 or sm_121. Also, is it feasible to deploy FlashAttention on Thor?
Thank you for your help!