flash-attention
Add support for CUDA 12.8 and B200 GPUs
When trying to run on B200, I'm getting:
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
May I ask if there is any plan to support B200 GPU?
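For anyone hitting the same error, a quick way to see what the build is rejecting is to check the device's compute capability. A minimal diagnostic sketch, assuming only PyTorch is installed; B200/GB200 report sm100, which the flash-attn 2.x device check (written for sm8x/sm90) appears to reject:

```python
# Diagnostic sketch: print the compute capability PyTorch sees.
# B200 / GB200 report (10, 0), i.e. sm100; the flash-attn 2.x runtime check
# appears to accept only sm8x/sm90, hence "only supports Ampere GPUs or newer".
import torch

major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: compute capability sm{major}{minor}")
```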
There's a plan, but it'll take a while.
Great to hear! Can you provide an estimated timeline? Thanks!!
No, we don't commit to a public timeline. It really depends on how much time folks are contributing.
FA3 support for Blackwell would be so great to have.
We're working on Blackwell
Hi, any updates on Blackwell?
It's coming along. Meanwhile you can use either cuDNN or the cutlass implementation: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha
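In the meantime, one way to reach the cuDNN attention kernels from Python is through PyTorch's SDPA backend selection. A minimal sketch, not part of this repo, assuming a recent PyTorch build with cuDNN attention enabled for your GPU:

```python
# Sketch: route scaled_dot_product_attention through the cuDNN backend.
# Assumes a recent PyTorch where SDPBackend.CUDNN_ATTENTION is available on this GPU.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

batch, heads, seqlen, headdim = 2, 16, 4096, 128
q = torch.randn(batch, heads, seqlen, headdim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict SDPA to the cuDNN fused-attention kernel within this context.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```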
Great to hear! Will it also support the backward pass?
Yes
2.5.0 ?
Is there any update on this? PyTorch already supports Blackwell. Regarding flash-attention support, we are completely in the dark: no PR, no timeline, no suggestions on how to help...
I can use FlashAttention on GH200, GB200, and B200. Some kernels still need to be updated to support Blackwell, but be patient.
It uses Ampere instructions and is slow. All I am asking for is a timeline and suggestions on how to help; that should really take about five minutes to add to this thread... That way people can plan accordingly rather than just waiting cluelessly.
@tridao you should ask for donations
We're building on the cute-dsl example here: https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/blackwell/fmha.py If you'd like to help, you can start porting the backward pass from C++ to Cute-DSL: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha
@tridao do you plan to support aarch64 natively? https://github.com/Dao-AILab/flash-attention/pull/1507 I can run your repository on GH200, Jetson Orin and Thor, and now on GB200. With CUDA 13 it will also be available on Spark.
Yes we plan to support aarch64 (because of GB200). Currently cute-dsl doesn't have a wheel on aarch64 yet (only x86_64) but that will be fixed soon
Thanks @tridao for the answer. I have it on Jetson (Tegra) and GH200. On GB200 I can compile it and it works, but the performance is poor, which is expected (support is upcoming). Thank you for your work.
How is performance on GH200?
so good... it's fast
Still no B200?
https://github.com/Dao-AILab/flash-attention/blob/b517a592049ed81a4cf9ad3aa4b4a7372e9d9a56/flash_attn/cute/flash_fwd_sm100.py
Thanks! Sorry if this is a stupid question.
But to use it on B200s, what would I have to do? I followed this:
cd hopper
python setup.py install
But I get the following error when I run benchmark_attn.py:
flash_fwd_launch_template.h:180): no kernel image is available for execution on the device
Any help on how to set up flash-attn on B200 would be very helpful.
I compiled it for GH200 and it works well: https://pypi.jetson-ai-lab.dev/sbsa/cu129/flash-attn/2.8.0.post2
For B200 you'd need to install nvidia-cutlass-dsl, and the interface is here:
https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/cute/interface.py
You can call it the same way the test does:
https://github.com/Dao-AILab/flash-attention/blob/525fb4323bc0d2a02b640a1f8a9d5c48a5c59f1b/tests/cute/test_flash_attn.py#L161
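For example, a hedged sketch of calling that interface on a B200; the exact keyword arguments live in interface.py and the test above, and I'm assuming the usual (batch, seqlen, nheads, headdim) layout and a causal flag:

```python
# Sketch of calling the CuTe-DSL kernels on a B200 (sm100).
# Assumes `pip install nvidia-cutlass-dsl` succeeded and that flash_attn_func
# in flash_attn/cute/interface.py accepts (q, k, v, ..., causal=...) with
# tensors in (batch, seqlen, nheads, headdim) layout, as in the linked test.
import torch
from flash_attn.cute.interface import flash_attn_func

batch, seqlen, nheads, headdim = 2, 4096, 16, 128
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out, lse = flash_attn_func(q, k, v, causal=True)  # output and logsumexp
```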
Thanks @tridao
nvidia-cutlass-dsl (https://pypi.org/project/nvidia-cutlass-dsl/) doesn't seem to have a wheel for aarch64.
Is there a way I could go about installing flash-attn / nvidia-cutlass-dsl on B200s? Thanks again.
I'm hearing aarch64 wheels will be coming soon (on the order of weeks).
Thanks @johnnynunez, are the speeds better than the ones reported here -- https://github.com/Dao-AILab/flash-attention/issues/1589? I am able to get flash-attn installed on B200s; I was just wondering if there is a faster version than the Ampere one.
I created this PR, which solves that, but I hope NVIDIA sends you a GB200 like they did for the SGLang and vLLM teams. https://github.com/Dao-AILab/flash-attention/pull/1507