
Add support for CUDA 12.8 and B200 GPUs

Open ofirkris opened this issue 9 months ago • 52 comments

When trying to run on a B200, I'm getting: RuntimeError: FlashAttention only supports Ampere GPUs or newer.

ofirkris avatar Jan 25 '25 19:01 ofirkris

May I ask if there is any plan to support the B200 GPU?

wangli1426 avatar Mar 01 '25 15:03 wangli1426

There's a plan, but it'll take a while.

tridao avatar Mar 04 '25 17:03 tridao

There's a plan, but it'll take a while.

Great to hear! Can you provide an estimated timeline? Thanks!!

ofirkris avatar Mar 05 '25 21:03 ofirkris

No, we don't commit to a public timeline. It really depends on how much time folks are contributing.

tridao avatar Mar 05 '25 21:03 tridao

FA3 support for Blackwell would be so great to have.

cassanof avatar Mar 22 '25 07:03 cassanof

We're working on Blackwell

tridao avatar Mar 22 '25 20:03 tridao

Hi, any updates on Blackwell?

beginlner avatar Apr 17 '25 03:04 beginlner

It's coming along. Meanwhile, you can use either cuDNN or the CUTLASS implementation: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha
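
For example, something along these lines routes attention through cuDNN in the meantime (untested sketch; assumes a recent PyTorch build that exposes the cuDNN SDPA backend):

# Untested sketch: use PyTorch's cuDNN SDPA backend as a stopgap on Blackwell.
# Assumes a recent PyTorch where torch.nn.attention.SDPBackend.CUDNN_ATTENTION is available.
import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

# (batch, heads, seqlen, headdim) layout expected by scaled_dot_product_attention
q, k, v = (torch.randn(2, 16, 4096, 128, device="cuda", dtype=torch.bfloat16) for _ in range(3))
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)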

tridao avatar Apr 17 '25 21:04 tridao

Great to hear! Will it also support the backward pass?

beginlner avatar Apr 18 '25 00:04 beginlner

Yes

tridao avatar Apr 18 '25 01:04 tridao

Yes

2.5.0?

johnnynunez avatar Apr 20 '25 19:04 johnnynunez

Is there any update on this? PyTorch already supports Blackwell. Regarding flash-attention support, we are completely in the dark: no PR, no timeline, no suggestions on how to help...

PiotrDabkowski avatar May 02 '25 11:05 PiotrDabkowski

Is there any update on this? PyTorch already supports Blackwell. Regarding flash-attention support, we are completely in the dark: no PR, no timeline, no suggestions on how to help...

I can use flash-attention on GH200, GB200, and B200. Some kernels must be updated to support Blackwell, but be patient.

johnnynunez avatar May 02 '25 16:05 johnnynunez

It uses Ampere instructions and is slow. All I am asking for is a timeline and suggestions on how to help; that should really be about a five-minute effort to add to this thread... This way people can plan accordingly rather than just waiting cluelessly.

PiotrDabkowski avatar May 06 '25 09:05 PiotrDabkowski

@tridao you should ask for donations

bhaktatejas922 avatar Jun 03 '25 21:06 bhaktatejas922

We're building on the CuTe-DSL example here: https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/blackwell/fmha.py. If you'd like to help, you can start porting the backward pass from C++ to CuTe-DSL: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha

tridao avatar Jun 03 '25 21:06 tridao

We're building on the CuTe-DSL example here: https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/blackwell/fmha.py. If you'd like to help, you can start porting the backward pass from C++ to CuTe-DSL: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha

@tridao do you plan to support aarch64 natively? https://github.com/Dao-AILab/flash-attention/pull/1507 I can run your repository on GH200, Jetson Orin and Thor, and now on GB200. With CUDA 13 it will be available with Spark as well.

johnnynunez avatar Jun 04 '25 10:06 johnnynunez

Yes, we plan to support aarch64 (because of GB200). Currently cute-dsl doesn't have an aarch64 wheel yet (only x86_64), but that will be fixed soon.

tridao avatar Jun 04 '25 16:06 tridao

Yes, we plan to support aarch64 (because of GB200). Currently cute-dsl doesn't have an aarch64 wheel yet (only x86_64), but that will be fixed soon.

Thanks @tridao for answering! I have it on Jetson (Tegra) and GH200. On GB200 I can compile it and it works, but the performance is bad, which is expected (support is upcoming). Thank you for your work.

johnnynunez avatar Jun 04 '25 16:06 johnnynunez

Yes, we plan to support aarch64 (because of GB200). Currently cute-dsl doesn't have an aarch64 wheel yet (only x86_64), but that will be fixed soon.

Thanks @tridao for answering! I have it on Jetson (Tegra) and GH200. On GB200 I can compile it and it works, but the performance is bad, which is expected (support is upcoming). Thank you for your work.

How is performance on GH200?

bhaktatejas922 avatar Jun 04 '25 16:06 bhaktatejas922

Yes, we plan to support aarch64 (because of GB200). Currently cute-dsl doesn't have an aarch64 wheel yet (only x86_64), but that will be fixed soon.

Thanks @tridao for answering! I have it on Jetson (Tegra) and GH200. On GB200 I can compile it and it works, but the performance is bad, which is expected (support is upcoming). Thank you for your work.

How is performance on GH200?

so good... it's fast

johnnynunez avatar Jun 04 '25 17:06 johnnynunez

Still no B200?

ehartford avatar Jun 29 '25 06:06 ehartford

https://github.com/Dao-AILab/flash-attention/blob/b517a592049ed81a4cf9ad3aa4b4a7372e9d9a56/flash_attn/cute/flash_fwd_sm100.py

tridao avatar Jun 30 '25 05:06 tridao

Thanks! Sorry this is a stupid question.

But to use it on B200s, what would I have to do? I followed this:

cd hopper
python setup.py install

But I get the following error when I run benchmark_attn.py:

flash_fwd_launch_template.h:180): no kernel image is available for execution on the device

Any help on how to set up flash-attn on B200s would be very helpful.

klabplab avatar Jul 04 '25 04:07 klabplab

Thanks! Sorry this is a stupid question.

But to use it on B200s, what would I have to do? I followed this:

cd hopper
python setup.py install

But I get the following error when I run benchmark_attn.py:

flash_fwd_launch_template.h:180): no kernel image is available for execution on the device

Any help on how to set up flash-attn on B200s would be very helpful.

I compiled it for GH200 and it is working well: https://pypi.jetson-ai-lab.dev/sbsa/cu129/flash-attn/2.8.0.post2

johnnynunez avatar Jul 04 '25 16:07 johnnynunez

Thanks! Sorry this is a stupid question.

But to use it on B200s, what would I have to do? I followed this:

cd hopper
python setup.py install

But I get the following error when I run benchmark_attn.py:

flash_fwd_launch_template.h:180): no kernel image is available for execution on the device

Any help on how to set up flash-attn on B200s would be very helpful.

For B200 you'd need to install nvidia-cutlass-dsl; the interface is here: https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/cute/interface.py. You can call it the way the test calls it: https://github.com/Dao-AILab/flash-attention/blob/525fb4323bc0d2a02b640a1f8a9d5c48a5c59f1b/tests/cute/test_flash_attn.py#L161
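
Roughly like this (untested sketch; the exact argument names and tensor layout are assumptions based on the linked interface.py and test, so check those files):

# Untested sketch: calling the CuTe-DSL attention via flash_attn.cute.interface on B200.
# Assumes nvidia-cutlass-dsl is installed and the usual flash-attn
# (batch, seqlen, nheads, headdim) layout; see interface.py for the exact signature.
import torch
from flash_attn.cute.interface import flash_attn_func

q = torch.randn(2, 4096, 16, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 4096, 16, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 4096, 16, 128, device="cuda", dtype=torch.bfloat16)
out, lse = flash_attn_func(q, k, v, causal=True)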

tridao avatar Jul 04 '25 16:07 tridao

Thanks! Sorry this is a stupid question. But to use it on B200s, what would I have to do? I followed this:

cd hopper
python setup.py install

But I get the following error when I run benchmark_attn.py: flash_fwd_launch_template.h:180): no kernel image is available for execution on the device. Any help on how to set up flash-attn on B200s would be very helpful.

For B200 you'd need to install nvidia-cutlass-dsl; the interface is here: https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/cute/interface.py. You can call it the way the test calls it:

flash-attention/tests/cute/test_flash_attn.py, line 161 in 525fb43: out, lse = flash_attn_func(

Thanks @tridao

nvidia-cutlass-dsl (https://pypi.org/project/nvidia-cutlass-dsl/) doesn't seem to have a wheel for aarch64.

Is there a way I could go about installing flash-attn/nvidia-cutlass-dsl on B200s? Thanks again.

klabplab avatar Jul 04 '25 18:07 klabplab

I'm hearing aarch64 wheels will be coming soon (on the order of weeks).

tridao avatar Jul 04 '25 18:07 tridao

Thanks! Sorry this is a stupid question. But to use it on B200s, what would I have to do? I followed this:

cd hopper
python setup.py install

But I get the following error when I run benchmark_attn.py: flash_fwd_launch_template.h:180): no kernel image is available for execution on the device. Any help on how to set up flash-attn on B200s would be very helpful.

I compiled it for GH200 and it is working well: https://pypi.jetson-ai-lab.dev/sbsa/cu129/flash-attn/2.8.0.post2

Thanks @johnnynunez, are the speeds better than the ones reported here: https://github.com/Dao-AILab/flash-attention/issues/1589? I am able to get flash-attn installed on B200s; I was just wondering if there is a faster version than the Ampere version.

klabplab avatar Jul 04 '25 18:07 klabplab

I'm hearing aarch64 wheels will be coming soon (on the order of weeks).

I created this PR, which solves that, but I hope NVIDIA sends you a GB200 like they did for the SGLang and vLLM teams: https://github.com/Dao-AILab/flash-attention/pull/1507

johnnynunez avatar Jul 04 '25 19:07 johnnynunez