
Add support for CUDA 12.8 and B200 GPUs

Open ofirkris opened this issue 9 months ago • 52 comments

When trying to run on a B200, I'm getting: RuntimeError: FlashAttention only supports Ampere GPUs or newer.

ofirkris avatar Jan 25 '25 19:01 ofirkris

May I ask if there is any plan to support the B200 GPU?

wangli1426 avatar Mar 01 '25 15:03 wangli1426

There's a plan, but it'll take a while.

tridao avatar Mar 04 '25 17:03 tridao

There's a plan, but it'll take a while.

Great to hear! Can you provide an estimated timeline? Thanks!!

ofirkris avatar Mar 05 '25 21:03 ofirkris

No, we don't commit to a public timeline. It really depends on how much time folks are contributing.

tridao avatar Mar 05 '25 21:03 tridao

FA3 support for Blackwell would be so great to have.

cassanof avatar Mar 22 '25 07:03 cassanof

We're working on Blackwell

tridao avatar Mar 22 '25 20:03 tridao

Hi, any updates on Blackwell?

beginlner avatar Apr 17 '25 03:04 beginlner

It's coming along. Meanwhile, you can use either cuDNN or the CUTLASS implementation: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha
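
For example, something along these lines routes attention through cuDNN in the meantime (untested sketch; assumes a recent PyTorch build that exposes the cuDNN SDPA backend):

# Untested sketch: use PyTorch's cuDNN SDPA backend as a stopgap on Blackwell.
# Assumes a recent PyTorch where torch.nn.attention.SDPBackend.CUDNN_ATTENTION is available.
import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

# (batch, heads, seqlen, headdim) layout expected by scaled_dot_product_attention
q, k, v = (torch.randn(2, 16, 4096, 128, device="cuda", dtype=torch.bfloat16) for _ in range(3))
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)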

tridao avatar Apr 17 '25 21:04 tridao

Great to hear! Will it also support the backward pass?

beginlner avatar Apr 18 '25 00:04 beginlner

Yes

tridao avatar Apr 18 '25 01:04 tridao

Yes

2.5.0?

johnnynunez avatar Apr 20 '25 19:04 johnnynunez

Is there any update on this? PyTorch already supports Blackwell. Regarding flash-attention support, we are completely in the dark: no PR, no timeline, no suggestions on how to help...

PiotrDabkowski avatar May 02 '25 11:05 PiotrDabkowski

Is there any update on this? PyTorch already supports Blackwell. Regarding flash-attention support, we are completely in the dark: no PR, no timeline, no suggestions on how to help...

I can use flash-attention on GH200, GB200, and B200. Some kernels must be updated to support Blackwell, but be patient.

johnnynunez avatar May 02 '25 16:05 johnnynunez

It uses Ampere instructions and is slow. All I am asking for is a timeline and suggestions on how to help; that should really be about a five-minute effort to add to this thread... This way people can plan accordingly rather than just waiting cluelessly.

PiotrDabkowski avatar May 06 '25 09:05 PiotrDabkowski

@tridao you should ask for donations

bhaktatejas922 avatar Jun 03 '25 21:06 bhaktatejas922

We're building on the CuTe-DSL example here: https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/blackwell/fmha.py. If you'd like to help, you can start porting the backward pass from C++ to CuTe-DSL: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha

tridao avatar Jun 03 '25 21:06 tridao

We're building on the CuTe-DSL example here: https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/blackwell/fmha.py. If you'd like to help, you can start porting the backward pass from C++ to CuTe-DSL: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha

@tridao do you plan to support aarch64 natively? https://github.com/Dao-AILab/flash-attention/pull/1507 I can run your repository on GH200, Jetson Orin and Thor, and now on GB200. With CUDA 13 it will be available with Spark as well.

johnnynunez avatar Jun 04 '25 10:06 johnnynunez

Yes, we plan to support aarch64 (because of GB200). Currently cute-dsl doesn't have an aarch64 wheel yet (only x86_64), but that will be fixed soon.

tridao avatar Jun 04 '25 16:06 tridao

Yes, we plan to support aarch64 (because of GB200). Currently cute-dsl doesn't have an aarch64 wheel yet (only x86_64), but that will be fixed soon.

Thanks @tridao for answering! I have it on Jetson (Tegra) and GH200. On GB200 I can compile it and it works, but the performance is bad, which is expected (support is upcoming). Thank you for your work.

johnnynunez avatar Jun 04 '25 16:06 johnnynunez

Yes, we plan to support aarch64 (because of GB200). Currently cute-dsl doesn't have an aarch64 wheel yet (only x86_64), but that will be fixed soon.

Thanks @tridao for answering! I have it on Jetson (Tegra) and GH200. On GB200 I can compile it and it works, but the performance is bad, which is expected (support is upcoming). Thank you for your work.

How is performance on GH200?

bhaktatejas922 avatar Jun 04 '25 16:06 bhaktatejas922

Yes, we plan to support aarch64 (because of GB200). Currently cute-dsl doesn't have an aarch64 wheel yet (only x86_64), but that will be fixed soon.

Thanks @tridao for answering! I have it on Jetson (Tegra) and GH200. On GB200 I can compile it and it works, but the performance is bad, which is expected (support is upcoming). Thank you for your work.

How is performance on GH200?

so good... it's fast

johnnynunez avatar Jun 04 '25 17:06 johnnynunez

Still no B200?

ehartford avatar Jun 29 '25 06:06 ehartford

https://github.com/Dao-AILab/flash-attention/blob/b517a592049ed81a4cf9ad3aa4b4a7372e9d9a56/flash_attn/cute/flash_fwd_sm100.py

tridao avatar Jun 30 '25 05:06 tridao

Thanks! Sorry this is a stupid question.

But to use it on B200s, what would I have to do? I followed this:

cd hopper
python setup.py install

But I get the following error when I run benchmark_attn.py:

flash_fwd_launch_template.h:180): no kernel image is available for execution on the device

Any help on how to set up flash-attn on B200s would be very helpful.

klabplab avatar Jul 04 '25 04:07 klabplab

Thanks! Sorry this is a stupid question.

But to use it on B200s, what would I have to do? I followed this:

cd hopper
python setup.py install

But I get the following error when I run benchmark_attn.py:

flash_fwd_launch_template.h:180): no kernel image is available for execution on the device

Any help on how to set up flash-attn on B200s would be very helpful.

I compiled it for GH200 and it is working well: https://pypi.jetson-ai-lab.dev/sbsa/cu129/flash-attn/2.8.0.post2

johnnynunez avatar Jul 04 '25 16:07 johnnynunez

Thanks! Sorry this is a stupid question.

But to use it on B200s, what would I have to do? I followed this:

cd hopper
python setup.py install

But I get the following error when I run benchmark_attn.py:

flash_fwd_launch_template.h:180): no kernel image is available for execution on the device

Any help on how to set up flash-attn on B200s would be very helpful.

For B200 you'd need to install nvidia-cutlass-dsl; the interface is here: https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/cute/interface.py. You can call it the way the test calls it: https://github.com/Dao-AILab/flash-attention/blob/525fb4323bc0d2a02b640a1f8a9d5c48a5c59f1b/tests/cute/test_flash_attn.py#L161
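
Roughly like this (untested sketch; the exact argument names and tensor layout are assumptions based on the linked interface.py and test, so check those files):

# Untested sketch: calling the CuTe-DSL attention via flash_attn.cute.interface on B200.
# Assumes nvidia-cutlass-dsl is installed and the usual flash-attn
# (batch, seqlen, nheads, headdim) layout; see interface.py for the exact signature.
import torch
from flash_attn.cute.interface import flash_attn_func

q = torch.randn(2, 4096, 16, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 4096, 16, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 4096, 16, 128, device="cuda", dtype=torch.bfloat16)
out, lse = flash_attn_func(q, k, v, causal=True)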

tridao avatar Jul 04 '25 16:07 tridao

Thanks! Sorry this is a stupid question. But to use it on B200s, what would I have to do? I followed this:

cd hopper
python setup.py install

But I get the following error when I run benchmark_attn.py: flash_fwd_launch_template.h:180): no kernel image is available for execution on the device. Any help on how to set up flash-attn on B200s would be very helpful.

For B200 you'd need to install nvidia-cutlass-dsl; the interface is here: https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/cute/interface.py. You can call it the way the test calls it:

flash-attention/tests/cute/test_flash_attn.py, line 161 in 525fb43: out, lse = flash_attn_func(

Thanks @tridao

nvidia-cutlass-dsl (https://pypi.org/project/nvidia-cutlass-dsl/) doesn't seem to have a wheel for aarch64.

Is there a way I could go about installing flash-attn/nvidia-cutlass-dsl on B200s? Thanks again.

klabplab avatar Jul 04 '25 18:07 klabplab

I'm hearing aarch64 wheels will be coming soon (on the order of weeks).

tridao avatar Jul 04 '25 18:07 tridao

Thanks! Sorry this is a stupid question. But to use it on B200s, what would I have to do? I followed this:

cd hopper
python setup.py install

But I get the following error when I run benchmark_attn.py: flash_fwd_launch_template.h:180): no kernel image is available for execution on the device. Any help on how to set up flash-attn on B200s would be very helpful.

I compiled it for GH200 and it is working well: https://pypi.jetson-ai-lab.dev/sbsa/cu129/flash-attn/2.8.0.post2

Thanks @johnnynunez, are the speeds better than the ones reported here: https://github.com/Dao-AILab/flash-attention/issues/1589? I am able to get flash-attn installed on B200s; I was just wondering if there is a faster version than the Ampere version.

klabplab avatar Jul 04 '25 18:07 klabplab

I'm hearing aarch64 wheels will be coming soon (on the order of weeks).

I created this PR, which solves that, but I hope NVIDIA sends you a GB200 like they did for the SGLang and vLLM teams: https://github.com/Dao-AILab/flash-attention/pull/1507

johnnynunez avatar Jul 04 '25 19:07 johnnynunez