
FlashAttention forward support for Turing

Open ssiu opened this issue 8 months ago • 5 comments

Hi, I tried my hand at implementing the flash attention forward pass for the Turing architecture. This is the repo:

https://github.com/ssiu/flash-attention-turing

As this is still an early implementation, it only supports:

  • head_dim = 128
  • vanilla attention, i.e. no masking
  • seq_len must be divisible by 128

For batch_size = 4, num_heads = 32, head_dim = 128, our implementation is currently around 2x faster than PyTorch's F.scaled_dot_product_attention, which dispatches to Memory-Efficient Attention as its backend on this GPU. This was tested on a T4.

[Image: benchmark comparison against F.scaled_dot_product_attention]
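For context, here is a minimal sketch (not from the linked repo) of how the PyTorch baseline could be timed on a T4, with the memory-efficient backend forced so the comparison matches the one above. It assumes a recent PyTorch that provides torch.nn.attention.sdpa_kernel; the seq_len value is only an example.

```python
# Sketch: time PyTorch's F.scaled_dot_product_attention with the
# memory-efficient backend on a T4 (assumed baseline, not the repo's code).
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

batch_size, num_heads, seq_len, head_dim = 4, 32, 4096, 128  # seq_len divisible by 128
q, k, v = (torch.randn(batch_size, num_heads, seq_len, head_dim,
                       device="cuda", dtype=torch.float16) for _ in range(3))

# Restrict dispatch to the memory-efficient backend.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    for _ in range(3):                                 # warm-up
        F.scaled_dot_product_attention(q, k, v)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = F.scaled_dot_product_attention(q, k, v)      # no mask, i.e. vanilla attention
    end.record()
    torch.cuda.synchronize()
    print(f"baseline SDPA (mem-efficient): {start.elapsed_time(end):.2f} ms")
```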

For long sequences it reaches around 63% of the T4's peak compute throughput.

[Image: compute throughput for long sequences]
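As a rough illustration of where a figure like that comes from (my numbers below are assumptions, not taken from the plot), utilization is the attention forward FLOPs divided by the measured kernel time and the T4's published FP16 tensor-core peak of about 65 TFLOPS:

```python
# Sketch: estimate compute-throughput utilization of the attention forward pass.
# The example timing (elapsed_ms) is hypothetical; only the FLOP count formula
# and the ~65 TFLOPS T4 peak are standard reference values.
def attention_fwd_flops(batch, heads, seq_len, head_dim):
    # Two matmuls (Q @ K^T and P @ V), each ~2 * seq_len^2 * head_dim FLOPs per head.
    return 4 * batch * heads * seq_len ** 2 * head_dim

T4_PEAK_FP16_TFLOPS = 65.0

flops = attention_fwd_flops(batch=4, heads=32, seq_len=8192, head_dim=128)
elapsed_ms = 107.0  # hypothetical measured kernel time
achieved_tflops = flops / (elapsed_ms * 1e-3) / 1e12
print(f"{achieved_tflops:.1f} TFLOPS, "
      f"{100 * achieved_tflops / T4_PEAK_FP16_TFLOPS:.0f}% of peak")
```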

Thanks!

ssiu avatar Mar 12 '25 12:03 ssiu

Wow this is great work!

tridao avatar Mar 13 '25 04:03 tridao

Thanks for this, I have been trying to find a way to get flash attention working on a T4 GPU.

This helps me so much.

OMS24 avatar Mar 16 '25 14:03 OMS24

This is great work! I have been looking for something like this. Is there any new progress?

lwllvyb avatar Mar 21 '25 12:03 lwllvyb

Hi @lwllvyb, yes I will be continuing to work on this.

ssiu avatar Mar 21 '25 21:03 ssiu

Great, kudos for your work!

wangsen-zy avatar Apr 22 '25 08:04 wangsen-zy

@ssiu This is awesome! I can’t wait for this to be official. You’re doing incredible work - keep going, you’re amazing! 🙌💪

michaelsheka avatar Jun 15 '25 13:06 michaelsheka