Tri Dao
Do you want to send a PR to fix?
The code should accept alibi slopes of shape either {nheads} or {batch_size, nheads}. The code is here: https://github.com/Dao-AILab/flash-attention/blob/4d9ba4f018cca5c8ca6c6f1df08fea75f119b06d/csrc/flash_attn/flash_api.cpp#L341
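For reference, a minimal sketch of how the two accepted shapes look at the Python level, assuming the `flash_attn_func` wrapper from recent FA2 releases (the tensor sizes here are purely illustrative):

```python
import torch
from flash_attn import flash_attn_func

batch_size, seqlen, nheads, headdim = 2, 1024, 16, 64
q = torch.randn(batch_size, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Per-head slopes, shared across the batch: shape (nheads,)
slopes_per_head = torch.rand(nheads, device="cuda", dtype=torch.float32)
out1 = flash_attn_func(q, k, v, alibi_slopes=slopes_per_head)

# Per-batch, per-head slopes: shape (batch_size, nheads)
slopes_per_batch = torch.rand(batch_size, nheads, device="cuda", dtype=torch.float32)
out2 = flash_attn_func(q, k, v, alibi_slopes=slopes_per_batch)
```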
https://github.com/Dao-AILab/flash-attention/blob/c4b9015d74bd9f638c6fd574482accf4bbbd4197/csrc/flash_attn/src/flash_fwd_kernel.h#L1080
One call to Philox RNG gives 128 random bits: https://github.com/Dao-AILab/flash-attention/blob/a93359a2bfdedfcd054622e6f595f99d7a23c17e/csrc/flash_attn/src/philox.cuh#L31 We use 8 random bits to generate one dropout mask: https://github.com/Dao-AILab/flash-attention/blob/c4b9015d74bd9f638c6fd574482accf4bbbd4197/csrc/flash_attn/src/dropout.h#L46 So each thread can do dropout on 16 elements...
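A rough Python sketch of that arithmetic (this is not the CUDA kernel; the exact thresholding comparison lives in dropout.h, so the keep/drop rule below is an assumption for illustration only):

```python
import random

def philox_like_draw():
    """Stand-in for one Philox call: returns 128 random bits as an int."""
    return random.getrandbits(128)

def dropout_mask_from_128_bits(bits, dropout_p):
    """Split 128 bits into 16 bytes; each byte decides keep/drop for one element.

    128 bits / 8 bits per element = 16 elements covered by a single RNG call.
    An element is kept when its byte clears an 8-bit cutoff derived from
    dropout_p (illustrative rule, not the kernel's exact comparison).
    """
    threshold = int(dropout_p * 256)
    mask = []
    for i in range(16):
        byte = (bits >> (8 * i)) & 0xFF
        mask.append(byte >= threshold)
    return mask

bits = philox_like_draw()
print(dropout_mask_from_128_bits(bits, dropout_p=0.1))
```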
That makes sense. Similarly, if you try a matrix multiply with some dimension being 132 instead of 128, you'll see a speed difference. Most implementations would implicitly pad the dimensions to be...
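A quick way to see this, sketched with PyTorch (the sizes and iteration counts are just illustrative):

```python
import time
import torch

def time_matmul(k_dim, iters=100):
    """Time an fp16 matmul with inner dimension k_dim on the GPU."""
    a = torch.randn(4096, k_dim, device="cuda", dtype=torch.float16)
    b = torch.randn(k_dim, 4096, device="cuda", dtype=torch.float16)
    for _ in range(10):  # warm-up
        a @ b
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.time() - start) / iters

# On most GPUs the K=132 case takes disproportionately longer than the ~3%
# extra FLOPs would suggest, since the kernel effectively works on
# padded/tiled dimensions.
print("K=128:", time_matmul(128))
print("K=132:", time_matmul(132))
```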
Yes, I'll relax those tests.
No, FA2 is already close to optimal on A100
A100 runs the same code as before.
72 should work. Please provide a script reproducing the error on the FA2 side (i.e. no vLLM).
Please provide a script reproducing the error on the FA2 side.