Tri Dao
Do you want to send a PR to fix?
The code should accept alibi slopes of shape either {nheads} or {batch_size, nheads}. The code is here: https://github.com/Dao-AILab/flash-attention/blob/4d9ba4f018cca5c8ca6c6f1df08fea75f119b06d/csrc/flash_attn/flash_api.cpp#L341
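For reference, a minimal sketch of how the two accepted shapes look at the Python level, assuming the `flash_attn_func` wrapper from recent FA2 releases (the tensor sizes here are purely illustrative):

```python
import torch
from flash_attn import flash_attn_func

batch_size, seqlen, nheads, headdim = 2, 1024, 16, 64
q = torch.randn(batch_size, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Per-head slopes, shared across the batch: shape (nheads,)
slopes_per_head = torch.rand(nheads, device="cuda", dtype=torch.float32)
out1 = flash_attn_func(q, k, v, alibi_slopes=slopes_per_head)

# Per-batch, per-head slopes: shape (batch_size, nheads)
slopes_per_batch = torch.rand(batch_size, nheads, device="cuda", dtype=torch.float32)
out2 = flash_attn_func(q, k, v, alibi_slopes=slopes_per_batch)
```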
https://github.com/Dao-AILab/flash-attention/blob/c4b9015d74bd9f638c6fd574482accf4bbbd4197/csrc/flash_attn/src/flash_fwd_kernel.h#L1080
One call to Philox RNG gives 128 random bits: https://github.com/Dao-AILab/flash-attention/blob/a93359a2bfdedfcd054622e6f595f99d7a23c17e/csrc/flash_attn/src/philox.cuh#L31 We use 8 random bits to generate one dropout mask: https://github.com/Dao-AILab/flash-attention/blob/c4b9015d74bd9f638c6fd574482accf4bbbd4197/csrc/flash_attn/src/dropout.h#L46 So each thread can do dropout on 16 elements...
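A rough Python sketch of that arithmetic (this is not the CUDA kernel; the exact thresholding comparison lives in dropout.h, so the keep/drop rule below is an assumption for illustration only):

```python
import random

def philox_like_draw():
    """Stand-in for one Philox call: returns 128 random bits as an int."""
    return random.getrandbits(128)

def dropout_mask_from_128_bits(bits, dropout_p):
    """Split 128 bits into 16 bytes; each byte decides keep/drop for one element.

    128 bits / 8 bits per element = 16 elements covered by a single RNG call.
    An element is kept when its byte clears an 8-bit cutoff derived from
    dropout_p (illustrative rule, not the kernel's exact comparison).
    """
    threshold = int(dropout_p * 256)
    mask = []
    for i in range(16):
        byte = (bits >> (8 * i)) & 0xFF
        mask.append(byte >= threshold)
    return mask

bits = philox_like_draw()
print(dropout_mask_from_128_bits(bits, dropout_p=0.1))
```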
That makes sense. Similarly, if you try a matrix multiply with some dimension being 132 instead of 128, you'll see a speed difference. Most implementations would implicitly pad the dimensions to be...
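A quick way to see this, sketched with PyTorch (the sizes and iteration counts are just illustrative):

```python
import time
import torch

def time_matmul(k_dim, iters=100):
    """Time an fp16 matmul with inner dimension k_dim on the GPU."""
    a = torch.randn(4096, k_dim, device="cuda", dtype=torch.float16)
    b = torch.randn(k_dim, 4096, device="cuda", dtype=torch.float16)
    for _ in range(10):  # warm-up
        a @ b
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.time() - start) / iters

# On most GPUs the K=132 case takes disproportionately longer than the ~3%
# extra FLOPs would suggest, since the kernel effectively works on
# padded/tiled dimensions.
print("K=128:", time_matmul(128))
print("K=132:", time_matmul(132))
```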
Yes, I'll relax those tests.
No, FA2 is already close to optimal on A100
A100 runs the same code as before.
72 should work. Please provide a script reproducing the error on the FA2 side (i.e. no vLLM).
Please provide a script reproducing the error on the FA2 side.