Tri Dao


The code should accept ALiBi slopes of shape either (nheads,) or (batch_size, nheads). The code is here: https://github.com/Dao-AILab/flash-attention/blob/4d9ba4f018cca5c8ca6c6f1df08fea75f119b06d/csrc/flash_attn/flash_api.cpp#L341
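
For reference, a minimal sketch of how those two shapes are used from the Python side, assuming a flash-attn version recent enough to expose `alibi_slopes` on `flash_attn_func`; the shapes, dtypes, and slope values below are illustrative, not prescribed:

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 512, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Shape (nheads,): one slope per head, shared across the batch.
slopes = torch.tensor([2.0 ** -(i + 1) for i in range(nheads)],
                      device="cuda", dtype=torch.float32)
out = flash_attn_func(q, k, v, causal=True, alibi_slopes=slopes)

# Shape (batch_size, nheads): per-sequence slopes are also accepted.
slopes_per_batch = slopes.expand(batch, nheads).contiguous()
out = flash_attn_func(q, k, v, causal=True, alibi_slopes=slopes_per_batch)
```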

https://github.com/Dao-AILab/flash-attention/blob/c4b9015d74bd9f638c6fd574482accf4bbbd4197/csrc/flash_attn/src/flash_fwd_kernel.h#L1080

One call to the Philox RNG gives 128 random bits: https://github.com/Dao-AILab/flash-attention/blob/a93359a2bfdedfcd054622e6f595f99d7a23c17e/csrc/flash_attn/src/philox.cuh#L31 We use 8 random bits to generate one dropout mask element: https://github.com/Dao-AILab/flash-attention/blob/c4b9015d74bd9f638c6fd574482accf4bbbd4197/csrc/flash_attn/src/dropout.h#L46 So each thread can do dropout on 128 / 8 = 16 elements per Philox call...
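
A minimal NumPy sketch of that arithmetic, illustrative only and not the CUDA kernel itself; the byte-versus-threshold comparison here is an assumption meant to mirror the spirit of the uint8 comparison in dropout.h:

```python
import numpy as np

def dropout_mask_from_philox(words_u32x4: np.ndarray, p_dropout: float) -> np.ndarray:
    # One Philox call yields 4 x 32 = 128 random bits; consuming
    # 8 bits (one byte) per element gives 16 keep/drop decisions.
    random_bytes = words_u32x4.view(np.uint8)   # 4 uint32 -> 16 bytes
    threshold = int(p_dropout * 256)            # p mapped onto [0, 256)
    return random_bytes >= threshold            # keep with prob ~ 1 - p

rng = np.random.default_rng(0)
words = rng.integers(0, 2**32, size=4, dtype=np.uint32)  # stand-in for Philox output
mask = dropout_mask_from_philox(words, p_dropout=0.1)
print(mask.sum(), "of", mask.size, "elements kept")
```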

That makes sense. Similarly, if you try a matrix multiply with some dimension being 132 instead of 128, you'll see a speed difference. Most implementations would implicitly pad the dimensions to be...
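
One quick way to see this, sketched assuming a CUDA build of PyTorch; the matrix sizes and iteration count are arbitrary:

```python
import time
import torch

def bench_matmul(k: int, n: int = 4096, iters: int = 100) -> float:
    a = torch.randn(n, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    a @ b  # warmup
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

# k=128 maps cleanly onto tensor-core tiles; k=132 forces the kernel
# to pad (or handle a ragged tail), so per-FLOP throughput drops.
for k in (128, 132):
    print(f"k={k}: {bench_matmul(k) * 1e3:.3f} ms")
```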

72 should work. Please provide a script reproducing the error on the FA2 side (i.e., no vLLM).
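
A bare-bones starting point for such a repro, sketched under the assumption that the report concerns `flash_attn_func` with head dimension 72; the shapes and dtype are placeholders to be replaced with the failing configuration:

```python
# FA2-only check, no vLLM; assumes flash-attn is installed with CUDA available.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 72  # head dim 72
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # expected: (batch, seqlen, nheads, headdim)
```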
