Tri Dao
Those are fine, I'll relax the tests.
Yes, that's an issue. One approach to address it is to adapt the stream-K method (https://arxiv.org/abs/2301.03598) originally developed for matmul. Lean Attention (https://arxiv.org/abs/2405.10480) does that, but afaik Lean Attention didn't...
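If it helps to see the reduction step concretely, here is a minimal PyTorch sketch (not the repo's implementation; the function name and shapes are made up) of the combine step that any stream-K-style split over the KV dimension needs: each worker returns a partial output plus its log-sum-exp, and the partials are merged with softmax-consistent weights.

```python
import torch

def merge_attention_splits(partial_outs, partial_lses):
    """Merge attention outputs computed over disjoint key/value splits.

    partial_outs: list of (batch, seqlen_q, nheads, headdim) partial outputs
    partial_lses: list of (batch, seqlen_q, nheads) log-sum-exp per split
    """
    lses = torch.stack(partial_lses, dim=0)            # (nsplits, b, sq, h)
    outs = torch.stack(partial_outs, dim=0)            # (nsplits, b, sq, h, d)
    lse_max = lses.max(dim=0, keepdim=True).values
    # Weight each split by its share of the global softmax normalizer.
    weights = torch.exp(lses - lse_max)                # (nsplits, b, sq, h)
    denom = weights.sum(dim=0)                         # (b, sq, h)
    out = (outs * weights.unsqueeze(-1)).sum(dim=0) / denom.unsqueeze(-1)
    lse = lse_max.squeeze(0) + torch.log(denom)        # global log-sum-exp
    return out, lse
```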
No. But you can just calculate that with PyTorch.
Right, we haven't implemented a version for Blackwell. What's running is using the old Ampere instructions.
We have a forward pass for B200 now: https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/cute/interface.py
Did you try MAX_JOBS=1? We can compile this on GitHub runners with 16GB for Linux, though idk about Windows.
Nothing special; you can set them however you like to benchmark. Typically for language modeling one would increase the seqlen and decrease the batch size to maintain the same number...
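As an illustration of that trade-off, a benchmarking sketch along these lines (all sizes are made-up values, not recommendations) could hold `batch * seqlen` constant while sweeping the sequence length:

```python
import torch
from flash_attn import flash_attn_func

total_tokens = 32768          # batch * seqlen held fixed (illustrative)
nheads, headdim = 16, 64

for seqlen in [512, 1024, 2048, 4096, 8192]:
    batch = total_tokens // seqlen
    q, k, v = [torch.randn(batch, seqlen, nheads, headdim,
                           device="cuda", dtype=torch.float16) for _ in range(3)]
    flash_attn_func(q, k, v, causal=True)        # warmup
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(10):
        flash_attn_func(q, k, v, causal=True)
    end.record()
    torch.cuda.synchronize()
    print(f"batch={batch:4d} seqlen={seqlen:5d}: "
          f"{start.elapsed_time(end) / 10:.3f} ms/iter")
```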
You can read the function docstring and the tests: https://github.com/Dao-AILab/flash-attention/blob/main/tests/test_flash_attn.py
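For reference, a minimal call looks roughly like this (shapes are illustrative; see the docstring for the full argument list):

```python
import torch
from flash_attn import flash_attn_func

# q, k, v must be fp16/bf16 CUDA tensors of shape (batch, seqlen, nheads, headdim).
batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)   # same shape as q
```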
How large is the difference?
> > How large is the difference?
>
> About 1e-7 to 1e-6, I'm suspecting it is due to some floating point precision rather than the model itself.

That's probably...
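To put a number on "close enough", one common check (a sketch in the spirit of what the tests do, not the exact test code) is to compare against an fp32 PyTorch reference and look at the max absolute error:

```python
import math
import torch
from flash_attn import flash_attn_func

def attention_ref(q, k, v, causal=True):
    # Plain attention in fp32 as a reference.
    q, k, v = [x.float().transpose(1, 2) for x in (q, k, v)]   # (b, h, s, d)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    if causal:
        s = scores.shape[-1]
        mask = torch.triu(torch.ones(s, s, device=q.device, dtype=torch.bool), 1)
        scores = scores.masked_fill(mask, float("-inf"))
    out = torch.matmul(torch.softmax(scores, dim=-1), v)
    return out.transpose(1, 2)                                  # (b, s, h, d)

q = torch.randn(2, 512, 8, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = flash_attn_func(q, k, v, causal=True)
ref = attention_ref(q, k, v, causal=True)
print((out.float() - ref).abs().max().item())   # expect bf16/fp16-level error
```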