Tri Dao
Those are fine, I'll relax the tests.
Yes, that's an issue. One approach to address it is to adapt the stream-K method (https://arxiv.org/abs/2301.03598) originally developed for matmul. Lean Attention (https://arxiv.org/abs/2405.10480) does that, but afaik Lean Attention didn't...
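If it helps to see the reduction step concretely, here is a minimal PyTorch sketch (not the repo's implementation; the function name and shapes are made up) of the combine step that any stream-K-style split over the KV dimension needs: each worker returns a partial output plus its log-sum-exp, and the partials are merged with softmax-consistent weights.

```python
import torch

def merge_attention_splits(partial_outs, partial_lses):
    """Merge attention outputs computed over disjoint key/value splits.

    partial_outs: list of (batch, seqlen_q, nheads, headdim) partial outputs
    partial_lses: list of (batch, seqlen_q, nheads) log-sum-exp per split
    """
    lses = torch.stack(partial_lses, dim=0)            # (nsplits, b, sq, h)
    outs = torch.stack(partial_outs, dim=0)            # (nsplits, b, sq, h, d)
    lse_max = lses.max(dim=0, keepdim=True).values
    # Weight each split by its share of the global softmax normalizer.
    weights = torch.exp(lses - lse_max)                # (nsplits, b, sq, h)
    denom = weights.sum(dim=0)                         # (b, sq, h)
    out = (outs * weights.unsqueeze(-1)).sum(dim=0) / denom.unsqueeze(-1)
    lse = lse_max.squeeze(0) + torch.log(denom)        # global log-sum-exp
    return out, lse
```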
No. But you can just calculate that with PyTorch.
Right, we haven't implemented a version for Blackwell. What's running is using the old Ampere instructions.
We have a forward pass for B200 now: https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/cute/interface.py
Did you try MAX_JOBS=1? We can compile this on GitHub runners with 16GB for Linux, though idk about Windows.
Nothing special; you can set them however you like to benchmark. Typically for language modeling one would increase the seqlen and decrease the batch size to maintain the same number...
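As an illustration of that trade-off, a benchmarking sketch along these lines (all sizes are made-up values, not recommendations) could hold `batch * seqlen` constant while sweeping the sequence length:

```python
import torch
from flash_attn import flash_attn_func

total_tokens = 32768          # batch * seqlen held fixed (illustrative)
nheads, headdim = 16, 64

for seqlen in [512, 1024, 2048, 4096, 8192]:
    batch = total_tokens // seqlen
    q, k, v = [torch.randn(batch, seqlen, nheads, headdim,
                           device="cuda", dtype=torch.float16) for _ in range(3)]
    flash_attn_func(q, k, v, causal=True)        # warmup
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(10):
        flash_attn_func(q, k, v, causal=True)
    end.record()
    torch.cuda.synchronize()
    print(f"batch={batch:4d} seqlen={seqlen:5d}: "
          f"{start.elapsed_time(end) / 10:.3f} ms/iter")
```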
You can read the function docstring and the tests: https://github.com/Dao-AILab/flash-attention/blob/main/tests/test_flash_attn.py
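For reference, a minimal call looks roughly like this (shapes are illustrative; see the docstring for the full argument list):

```python
import torch
from flash_attn import flash_attn_func

# q, k, v must be fp16/bf16 CUDA tensors of shape (batch, seqlen, nheads, headdim).
batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)   # same shape as q
```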
How large is the difference?
> > How large is the difference?
>
> About 1e-7 to 1e-6, I'm suspecting it is due to some floating point precision rather than the model itself.

That's probably...
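To put a number on "close enough", one common check (a sketch in the spirit of what the tests do, not the exact test code) is to compare against an fp32 PyTorch reference and look at the max absolute error:

```python
import math
import torch
from flash_attn import flash_attn_func

def attention_ref(q, k, v, causal=True):
    # Plain attention in fp32 as a reference.
    q, k, v = [x.float().transpose(1, 2) for x in (q, k, v)]   # (b, h, s, d)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    if causal:
        s = scores.shape[-1]
        mask = torch.triu(torch.ones(s, s, device=q.device, dtype=torch.bool), 1)
        scores = scores.masked_fill(mask, float("-inf"))
    out = torch.matmul(torch.softmax(scores, dim=-1), v)
    return out.transpose(1, 2)                                  # (b, s, h, d)

q = torch.randn(2, 512, 8, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = flash_attn_func(q, k, v, causal=True)
ref = attention_ref(q, k, v, causal=True)
print((out.float() - ref).abs().max().item())   # expect bf16/fp16-level error
```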