Tri Dao

Results: 640 comments of Tri Dao

Clarifying my understanding: if q_stage == 1, is there overlap between softmax and mma?

Sure, would love to see contributions there

As mentioned, in general it's not a good idea to use equality to compare floating-point values.

```
In [11]: a = torch.randn(10, dtype=torch.bfloat16, device='cuda')

In [12]: torch.equal(a + 0.3 -...
```
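The truncated snippet above illustrates the general pitfall. A minimal pure-Python sketch of the same issue (the tensor analogue would be `torch.allclose` instead of `torch.equal`):

```python
import math

# 0.1 + 0.2 is not exactly 0.3 in binary floating point,
# so exact equality fails even though the values are "the same"
print(0.1 + 0.2 == 0.3)                 # False
print(math.isclose(0.1 + 0.2, 0.3))    # True: compare with a tolerance
```

The same reasoning applies to tensors: rounding in bfloat16 makes `a + 0.3 - 0.3` differ bitwise from `a`, so a tolerance-based comparison is the right tool.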

There are 2 code paths, one for local and one for causal. There's no guarantee that they produce identical outputs.

There's some code to detect that some local window size is equivalent to causal and run the causal path instead (since causal is faster). That's probably causing what you're observing....
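A minimal sketch of why such a detection is possible, using hypothetical mask-building helpers (not the library's actual code): a sliding window whose left extent covers the whole sequence, with no right extent, produces exactly the causal mask.

```python
def causal_mask(n):
    # query i may attend to key j iff j <= i
    return [[j <= i for j in range(n)] for i in range(n)]

def local_mask(n, left, right):
    # sliding window: query i may attend to key j iff i - left <= j <= i + right
    return [[i - left <= j <= i + right for j in range(n)] for i in range(n)]

n = 8
# a window of (n - 1, 0) admits every earlier key and no later key,
# i.e. it is the causal mask, so the faster causal path can be taken
assert local_mask(n, left=n - 1, right=0) == causal_mask(n)
```

Because the two kernels accumulate in different orders, the detected-causal path and the generic local path need not be bitwise identical even when the masks coincide.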

We've just updated them

Yes, for varlen the kernel will read V beyond NTOKENS (it reads blocks of e.g. 128 tokens at a time). Typically this is ok because we then multiply...
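A small sketch of why the out-of-range reads are harmless under the usual masking scheme (assuming, as is standard, that scores past the valid length are set to -inf before the softmax): those positions get probability exactly 0, so whatever V values the kernel read there are multiplied by zero.

```python
import math

def softmax(scores):
    # numerically stable softmax: subtract the running max first
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [x / z for x in e]

# two valid tokens, two positions past NTOKENS masked to -inf
scores = [0.5, 1.2, float('-inf'), float('-inf')]
p = softmax(scores)
# math.exp(-inf) == 0.0, so the garbage V rows contribute nothing
assert p[2] == 0.0 and p[3] == 0.0
assert abs(sum(p) - 1.0) < 1e-12
```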

Please don't use time.time() to measure time: CUDA operations are asynchronous, so the call returns before the kernel finishes. You can use torch benchmark instead. https://pytorch.org/tutorials/recipes/recipes/benchmark.html
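A minimal sketch of the linked recipe (shown here on a CPU matmul; the same code works for CUDA ops, where the Timer inserts the necessary synchronization that a bare time.time() pair would miss):

```python
import torch
from torch.utils.benchmark import Timer

# Timer handles warmup, repeats, and (on GPU) CUDA synchronization
t = Timer(
    stmt="a @ a",
    setup="a = torch.randn(256, 256)",
    globals={"torch": torch},
)
m = t.timeit(50)
print(m.median)  # median seconds per invocation
```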