Tri Dao
I mean flexattn in this repo (flash_attn.cute): https://github.com/Dao-AILab/flash-attention/blob/main/tests/cute/test_score_mod.py
In general a 4D attn mask isn't the right abstraction for prefix-lm (a 4D mask is too general and you'll pay for it with a slowdown). @drisspg do we have example of...
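For reference, this is the kind of thing that works better than a dense 4D mask, sketched here with PyTorch's flex_attention API (the flash_attn.cute flexattn interface may look a bit different; the prefix length is just illustrative):
```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

prefix_len = 256  # illustrative prefix length

def prefix_lm_mask(b, h, q_idx, kv_idx):
    # Bidirectional attention inside the prefix, causal everywhere else.
    return (kv_idx < prefix_len) | (kv_idx <= q_idx)

B, H, S, D = 2, 8, 1024, 64
q, k, v = [torch.randn(B, H, S, D, dtype=torch.bfloat16, device="cuda") for _ in range(3)]

# The mask is a predicate, so the kernel can skip fully-masked blocks instead of
# materializing and reading a dense (B, H, S, S) mask tensor.
block_mask = create_block_mask(prefix_lm_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, block_mask=block_mask)
```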
We choose to have Q @ K^T in fp32 for better numerical stability.
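A rough sketch of what that looks like in a reference-style implementation (names are illustrative, not the flash_attn.cute internals):
```python
import torch

def scores_fp32(q, k, softmax_scale=None):
    # Illustrative only: upcast to fp32 so Q @ K^T (and the softmax after it)
    # run at full precision even for bf16/fp16 inputs.
    # Layout assumed: (batch, seqlen, nheads, headdim).
    if softmax_scale is None:
        softmax_scale = q.shape[-1] ** -0.5
    s = torch.einsum("bqhd,bkhd->bhqk", q.float(), k.float()) * softmax_scale
    return s  # fp32 scores; score_mod / masking would be applied here
```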
you can try that out
FA uses (batch, seqlen, nheads, headdim). Torch sdpa expects (batch, nheads, seqlen, headdim).
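Converting between the two layouts is just a transpose (shapes below are made up for illustration):
```python
import torch

batch, seqlen, nheads, headdim = 2, 1024, 16, 64
q_fa = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.bfloat16, device="cuda")

# FA layout (batch, seqlen, nheads, headdim) -> sdpa layout (batch, nheads, seqlen, headdim)
q_sdpa = q_fa.transpose(1, 2)
# and back
q_back = q_sdpa.transpose(1, 2)
```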
sdpa is probably just running FA2 :D
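One way to check (or force) which backend sdpa picks is PyTorch's sdpa_kernel context manager; just a sketch:
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

B, H, S, D = 2, 8, 1024, 64
q, k, v = [torch.randn(B, H, S, D, dtype=torch.bfloat16, device="cuda") for _ in range(3)]

# Restrict sdpa to the flash backend; this raises if flash can't be used here.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```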
As always, you want to check against a reference implementation: (flashattention in bf16 - reference impl in fp32) vs (reference impl in bf16 - reference impl in fp32).
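Something like this (a sketch of the check, not the repo's actual test; flash_attn_func and the shapes are just how I'd set it up):
```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func  # assuming the flash-attn package is installed

batch, seqlen, nheads, headdim = 2, 512, 8, 64
q, k, v = [torch.randn(batch, seqlen, nheads, headdim, dtype=torch.bfloat16, device="cuda")
           for _ in range(3)]

def ref_attn(q, k, v, dtype):
    # Reference via sdpa in the requested dtype.
    # Convert FA layout (b, s, h, d) -> sdpa layout (b, h, s, d) and back.
    q, k, v = [t.to(dtype).transpose(1, 2) for t in (q, k, v)]
    return F.scaled_dot_product_attention(q, k, v).transpose(1, 2)

out_flash = flash_attn_func(q, k, v)                      # implementation under test
out_ref_fp32 = ref_attn(q, k, v, torch.float32)
out_ref_bf16 = ref_attn(q, k, v, torch.bfloat16).float()

err_impl = (out_flash.float() - out_ref_fp32).abs().max()
err_ref = (out_ref_bf16 - out_ref_fp32).abs().max()
# The implementation's error should be on the same order as the reference's
# own bf16 rounding error.
print(err_impl.item(), err_ref.item())
```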
There's no guarantee of bitwise identical results for two different implementations since floating point math is not associative:
```
In [1]: import torch

In [2]: a = torch.randn(10, dtype=torch.bfloat16, device='cuda')
...
```
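A tiny self-contained illustration of the same point (numbers chosen just to make the rounding visible):
```python
import torch

x = torch.tensor(1e4, dtype=torch.bfloat16)
y = torch.tensor(-1e4, dtype=torch.bfloat16)
z = torch.tensor(1.0, dtype=torch.bfloat16)

# Same mathematical sum, different grouping, different bf16 result:
print((x + y) + z)  # tensor(1., dtype=torch.bfloat16)
print(x + (y + z))  # tensor(0., dtype=torch.bfloat16) -- the 1.0 is lost to rounding
```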
We have new wheels for torch 2.8 now
Yep, I'd love to understand why compile takes so much memory. We do use a lot of templating, but I don't quite get how that translates to a very large...