Tri Dao

429 comments by Tri Dao

FA3 doesn't have dropout

Not any time soon unless someone volunteers to work on it.

The right thing to compare to is standard attention in fp32. In this case FlashAttention is actually **more** accurate than the standard implementation in fp16:

```
torch.manual_seed(0)
batch_size = 1...
```
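The comparison methodology can be sketched as follows (a minimal NumPy illustration, not the original PyTorch snippet; the shapes, function names, and tolerances here are assumptions): compute a reference in fp32, run the same attention computation entirely in fp16, and measure the deviation from the reference.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax along the last axis
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, dtype):
    # run the whole computation (scores, softmax, output) in the given precision
    q, k, v = (t.astype(dtype) for t in (q, k, v))
    scale = np.asarray(1.0 / np.sqrt(q.shape[-1]), dtype=dtype)
    scores = (q @ k.T) * scale
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 64), dtype=np.float32) for _ in range(3))

ref = attention(q, k, v, np.float32)       # fp32 reference ("the right thing")
out_fp16 = attention(q, k, v, np.float16)  # standard implementation in fp16
err = np.abs(out_fp16.astype(np.float32) - ref).max()
print(f"max |fp16 - fp32| = {err:.2e}")    # nonzero, but small
```

The same harness applies to a FlashAttention output: measure its max deviation from the fp32 reference rather than from the fp16 standard implementation.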

Looks like a Triton error

Idk how the HF version is implemented. We recommend the version in this repo.

Oh, it just hasn't been tested very well. The dq semaphore should work, except for hdim256. I'm not sure the dk & dv semaphores (used with GQA) have worked yet.

One issue I can see is that in the backward pass, if lse = +inf then exp(qk - lse) returns 0, which is what we want. If lse = -inf...
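The two cases can be checked numerically (a minimal NumPy sketch of the floating-point behavior, not the kernel's actual code path):

```python
import numpy as np

qk = np.float32(5.0)

# lse = +inf: exp(qk - lse) = exp(-inf) = 0, which is the desired result
assert np.exp(qk - np.float32(np.inf)) == 0.0

# lse = -inf: exp(qk - lse) = exp(+inf) = inf, which poisons downstream math
assert np.isinf(np.exp(qk - np.float32(-np.inf)))

# and if qk is itself -inf (e.g. a fully masked row), -inf - (-inf) = NaN
with np.errstate(invalid="ignore"):
    assert np.isnan(np.float32(-np.inf) - np.float32(-np.inf))
```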

Are you using the latest commit? There's a recent update to enable causal for the backward. Can you profile to get the time for the attention kernel?

This commit: https://github.com/Dao-AILab/flash-attention/commit/bafe253042fb251a28f351ad0a2657da26263f31