Tri Dao
IIRC `a` stores exp(delta_p * A_val), or maybe the product of such terms up to position p. You should work out mathematically what `thread_data[i].y` is. It's the second component of...
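For intuition, here is a minimal plain-Python sketch (not the actual CUDA kernel) of the kind of first-order-recurrence scan this corresponds to, i.e. h_p = a_p * h_{p-1} + b_p with a_p = exp(delta_p * A_val). The pair layout and the names `combine`, `a`, `b` are illustrative assumptions, so check against the real kernel.

```python
# Sketch only: an associative scan over pairs (a, b), where each pair encodes
# the affine update h -> a * h + b. The second component of the running pair
# is then the recurrence state h_p (with h_{-1} = 0).
import math

def combine(left, right):
    # Apply `right` after `left`: a_r * (a_l * h + b_l) + b_r
    a_l, b_l = left
    a_r, b_r = right
    return (a_r * a_l, a_r * b_l + b_r)

def inclusive_scan(pairs):
    out = []
    acc = (1.0, 0.0)  # identity element of the operator
    for pair in pairs:
        acc = combine(acc, pair)
        out.append(acc)
    return out

# Example inputs: a_p = exp(delta_p * A_val), b_p = delta_p * u_p
A_val = -1.0
deltas = [0.1, 0.2, 0.3]
us = [1.0, 2.0, 3.0]
pairs = [(math.exp(d * A_val), d * u) for d, u in zip(deltas, us)]
print([p[1] for p in inclusive_scan(pairs)])  # h_0, h_1, h_2
```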
It's a Triton error; I don't know how to fix it, but you can search the Triton repo issues.
You can try upgrading PyTorch, though I don't think Triton supports V100 very well in general.
> does flash attention 3 works on RTX 3000 series now?

FA3 now works on Ampere, Ada, and Hopper, so RTX 3000 series should work (those are Ampere).
Backward masking is different. It's typically transposed (since we typically do K @ Q^T in the backward instead of Q @ K^T).
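A minimal sketch of that point in plain PyTorch (not the FA kernels): the forward causal mask on S = Q @ K^T keeps entries with key_idx <= query_idx, and when the backward works on S^T = K @ Q^T the same constraint appears transposed.

```python
import torch

seqlen = 5
q_idx = torch.arange(seqlen)
k_idx = torch.arange(seqlen)

# Forward mask on S[query, key]: True = keep
fwd_mask = k_idx[None, :] <= q_idx[:, None]

# Backward mask on S^T[key, query]: same causal constraint, transposed
bwd_mask = q_idx[None, :] >= k_idx[:, None]

assert torch.equal(bwd_mask, fwd_mask.T)
```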
The RTX 5090 is sm120, and that's already included. Why would removing sm100 help?
If Stable Diffusion uses attention, then yes.
We're starting to have FlexAttention implemented on top of FA4, so that should eventually work for this case.
Please check the FlexAttention tests to see if any of those apply to your case.
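For reference, this is the general shape of what those tests exercise, using PyTorch's FlexAttention API with a user-defined mask_mod. The causal mask here is just an example, not your specific case, and whether the FA4-backed path mirrors this interface exactly is an assumption.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Example mask_mod: causal attention (keep positions where q_idx >= kv_idx).
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

B, H, S, D = 1, 4, 128, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

block_mask = create_block_mask(causal, B, H, S, S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)
# In practice you'd usually wrap flex_attention with torch.compile for speed.
```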