Tri Dao
IIRC `a` stores exp(delta_p * A_val), or maybe the product of such terms up to position p. You should work out mathematically what `thread_data[i].y` is. It's the second component of...
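For intuition, here is a minimal plain-Python sketch (not the actual CUDA kernel) of the kind of first-order-recurrence scan this corresponds to, i.e. h_p = a_p * h_{p-1} + b_p with a_p = exp(delta_p * A_val). The pair layout and the names `combine`, `a`, `b` are illustrative assumptions, so check against the real kernel.

```python
# Sketch only: an associative scan over pairs (a, b), where each pair encodes
# the affine update h -> a * h + b. The second component of the running pair
# is then the recurrence state h_p (with h_{-1} = 0).
import math

def combine(left, right):
    # Apply `right` after `left`: a_r * (a_l * h + b_l) + b_r
    a_l, b_l = left
    a_r, b_r = right
    return (a_r * a_l, a_r * b_l + b_r)

def inclusive_scan(pairs):
    out = []
    acc = (1.0, 0.0)  # identity element of the operator
    for pair in pairs:
        acc = combine(acc, pair)
        out.append(acc)
    return out

# Example inputs: a_p = exp(delta_p * A_val), b_p = delta_p * u_p
A_val = -1.0
deltas = [0.1, 0.2, 0.3]
us = [1.0, 2.0, 3.0]
pairs = [(math.exp(d * A_val), d * u) for d, u in zip(deltas, us)]
print([p[1] for p in inclusive_scan(pairs)])  # h_0, h_1, h_2
```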
It's a Triton error; I don't know how to fix it, but you can search the Triton repo issues.
You can try upgrading PyTorch, though I don't think Triton supports V100 very well in general.
> does flash attention 3 works on RTX 3000 series now?

FA3 now works on Ampere, Ada, and Hopper, so RTX 3000 series should work (those are Ampere).
Backward masking is different. It's typically transposed (since we typically do K @ Q^T in the backward instead of Q @ K^T).
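A minimal sketch of that point in plain PyTorch (not the FA kernels): the forward causal mask on S = Q @ K^T keeps entries with key_idx <= query_idx, and when the backward works on S^T = K @ Q^T the same constraint appears transposed.

```python
import torch

seqlen = 5
q_idx = torch.arange(seqlen)
k_idx = torch.arange(seqlen)

# Forward mask on S[query, key]: True = keep
fwd_mask = k_idx[None, :] <= q_idx[:, None]

# Backward mask on S^T[key, query]: same causal constraint, transposed
bwd_mask = q_idx[None, :] >= k_idx[:, None]

assert torch.equal(bwd_mask, fwd_mask.T)
```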
The RTX 5090 is sm120, and that's already included. Why would removing sm100 help?
If Stable Diffusion uses attention, then yes.
We're starting to have FlexAttention implemented on top of FA4, so that should eventually work for this case.
Please check the FlexAttention tests to see if any of those apply to your case.
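For reference, this is the general shape of what those tests exercise, using PyTorch's FlexAttention API with a user-defined mask_mod. The causal mask here is just an example, not your specific case, and whether the FA4-backed path mirrors this interface exactly is an assumption.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Example mask_mod: causal attention (keep positions where q_idx >= kv_idx).
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

B, H, S, D = 1, 4, 128, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

block_mask = create_block_mask(causal, B, H, S, S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)
# In practice you'd usually wrap flex_attention with torch.compile for speed.
```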