Tri Dao
max_seqlen_k is a variable on CPU. After the kernel is captured, changing this value will have no effect.
It's similar to other variables on CPU, such as softmax_scale. If the kernel is captured with softmax_scale = 1.0, then if you change softmax_scale to 2.0 afterwards and replay...
You're trying to change a CPU variable after capturing the CUDA graph; that's not supported by CUDA graphs. I haven't looked closely, but it looks like in this case the kernel is...
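To see the behavior in isolation, here's a minimal standalone sketch in plain PyTorch (not the flash-attn API; the `x * scale` kernel and variable names are just illustrative): a CPU-side Python scalar is baked into the graph at capture time, and only the contents of the captured GPU tensors can change between replays.

```python
# Sketch: a CPU scalar captured in a CUDA graph is frozen at its
# capture-time value on replay; only GPU tensor contents can be updated.
import torch

x = torch.randn(4, device="cuda")
scale = 1.0  # CPU-side value, baked into the kernel launch at capture

# CUDA graphs require a warmup run on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    y = x * scale
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x * scale  # captured with scale == 1.0

scale = 2.0        # no effect: the replay still multiplies by 1.0
g.replay()
torch.cuda.synchronize()
assert torch.allclose(y, x)  # y == 1.0 * x, not 2.0 * x

x.copy_(2 * x)  # GPU tensor *contents* CAN be updated between replays
g.replay()      # uses the new data in x, but still scale == 1.0
```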
You'd want
```
auto tile_n = cute::gcd(cute::min(_32{}, size(TileShape_MNK{})), size(TileShape_MNK{}));
```
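(Presumably the gcd is what guarantees divisibility here: gcd(a, b) always divides b, so the resulting tile_n evenly divides size(TileShape_MNK{}), which cute::min alone would not ensure.)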
Wonderful work on the Triton implementation, and very thoughtful suggestions here. Thanks @janEbert! Yes, I'd love to stay up to date with upstream Triton; I just haven't had time to...
> I can take care of some of the integrations from upstream to here if you're fine with losing backward-compatibility. The attention mask/bias will probably not be integrated upstream due...
> Sorry, I've just edited the post above: My only worry is having to figure out the workarounds that had to be implemented here. Were they necessary to support the...
It's because there are people willing to put in the work to make it work for Hopper. So far, no one has contributed the equivalent effort for Turing.
It depends on folks contributing to make it work for Turing.
Thanks for this contribution. This is very impressive! However, I think having different qk headdim and v headdim complicates the code and increases the maintenance workload. I believe it's better...