Tri Dao

639 comments by Tri Dao

If your padding tokens are only on Q and not on K & V, you can just pretend those are legit tokens and you don't need seqused_q, right? Then the output...

If you have a kernel that zeros out the padding tokens (something like `out[padding_indices, :, :] = 0.0`), then you could apply that to the output and the incoming gradient...
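For instance, a minimal PyTorch sketch of that pattern (`padding_indices` and the wrapper name are illustrative assumptions, not part of the flash-attention API):

```python
import torch

class ZeroPaddingRows(torch.autograd.Function):
    """Zero padded query rows in the output and in the incoming gradient.

    A sketch only: `padding_indices` is a hypothetical 1-D tensor of row
    indices into the packed (total_q, nheads, headdim) output that
    correspond to padding tokens in Q.
    """

    @staticmethod
    def forward(ctx, out, padding_indices):
        ctx.save_for_backward(padding_indices)
        out = out.clone()
        out[padding_indices, :, :] = 0.0  # zero padded rows of the output
        return out

    @staticmethod
    def backward(ctx, grad_out):
        (padding_indices,) = ctx.saved_tensors
        grad_out = grad_out.clone()
        grad_out[padding_indices, :, :] = 0.0  # zero padded rows of the gradient
        return grad_out, None
```

Wrapping the attention output as `out = ZeroPaddingRows.apply(attn_out, padding_indices)` would then keep both the forward output and the backward gradient zeroed at the padded positions.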

Yes, there's quite a bit of perf difference. I'd recommend CUDA 12.8+.

Just want to echo this; it would make things much easier than just reading the SASS.

Hdim 256 isn't currently supported on sm100. We might get to it later.

Probably 1-2 months. We're focusing on hdim 128 and hdim 192-128 (DeepSeek).

It's better to add to the existing interface instead of duplicating code.

Is `self.tiles_per_page` a compile-time constant? If so, we should add it to the `compile_key` in the interface.
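To illustrate the point (purely a sketch; the cache and names here are hypothetical, not the actual interface): if `tiles_per_page` is baked into the kernel at compile time, it has to be part of the key the compile cache is looked up with, or two different configurations would reuse the same compiled kernel.

```python
_compile_cache = {}

def _compile_kernel(dtype, head_dim, tiles_per_page):
    # Stand-in for the actual JIT compilation step.
    return f"kernel<{dtype}, hdim={head_dim}, tiles_per_page={tiles_per_page}>"

def get_kernel(dtype, head_dim, tiles_per_page):
    # tiles_per_page is a compile-time constant, so it must be part of the
    # cache key; omitting it would hand back a kernel compiled for a
    # different page layout.
    compile_key = (dtype, head_dim, tiles_per_page)
    if compile_key not in _compile_cache:
        _compile_cache[compile_key] = _compile_kernel(dtype, head_dim, tiles_per_page)
    return _compile_cache[compile_key]
```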

The implementation is here: https://github.com/Dao-AILab/flash-attention/blob/4d9ba4f018cca5c8ca6c6f1df08fea75f119b06d/csrc/flash_attn/src/alibi.h#L31 If causal, we add `alibi_slope * column_idx` to each element of the attention scores (the full bias is `-alibi_slope * (row_idx - col_idx)`, but the `-alibi_slope * row_idx` term is constant within each softmax row and drops out). If not causal, we add `-alibi_slope * |row_idx - col_idx|`. The...
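A small PyTorch sketch reconstructing the bias from that description (equal query and key lengths assumed; this ignores the unequal-seqlen offset handling in the real kernel):

```python
import torch

def alibi_bias(alibi_slope: float, seqlen: int, causal: bool) -> torch.Tensor:
    """Bias added to the (seqlen, seqlen) attention scores, per the description above."""
    row = torch.arange(seqlen, dtype=torch.float32).unsqueeze(1)  # query index
    col = torch.arange(seqlen, dtype=torch.float32).unsqueeze(0)  # key index
    if causal:
        # Full causal bias is -slope * (row - col); the -slope * row part is
        # constant within each softmax row and drops out of the softmax,
        # so adding slope * col alone gives the same attention weights.
        return (alibi_slope * col).expand(seqlen, seqlen)
    # Non-causal: penalize the absolute distance between positions.
    return -alibi_slope * (row - col).abs()
```

Adding this bias to the raw scores before `torch.softmax(scores + bias, dim=-1)` reproduces ALiBi's distance penalty.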