benchmark
Persistent version of Flash Attention
Added two more variants: triton_tutorial_flash_v2_persistent and triton_tutorial_flash_v2_persistent_tma. These variants handle the non-causal case only. The causal case makes two invocations of attn_fwd_inner, which means we would have an outer loop and two inner loops:

    for ...:  # persistent loop
        for ...:
        for ...:

It is not clear how to flatten this into a single 1D loop.
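To illustrate why the non-causal case flattens cleanly, here is a hedged CPU-side sketch of a persistent schedule (all names and sizes below are illustrative, not the actual kernel's API): a fixed number of "programs" (one per SM) each stride through a single flattened tile index, which is exactly the 1D loop that the causal case's two inner loops do not reduce to.

```python
NUM_PROGRAMS = 4          # stand-in for the number of SMs
NUM_BATCH_HEADS = 3       # (batch * heads) outer dimension
NUM_M_BLOCKS = 5          # query-block dimension

def persistent_schedule(program_id):
    """Tiles visited by one persistent program over the flattened 1D range."""
    total_tiles = NUM_BATCH_HEADS * NUM_M_BLOCKS
    tiles = []
    # Each program strides through the flat index space by NUM_PROGRAMS,
    # then decomposes the flat index back into (batch_head, m_block).
    for tile_id in range(program_id, total_tiles, NUM_PROGRAMS):
        batch_head = tile_id // NUM_M_BLOCKS
        m_block = tile_id % NUM_M_BLOCKS
        tiles.append((batch_head, m_block))
    return tiles

# Across all programs, every tile is covered exactly once.
all_tiles = sorted(t for p in range(NUM_PROGRAMS) for t in persistent_schedule(p))
expected = [(bh, m) for bh in range(NUM_BATCH_HEADS) for m in range(NUM_M_BLOCKS)]
print(all_tiles == expected)
```

With two inner loops per outer iteration, there is no single linear index for the programs to stride over, which is the flattening difficulty noted above.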