optimize paged attention on triton3

Open grimoire opened this issue 1 year ago • 0 comments

Triton 3 has moved the location of the CUDA fast-math functions. This PR adds support for fast `expf` in paged attention with Triton 3.0.

[!NOTE]
Non-CUDA backends might not work.
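
For readers unfamiliar with the move, here is a minimal sketch of version-tolerant lookup for `fast_expf`. It is not the code in this PR. The helper name `resolve_fast_expf` is hypothetical, and the two module paths are assumptions about where Triton 3.x and Triton 2.x expose the function. The helper returns `(None, None)` when neither path is importable, which is also what a non-CUDA or Triton-free environment would likely produce.

```python
import importlib


def resolve_fast_expf():
    """Return (function, module_path) for the first fast_expf found.

    Returns (None, None) if no candidate module provides it.
    """
    # Assumed locations, newest first. Neither path is taken from this PR.
    candidates = (
        "triton.language.extra.cuda.libdevice",  # assumed Triton 3.x location
        "triton.language.math",                  # assumed Triton 2.x location
    )
    for path in candidates:
        try:
            module = importlib.import_module(path)
        except ImportError:
            # Triton is missing, or this version does not ship the module.
            continue
        fn = getattr(module, "fast_expf", None)
        if fn is not None:
            return fn, path
    return None, None


fast_expf, source = resolve_fast_expf()
print("fast_expf resolved from:", source)
```

When nothing is resolved, a kernel could fall back to the portable `tl.exp`, trading some speed for compatibility.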

The fill-KV kernel and the attention kernel are updated so that the KV layout can be changed in the future.

Oct 08 '24 03:10 grimoire