lmdeploy
optimize paged attention on triton3
Triton 3 has moved the location of the CUDA fast-math functions. This PR adds support for fast `expf` in paged attention with Triton 3.0.
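A minimal sketch of how a version-dependent lookup for `fast_expf` could work. This is not the code from this PR: the helper names `fast_math_module_path` and `resolve_fast_expf` are hypothetical, and the two module paths are assumptions about where Triton 2.x and Triton 3.x expose the CUDA fast-math wrappers.

```python
import importlib


def fast_math_module_path(triton_version: str) -> str:
    """Return the dotted module path assumed to hold `fast_expf`."""
    major = int(triton_version.split(".")[0])
    if major >= 3:
        # Assumption: Triton 3.x exposes the CUDA libdevice wrappers here.
        return "triton.language.extra.cuda.libdevice"
    # Assumption: Triton 2.x exposed them on triton.language.math.
    return "triton.language.math"


def resolve_fast_expf():
    """Look up `fast_expf` for the installed Triton; None if unavailable."""
    try:
        triton = importlib.import_module("triton")
        module = importlib.import_module(
            fast_math_module_path(triton.__version__))
    except ImportError:
        # Triton is missing, or the assumed module does not exist.
        return None
    return getattr(module, "fast_expf", None)
```

If `resolve_fast_expf()` returns `None`, a kernel could fall back to the regular `tl.exp`, which is one way to keep a code path available when the fast-math wrapper cannot be found.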
[!NOTE]
Non-CUDA backends might not work.
The fill-kv kernel and the attention kernel have been updated so that the KV layout can be changed in the future.