aphrodite-engine
[Feature]: Support for Triton attention backend for inference
🚀 The feature, motivation and pitch
Currently, PagedAttention only supports a fixed set of head_size values. This prevents models like Magistral 2509 (which uses a head_size of 160) from running at all. vLLM handles this case by falling back to a Triton attention backend instead of PagedAttention.
I recommend offering Triton as a fallback backend whenever a model's head_size is unsupported by PagedAttention.
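As a rough illustration of the proposed fallback, the selection could key off the model's head_size. This is a hypothetical sketch, not Aphrodite's actual API; the backend names and the set of supported head sizes (taken from vLLM's PagedAttention kernels) are assumptions for illustration:

```python
# Head sizes vLLM's PagedAttention kernels support (assumed here for illustration).
PAGED_ATTENTION_HEAD_SIZES = {32, 64, 80, 96, 112, 120, 128, 192, 256}


def select_attention_backend(head_size: int) -> str:
    """Pick PagedAttention when head_size is supported, else fall back to Triton.

    Backend names are illustrative placeholders, not real config values.
    """
    if head_size in PAGED_ATTENTION_HEAD_SIZES:
        return "PAGED_ATTENTION"
    # e.g. Magistral 2509 uses head_size=160, which PagedAttention rejects
    return "TRITON"
```

For example, `select_attention_backend(160)` would route Magistral 2509 to the Triton backend instead of failing at model load.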
Alternatives
Don't support a range of models with head_size values unsupported by PagedAttention.
Additional context
No response