aphrodite-engine
[Feature]: Support for Triton attention backend for inference
🚀 The feature, motivation and pitch
Currently, PagedAttention only supports a fixed set of head_size values. This prevents models like Magistral 2509 (which uses a head_size of 160) from running at all. vLLM handles this case by falling back to a Triton attention backend instead of PagedAttention.
I recommend offering Triton as a fallback backend whenever a model's head_size is unsupported by PagedAttention.
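As a rough illustration of the proposed fallback, the selection could key off the model's head_size. This is a hypothetical sketch, not Aphrodite's actual API; the backend names and the set of supported head sizes (taken from vLLM's PagedAttention kernels) are assumptions for illustration:

```python
# Head sizes vLLM's PagedAttention kernels support (assumed here for illustration).
PAGED_ATTENTION_HEAD_SIZES = {32, 64, 80, 96, 112, 120, 128, 192, 256}


def select_attention_backend(head_size: int) -> str:
    """Pick PagedAttention when head_size is supported, else fall back to Triton.

    Backend names are illustrative placeholders, not real config values.
    """
    if head_size in PAGED_ATTENTION_HEAD_SIZES:
        return "PAGED_ATTENTION"
    # e.g. Magistral 2509 uses head_size=160, which PagedAttention rejects
    return "TRITON"
```

For example, `select_attention_backend(160)` would route Magistral 2509 to the Triton backend instead of failing at model load.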
Alternatives
Don't support a range of models with head_size values unsupported by PagedAttention.
Additional context
No response