Feature Request: Kimi Linear model (Kimi Delta Attention)
Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Support Kimi Linear architecture models such as moonshotai/Kimi-Linear-48B-A3B-Instruct
Motivation
It's a good model, what can I say :) Supporting it also preemptively adds support for an architecture and attention method that the Moonshot devs have hinted at using in their next big model; see e.g. https://x.com/bigeagle_xd/status/1983911519541981247
Possible Implementation
Likely blocked for now by the work going on in #16095, since the token-mixing mechanism used here (Kimi Delta Attention) is a variant of the Gated DeltaNet used in Qwen3-Next. See also the technical report for more details on it.
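For anyone less familiar with the Gated DeltaNet family, here is a minimal, naive per-token sketch of the gated delta rule recurrence it is built on, with a per-channel decay gate standing in for KDA's finer-grained gating as described in the report. This is only an illustrative reference under those assumptions; the function name, shapes, and gating layout are mine, not the actual FLA/Kimi kernels (which are chunked Triton implementations).

```python
import torch

def gated_delta_rule(q, k, v, beta, alpha):
    """
    Naive sequential reference for a (channel-wise) gated delta rule.

    q, k:  (T, d_k)  queries / keys, keys assumed normalized per token
    v:     (T, d_v)  values
    beta:  (T,)      per-token write strength in [0, 1]
    alpha: (T, d_k)  per-token, per-key-channel decay in [0, 1]
                     (use one scalar per token to recover plain Gated DeltaNet)
    Returns o: (T, d_v)
    """
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v, dtype=q.dtype)  # recurrent "fast weight" state
    out = []
    for t in range(T):
        S = alpha[t].unsqueeze(-1) * S                 # decay old memory per key channel
        S = S - beta[t] * torch.outer(k[t], k[t] @ S)  # delta rule: erase old association for k_t
        S = S + beta[t] * torch.outer(k[t], v[t])      # write new association k_t -> v_t
        out.append(q[t] @ S)                           # read out with the query
    return torch.stack(out)

# toy shapes just to show the interface
T, d_k, d_v = 8, 16, 32
o = gated_delta_rule(
    torch.randn(T, d_k),
    torch.nn.functional.normalize(torch.randn(T, d_k), dim=-1),
    torch.randn(T, d_v),
    torch.rand(T),
    torch.rand(T, d_k),
)
print(o.shape)  # (8, 32)
```

The point of the sketch is just that the state update is the same delta-rule shape as in #16095's Gated DeltaNet work, with the scalar forget gate replaced by a per-channel one, so most of that plumbing should carry over.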
This is what GPT-5 reported from inspecting the Python code of the KDA commit in the flash-linear-attention repo (maybe useful), along with the kernel it produced when asked (likely buggy): https://chatgpt.com/share/69088b9d-7260-800f-abe6-e0efc26baf4d
Seconding this. Please make it happen for both CUDA and MLX/MPS.
There's a patch that fixes the FLA import errors: https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct/commit/d64a5299ded33ab2609617e05f6bd2cf9b6eef35
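In the meantime, a hedged sketch of running the Hugging Face checkpoint as a reference implementation (e.g. to compare outputs against a native port later), assuming that patched repo plus a recent flash-linear-attention install. This uses only the standard transformers loading API; nothing here is specific to this project.

```python
# pip install -U transformers flash-linear-attention accelerate  (assumed prerequisites)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-Linear-48B-A3B-Instruct"

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # the custom KDA modeling code ships in the model repo
    torch_dtype="auto",
    device_map="auto",
)

inputs = tok("Hello, Kimi!", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```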