Feature Request: Kimi Linear model (Kimi Delta Attention)
Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Support Kimi Linear architecture models such as moonshotai/Kimi-Linear-48B-A3B-Instruct
Motivation
It's a good model, what can I say :) Supporting it also preemptively adds support for an architecture and attention method that the Moonshot devs have hinted at using in their next big model; see e.g. https://x.com/bigeagle_xd/status/1983911519541981247
Possible Implementation
Likely blocked for now by the work going on in #16095, since the token-mixing mechanism used here (Kimi Delta Attention) is a variant of the Gated DeltaNet used in Qwen3-Next. See also the technical report for more details on it.
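For anyone less familiar with the Gated DeltaNet family, here is a minimal, naive per-token sketch of the gated delta rule recurrence it is built on, with a per-channel decay gate standing in for KDA's finer-grained gating as described in the report. This is only an illustrative reference under those assumptions; the function name, shapes, and gating layout are mine, not the actual FLA/Kimi kernels (which are chunked Triton implementations).

```python
import torch

def gated_delta_rule(q, k, v, beta, alpha):
    """
    Naive sequential reference for a (channel-wise) gated delta rule.

    q, k:  (T, d_k)  queries / keys, keys assumed normalized per token
    v:     (T, d_v)  values
    beta:  (T,)      per-token write strength in [0, 1]
    alpha: (T, d_k)  per-token, per-key-channel decay in [0, 1]
                     (use one scalar per token to recover plain Gated DeltaNet)
    Returns o: (T, d_v)
    """
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v, dtype=q.dtype)  # recurrent "fast weight" state
    out = []
    for t in range(T):
        S = alpha[t].unsqueeze(-1) * S                 # decay old memory per key channel
        S = S - beta[t] * torch.outer(k[t], k[t] @ S)  # delta rule: erase old association for k_t
        S = S + beta[t] * torch.outer(k[t], v[t])      # write new association k_t -> v_t
        out.append(q[t] @ S)                           # read out with the query
    return torch.stack(out)

# toy shapes just to show the interface
T, d_k, d_v = 8, 16, 32
o = gated_delta_rule(
    torch.randn(T, d_k),
    torch.nn.functional.normalize(torch.randn(T, d_k), dim=-1),
    torch.randn(T, d_v),
    torch.rand(T),
    torch.rand(T, d_k),
)
print(o.shape)  # (8, 32)
```

The point of the sketch is just that the state update is the same delta-rule shape as in #16095's Gated DeltaNet work, with the scalar forget gate replaced by a per-channel one, so most of that plumbing should carry over.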
This is what GPT-5 reported from inspecting the Python code of the KDA commit in the flash-linear-attention repo (maybe useful), along with the kernel it produced when asked (likely buggy): https://chatgpt.com/share/69088b9d-7260-800f-abe6-e0efc26baf4d
Seconding this. Please make it happen for both CUDA and MLX/MPS.
There's a patch that fixes the FLA import errors: https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct/commit/d64a5299ded33ab2609617e05f6bd2cf9b6eef35
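In the meantime, a hedged sketch of running the Hugging Face checkpoint as a reference implementation (e.g. to compare outputs against a native port later), assuming that patched repo plus a recent flash-linear-attention install. This uses only the standard transformers loading API; nothing here is specific to this project.

```python
# pip install -U transformers flash-linear-attention accelerate  (assumed prerequisites)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-Linear-48B-A3B-Instruct"

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # the custom KDA modeling code ships in the model repo
    torch_dtype="auto",
    device_map="auto",
)

inputs = tok("Hello, Kimi!", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```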