Feature Request: Kimi-K2-Thinking reasoning and tool calling support
### Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new and useful enhancement to share.
### Feature Description
Support tool calling (function calling) for the Kimi-K2 series natively, including Kimi-K2-Thinking and possibly also Kimi-K2-Instruct.
### Motivation
- Tool calling: Kimi-K2-Thinking's model card says the model is capable of "maintaining stable tool-use across 200–300 sequential calls", but we currently have no native support for it and fall back to the generic JSON method.
- Reasoning: currently we must pass `--special` to make thinking work, as described in the Unsloth documentation (see the example after this list).
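A minimal sketch of the current workaround, assuming a local GGUF file (the model filename and port below are placeholders):

```sh
# Expose special tokens so <think> ... </think> reaches the client.
llama-server -m Kimi-K2-Thinking-Q4_K_M.gguf --jinja --special --port 8080
```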
### Possible Implementation
vLLM: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/tool_parsers/kimi_k2_tool_parser.py
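For reference, the wire format that parser accepts looks roughly like the following (reconstructed from its regexes, so treat the exact layout as an assumption):

```text
<|tool_calls_section_begin|>
<|tool_call_begin|>functions.get_weather:0<|tool_call_argument_begin|>{"city": "Tokyo"}<|tool_call_end|>
<|tool_calls_section_end|>
```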
ik_llama.cpp implemented it previously, but has since switched to the mainline function-calling code: https://github.com/ikawrakow/ik_llama.cpp/pull/628
I'm trying to implement it at https://github.com/KiruyaMomochi/llama.cpp/tree/kimi-k2-thinking by naively copying DeepSeek-V3.1's implementation. However, Kimi-K2 seems to use a different function-name syntax than DeepSeek. I also get an extra `<|tool_calls_section_end|>` token, possibly due to the `--special` flag.
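To make the syntax difference concrete, here is a standalone C++ sketch (not llama.cpp API) that splits a Kimi-K2 tool-call id such as `functions.get_weather:0` into its name and index; the `functions.{name}:{index}` shape is taken from the vLLM parser linked above, so treat it as an assumption rather than a spec:

```cpp
#include <cassert>
#include <cstdio>
#include <regex>
#include <string>

struct kimi_tool_call_id {
    std::string name;
    int         index;
};

// Parse "functions.<name>:<index>"; returns false if the id does not match.
static bool parse_kimi_tool_call_id(const std::string & id, kimi_tool_call_id & out) {
    static const std::regex re(R"(^functions\.([\w\.\-]+):(\d+)$)");
    std::smatch m;
    if (!std::regex_match(id, m, re)) {
        return false;
    }
    out.name  = m[1].str();
    out.index = std::stoi(m[2].str());
    return true;
}

int main() {
    kimi_tool_call_id tc;
    assert(parse_kimi_tool_call_id("functions.get_weather:0", tc));
    assert(tc.name == "get_weather" && tc.index == 0);
    // A DeepSeek-style bare function name should not match this format.
    assert(!parse_kimi_tool_call_id("get_weather", tc));
    printf("name=%s index=%d\n", tc.name.c_str(), tc.index);
    return 0;
}
```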
Updated the fork; it somehow works, but still seems really strange to me, and I'm not sure whether I should open a PR. I haven't checked whether tool calling during thinking is allowed, but the model seems to call a tool directly after `</think>`, which causes the thinking content to be lost in the following assistant message...
Hey @KiruyaMomochi. I wrote the latest iteration of the DS 3.1 tool-calling code, along with a ton of unit tests. I strongly recommend writing your own in the same style if you haven't already; they were very helpful to me.
In the case of DS 3.1, I found the model commonly did things the spec said it shouldn't, but I chose to support what the model was actually doing rather than what the spec said; otherwise it would have been unusable. HTH.
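In that spirit, a standalone test sketch (not the llama.cpp test harness; the marker strings are assumptions lifted from the vLLM parser above) that extracts (id, arguments) pairs from one generation and also covers the stray trailing `<|tool_calls_section_end|>` reported earlier:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

struct raw_tool_call {
    std::string id;   // e.g. "functions.get_weather:0"
    std::string args; // raw JSON string, to be validated separately
};

// Pull every <|tool_call_begin|> ... <|tool_call_end|> span out of the text.
static std::vector<raw_tool_call> extract_tool_calls(const std::string & text) {
    static const std::regex re(
        R"(<\|tool_call_begin\|>\s*([^\s<]+)\s*<\|tool_call_argument_begin\|>\s*([\s\S]*?)\s*<\|tool_call_end\|>)");
    std::vector<raw_tool_call> out;
    for (auto it = std::sregex_iterator(text.begin(), text.end(), re); it != std::sregex_iterator(); ++it) {
        out.push_back({ (*it)[1].str(), (*it)[2].str() });
    }
    return out;
}

int main() {
    const std::string gen =
        "<|tool_calls_section_begin|>"
        "<|tool_call_begin|>functions.get_weather:0<|tool_call_argument_begin|>"
        R"({"city":"Tokyo"})"
        "<|tool_call_end|>"
        "<|tool_calls_section_end|>";

    const auto calls = extract_tool_calls(gen);
    assert(calls.size() == 1);
    assert(calls[0].id   == "functions.get_weather:0");
    assert(calls[0].args == R"({"city":"Tokyo"})");

    // An extra trailing section-end token should not change the result.
    assert(extract_tool_calls(gen + "<|tool_calls_section_end|>").size() == 1);
    return 0;
}
```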
Another solution: https://github.com/ggml-org/llama.cpp/pull/16932
https://github.com/ggml-org/llama.cpp/pull/16932 sort of works, but in my testing with Open Hands it keeps stopping for some reason. I have to type "continue" constantly and it gets stuck in repetitive loops. I haven't taken the time to debug it yet.
> #16932 sort of works, but in my testing with Open Hands it keeps stopping for some reason. I have to type "continue" constantly and it gets stuck in repetitive loops. I haven't taken the time to debug it yet.
I experienced the exact same issue with models constantly stopping. It happened with Qwen3-Coder-30B, GLM-4.5 Air, and MiniMax-M2, and it also occurred with OpenCode and Codex. Combined with your experience, that makes four models and three agentic coding tools exhibiting the same issue, so it's likely a llama.cpp bug.
I shared the issues I was experiencing in that same topic you linked to: https://github.com/ggml-org/llama.cpp/pull/16932