Feature Request: Kimi-K2-Thinking reasoning and tool calling support
### Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new and useful enhancement to share.
### Feature Description
Support tool calling (function calling) for the Kimi-K2 series natively, including Kimi-K2-Thinking and possibly also Kimi-K2-Instruct.
### Motivation
- Tool calling: Kimi-K2-Thinking's model card says the model is capable of "maintaining stable tool-use across 200–300 sequential calls", but we currently have no native support for it and fall back to the generic JSON method.
- Reasoning: currently we must pass `--special` to make thinking work, as described in the Unsloth documentation (see the example after this list).
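A minimal sketch of the current workaround, assuming a local GGUF file (the model filename and port below are placeholders):

```sh
# Expose special tokens so <think> ... </think> reaches the client.
llama-server -m Kimi-K2-Thinking-Q4_K_M.gguf --jinja --special --port 8080
```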
### Possible Implementation
vLLM: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/tool_parsers/kimi_k2_tool_parser.py
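For reference, the wire format that parser accepts looks roughly like the following (reconstructed from its regexes, so treat the exact layout as an assumption):

```text
<|tool_calls_section_begin|>
<|tool_call_begin|>functions.get_weather:0<|tool_call_argument_begin|>{"city": "Tokyo"}<|tool_call_end|>
<|tool_calls_section_end|>
```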
ik_llama.cpp implemented it previously, but has since switched to the mainline function-calling code: https://github.com/ikawrakow/ik_llama.cpp/pull/628
I'm trying to implement it at https://github.com/KiruyaMomochi/llama.cpp/tree/kimi-k2-thinking by naively copying DeepSeek-V3.1's implementation. However, Kimi-K2 seems to use a different function-name syntax than DeepSeek. I also get an extra `<|tool_calls_section_end|>` token, possibly due to the `--special` flag.
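To make the syntax difference concrete, here is a standalone C++ sketch (not llama.cpp API) that splits a Kimi-K2 tool-call id such as `functions.get_weather:0` into its name and index; the `functions.{name}:{index}` shape is taken from the vLLM parser linked above, so treat it as an assumption rather than a spec:

```cpp
#include <cassert>
#include <cstdio>
#include <regex>
#include <string>

struct kimi_tool_call_id {
    std::string name;
    int         index;
};

// Parse "functions.<name>:<index>"; returns false if the id does not match.
static bool parse_kimi_tool_call_id(const std::string & id, kimi_tool_call_id & out) {
    static const std::regex re(R"(^functions\.([\w\.\-]+):(\d+)$)");
    std::smatch m;
    if (!std::regex_match(id, m, re)) {
        return false;
    }
    out.name  = m[1].str();
    out.index = std::stoi(m[2].str());
    return true;
}

int main() {
    kimi_tool_call_id tc;
    assert(parse_kimi_tool_call_id("functions.get_weather:0", tc));
    assert(tc.name == "get_weather" && tc.index == 0);
    // A DeepSeek-style bare function name should not match this format.
    assert(!parse_kimi_tool_call_id("get_weather", tc));
    printf("name=%s index=%d\n", tc.name.c_str(), tc.index);
    return 0;
}
```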
Updated the fork; it somehow works, but still seems really strange to me, and I'm not sure whether I should open a PR. I haven't checked whether tool calling during thinking is allowed, but the model seems to call a tool directly after `</think>`, which causes the thinking content to be lost in the following assistant message...
Hey @KiruyaMomochi. I wrote the latest iteration of the DS 3.1 tool-calling code, along with a ton of unit tests. I strongly recommend writing your own in the same style if you haven't already; they were very helpful to me.
In the case of DS 3.1, I found the model commonly did things the spec said it shouldn't, but I chose to support what the model was actually doing rather than what the spec said; otherwise it would have been unusable. HTH.
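In that spirit, a standalone test sketch (not the llama.cpp test harness; the marker strings are assumptions lifted from the vLLM parser above) that extracts (id, arguments) pairs from one generation and also covers the stray trailing `<|tool_calls_section_end|>` reported earlier:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

struct raw_tool_call {
    std::string id;   // e.g. "functions.get_weather:0"
    std::string args; // raw JSON string, to be validated separately
};

// Pull every <|tool_call_begin|> ... <|tool_call_end|> span out of the text.
static std::vector<raw_tool_call> extract_tool_calls(const std::string & text) {
    static const std::regex re(
        R"(<\|tool_call_begin\|>\s*([^\s<]+)\s*<\|tool_call_argument_begin\|>\s*([\s\S]*?)\s*<\|tool_call_end\|>)");
    std::vector<raw_tool_call> out;
    for (auto it = std::sregex_iterator(text.begin(), text.end(), re); it != std::sregex_iterator(); ++it) {
        out.push_back({ (*it)[1].str(), (*it)[2].str() });
    }
    return out;
}

int main() {
    const std::string gen =
        "<|tool_calls_section_begin|>"
        "<|tool_call_begin|>functions.get_weather:0<|tool_call_argument_begin|>"
        R"({"city":"Tokyo"})"
        "<|tool_call_end|>"
        "<|tool_calls_section_end|>";

    const auto calls = extract_tool_calls(gen);
    assert(calls.size() == 1);
    assert(calls[0].id   == "functions.get_weather:0");
    assert(calls[0].args == R"({"city":"Tokyo"})");

    // An extra trailing section-end token should not change the result.
    assert(extract_tool_calls(gen + "<|tool_calls_section_end|>").size() == 1);
    return 0;
}
```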
Another solution: https://github.com/ggml-org/llama.cpp/pull/16932
https://github.com/ggml-org/llama.cpp/pull/16932 sort of works, but in my testing with Open Hands it keeps stopping for some reason. I have to type "continue" constantly and it gets stuck in repetitive loops. I haven't taken the time to debug it yet.
> #16932 sort of works, but in my testing with Open Hands it keeps stopping for some reason. I have to type "continue" constantly and it gets stuck in repetitive loops. I haven't taken the time to debug it yet.
I experienced the exact same issue with models constantly stopping. It happened with Qwen3-Coder-30B, GLM-4.5 Air, and MiniMax-M2, and it also occurred with OpenCode and Codex. Combined with your experience, that makes four models and three agentic coding tools exhibiting the same issue, so it's likely a llama.cpp bug.
I shared the issues I was experiencing in that same topic you linked to: https://github.com/ggml-org/llama.cpp/pull/16932