Misc. bug: tool calls are broken
Name and Version
Why would anyone implement syntax error checking of the escaped JSON inside the LLM response in a way that does not work? What was the point?
see more info: https://github.com/ikawrakow/ik_llama.cpp/issues/750
Operating systems
No response
Which llama.cpp modules do you know to be affected?
No response
Command line
Problem description & steps to reproduce
// TODO
First Bad Commit
No response
Relevant log output
@magikRUKKOLA Yes, I can confirm from my template implementation that this code is buggy as hell. From my personal experience, the only reliable workaround has been to disable partial streaming for tool calls altogether - keep partial streaming for regular content, but for tool calls only stream them once the entire tool call has been parsed. I will sit down for a refactor of that buggy mess (together with adding some sufficiently complex test cases to catch most of the culprits) when I'm done with Qwen3Next.
Basically, if you want a quick working, if somewhat UX-unfriendly, solution, you can look at what I did in common_chat_parse_nemotron_v2.
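For illustration only, here is a minimal sketch of that workaround (not the actual common_chat_parse_nemotron_v2 code, and all names below are hypothetical): buffer the streamed tool-call text and only surface it once the buffered payload parses as complete JSON. It assumes nlohmann::json, which llama.cpp vendors in its common code.

```cpp
// Hypothetical sketch of "no partial tool streaming": accumulate the
// streamed tool-call payload and only yield it once it is valid JSON.
#include <nlohmann/json.hpp>
#include <optional>
#include <string>

struct tool_call_buffer {
    std::string raw; // tool-call text accumulated so far

    // Append a new chunk from the model; returns the parsed tool call
    // exactly once, when the buffered text becomes complete valid JSON.
    std::optional<nlohmann::json> push(const std::string & chunk) {
        raw += chunk;
        if (!nlohmann::json::accept(raw)) {
            return std::nullopt; // still incomplete (or malformed so far)
        }
        nlohmann::json parsed = nlohmann::json::parse(raw);
        raw.clear();
        return parsed;
    }
};
```

The obvious trade-off, as discussed below, is that the client sees nothing of the tool call until it is fully generated.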
@pwilkin
but for tool calls only stream them once the entire tool call has been parsed.
We both know that this is not a real solution. The whole point of streaming, be it a regular LLM response or a tool call, is to get the output tokens as soon as possible. For example, if the user sees that the LLM is trying to do some stupid shit in the tool call, it would be logical to cancel the response right away and add some clarifications to the initial prompt. Now we are in a situation where such simple functionality is not implemented lol. The code needs to be rewritten ASAP. Moreover, there are supposed to be tests that run a certain LLM quant with a certain seed to make sure the tool-call functionality works as intended. Otherwise, the code is not production-ready at all. This is very sad.
Yeah, as I said, I'm aware of this, but the Qwen3 Next conversion is proving to be extremely time-consuming, to say the least. We don't really need tests on live models, but more robust tests for streaming would certainly be required.
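As a rough idea of what a model-free streaming test could look like - purely a hypothetical sketch, not an existing test in the repo - one can replay a known tool call in varying chunk sizes and assert that a complete, syntactically valid JSON payload is reported exactly once, at the very end:

```cpp
// Hypothetical streaming test sketch: no live model needed.
// Split a known tool call into chunks of every size and check that the
// buffered text only becomes valid JSON once the full payload has arrived.
#include <nlohmann/json.hpp>
#include <cassert>
#include <string>

int main() {
    const std::string full =
        R"({"name":"get_weather","arguments":{"city":"Berlin"}})";

    for (size_t chunk = 1; chunk <= full.size(); ++chunk) {
        std::string buffered;
        int completions = 0;
        for (size_t i = 0; i < full.size(); i += chunk) {
            buffered += full.substr(i, chunk);
            if (nlohmann::json::accept(buffered)) {
                ++completions;             // should fire only on the last chunk
                assert(buffered == full);  // and only for the complete payload
            }
        }
        assert(completions == 1);
    }
    return 0;
}
```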
This issue was closed because it has been inactive for 14 days since being marked as stale.