llama-cli: add support for reasoning
This change adds a "partial formatter" that processes partially collected messages (like the server streaming logic) in order to render reasoning content before the EOG token arrives.
In addition, the chat_add_and_format lambda has been moved into a functor, which now calls common_chat_templates_apply directly to allow more robust template-application options.
Logic has been put in place to suppress the system/prompt tags to clean up output.
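For illustration, here is a rough sketch of the idea behind the partial formatter, assuming (purely for this example) that reasoning is delimited by <think>...</think> tags; the actual change goes through the common chat-parsing machinery rather than matching tags by hand:

```cpp
// Illustrative sketch only: keep track of how much of the accumulated assistant
// text has already been printed, and render the new portion as reasoning or
// regular content as it streams in, before the EOG token arrives.
#include <cstdio>
#include <string>

struct partial_formatter_sketch {
    size_t printed      = 0;     // characters of the accumulated text already shown
    bool   in_reasoning = false;

    // `accumulated` is the full assistant output collected so far this round.
    // Tags split across chunks are ignored here to keep the sketch short.
    void update(const std::string & accumulated) {
        const std::string fresh = accumulated.substr(printed);
        printed = accumulated.size();
        for (size_t i = 0; i < fresh.size(); ) {
            if (!in_reasoning && fresh.compare(i, 7, "<think>") == 0) {
                in_reasoning = true;
                i += 7;
                std::printf("[reasoning] ");
            } else if (in_reasoning && fresh.compare(i, 8, "</think>") == 0) {
                in_reasoning = false;
                i += 8;
                std::printf("\n");
            } else {
                std::putchar(fresh[i++]);   // the real code switches console colors here
            }
        }
    }
};

int main() {
    partial_formatter_sketch fmt;
    fmt.update("<think>Okay, the user wants a haiku");   // first chunk, pre-EOG
    fmt.update("<think>Okay, the user wants a haiku about llamas.</think>Soft wool, quiet eyes...");
    return 0;
}
```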
Example output:
./build/bin/llama-cli.exe -m ./models/gpt-oss-20b-mxfp4.gguf -c 2048 -sys "You are a wizard" -p "please recite me a haiku about llamas" --jinja -co
I just updated to clean up the system/prompt tags (see description changes), but I will await feedback before changing anything else! 😊
One thing I was contemplating was splitting the display block into a separate abstraction. The display could become its own type; since more state was added here, it might be a good time to do refactors like this and encapsulate functionality incrementally.
Ack, I found an issue with the logic here. When part of the template string matches the "content" it produces a false match. For example, with the system prompt "You are a wizard" and the template applied (below), it will match against "You are ChatGPT". So I think it has to match the surrounding tokens exactly first.
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-17
Reasoning: medium
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions
You are a wizard
<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there<|end|><|start|>user<|message|>How are you?<|end|><|start|>assistant
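To make the false match concrete, here is a tiny illustration (the rendered text below is abbreviated from the template above, and the matching logic is simplified, not the actual code):

```cpp
#include <cstdio>
#include <string>

// The rendered template contains both "You are ChatGPT, ..." (from the template
// itself) and "You are a wizard" (the actual -sys content). A matcher that only
// sees a partial prefix of the system prompt anchors on the wrong occurrence.
int main() {
    const std::string rendered =
        "<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI."
        "<|end|><|start|>developer<|message|># Instructions\nYou are a wizard\n<|end|>";
    const std::string sys = "You are a wizard";

    const size_t partial = rendered.find(sys.substr(0, 8));   // "You are " -> hits "You are ChatGPT"
    const size_t exact   = rendered.find(sys);                // full content -> hits the -sys text

    std::printf("partial match at %zu, exact match at %zu\n", partial, exact);
    return 0;
}
```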
After testing a bit I found it was not reliable to recover the system prompt tokens exactly in all cases, so I opted to simply print the system prompt and user prompt content before jumping into the loop.
llama-cli exists not only for chatting, but also for testing models in more "real-life" scenarios. It is better to keep all special tags visible for testing/debugging purposes. In the case of reasoning, it should be visibly separated from the rest of the answer, as @CISC has suggested - it's hard to understand where the reasoning is in the example screenshot you've posted.
It is better to keep all special tags visible for testing/debugging purposes.
Keeping the tags would be hard; I don't think it's much of an issue as long as we have visual separation. The main improvement here is enabling --reasoning-budget.
Keeping the tags would be hard; I don't think it's much of an issue as long as we have visual separation. The main improvement here is enabling --reasoning-budget.
If that's intended with jinja, then it's fine, but I would still suggest improving it in future. So long as LLMs can still hallucinate and have mismatched templates, it's always better to double-check.
llama-cli exists not only for chatting, but also for testing models in more "real-life" scenarios.
@MaggotHATE Any chance you would provide an example of the intended testing scenario? Testing is of course a nice angle for having features in llama-cli that complement the server, which might not want those capabilities built in.
Side note: after getting this reasoning in I am going to revisit the tool-call capabilities (as this PR implements much of the required foundation). Part of my initial attempt was too complicated, especially when MCP added OAuth handshakes to the HTTP SSE transport; to me it doesn't make sense to add such complexity, and that is the realm of a scripting language.
What "take two" will have is: (1) only a single toolcall.cpp/h inside the llama-cli project; (2) only support toolcalls via the stdio transport (because there are nice local nodejs proxies and so-forth).
This will add nice testability to the toolcalls.
Any chance you would provide an example of the intended testing scenario? Testing is of course a nice angle for having features in llama-cli that complement the server, which might not want those capabilities built in.
Any long, continuous dialog with a model would provide a good understanding of whether it works correctly and generates all required special tokens; this is especially important with different sampling combinations and settings. For example, old Magistral used to have problems with its thinking tags, which should be fixed in 2509 (I have only tested it briefly, as the model works better without reasoning). Moreover, the idea of "hybrid" reasoning is still in the air, which makes differentiating and outlining the reasoning portions of generated text even more important.
I don't use Jinja, but my understanding is that it would only "render" correct combinations of tags - still, being able to actually see the entire template would be helpful for testing (maybe an arg?).
Side note: after getting this reasoning in I am going to revisit the tool-call capabilities (as this PR implements much of the required foundation). Part of my initial attempt was too complicated, especially when MCP added OAuth handshakes to the HTTP SSE transport; to me it doesn't make sense to add such complexity, and that is the realm of a scripting language.
~~If I understood you correctly, I would advise against introducing any network-related features into llama-cli, and in favor of making a separate tool instead. As of right now, it is fully private, with no way to connect to a network, which is a guarantee. Changing that would make llama-cli potentially less secure/private.~~ Ah yes, that was changed with the remote downloading of models. Alas.
@MaggotHATE The MCP stdio transport basically execs a process and opens a stdin/stdout channel to it. So it amounts to the user specifying one or more command lines to run. And if folks want to use HTTP/SSE there are "adapter" programs that can proxy the local requests/responses to HTTP/SSE (if they so desire). That means there is no networking built in, but the capability is 100% already there using some nodejs apps and so forth.
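For a rough picture of what that looks like (hypothetical server command, error handling trimmed, message framing simplified), the stdio transport on POSIX is essentially spawn-plus-pipes with newline-delimited JSON-RPC over them:

```cpp
#include <cstdio>
#include <cstring>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    int to_child[2];     // parent writes -> child's stdin
    int from_child[2];   // child's stdout -> parent reads
    if (pipe(to_child) != 0 || pipe(from_child) != 0) {
        return 1;
    }

    const pid_t pid = fork();
    if (pid == 0) {
        // child: rewire stdin/stdout to the pipes, then exec the configured MCP server
        dup2(to_child[0],   STDIN_FILENO);
        dup2(from_child[1], STDOUT_FILENO);
        close(to_child[1]);
        close(from_child[0]);
        execlp("some-mcp-server", "some-mcp-server", (char *) nullptr);  // hypothetical command
        _exit(127);
    }

    // parent: send one JSON-RPC request line, read back one response line
    close(to_child[0]);
    close(from_child[1]);
    const char * req = "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"tools/list\"}\n";
    if (write(to_child[1], req, strlen(req)) < 0) {
        return 1;
    }

    char buf[4096];
    const ssize_t n = read(from_child[0], buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        std::printf("response: %s", buf);
    }

    close(to_child[1]);
    close(from_child[0]);
    waitpid(pid, nullptr, 0);
    return 0;
}
```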
I don't use Jinja, but my understanding is that it would only "render" correct combinations of tags - still, being able to actually see the entire template would be helpful for testing (maybe an arg?).
Do you render with the legacy templates or bypass templates altogether?
@MaggotHATE The MCP stdio transport basically execs a process and opens a stdin/stdout channel to it. So it amounts to the user specifying one or more command lines to run. And if folks want to use HTTP/SSE there are "adapter" programs that can proxy the local requests/responses to HTTP/SSE (if they so desire). That means there is no networking built in, but the capability is 100% already there using some nodejs apps and so forth.
Thanks for explaining, I don't have first-hand experience with it and clearly misunderstood it. It will be interesting to have it in llama-cli as (probably) the most straightforward way to test MCP capabilities.
Do you render with the legacy templates or bypass templates altogether?
I use legacy-style templates in my own llama-cli-based program, mostly for the convenience of controlling everything from one .json config. If I remember correctly, there is a similar idea of simple "profile" files for llama.cpp, and in such a case .jinja templates would become less essential (you can store the template in that same file, along with sampling settings and model paths, for example). At the same time, ChatML, as the most popular template format, makes manual configuration almost pointless - it's too strict.
@CISC There is a race condition when the colors are changed using console::set_display(...): that routine sets the color immediately, but because llama-cli output is tightly coupled with the LOG macro, log messages queued via common_log_add(...) are only processed later on, so text can end up printed under the wrong color.
I think we need to separate the main output from the log output. Any existing call to LOG(...) should write immediately, as it should only ever go to stdout. If callers for whatever reason wanted to redirect this, it should be done explicitly on the command-line.
I fixed the issue by adding a console::write routine. Please see description for a screenshot of the new formatting with blue for the reasoning content.
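To illustrate the ordering problem (hypothetical names below, not the actual common/log.h or console API): a color escape written immediately can get separated from message text that still sits in an asynchronous log queue, whereas a direct write keeps them together.

```cpp
#include <cstdio>
#include <queue>
#include <string>

// Stand-in for the log worker's message queue; a real logger drains this from
// a separate thread, which is exactly why the ordering can go wrong.
static std::queue<std::string> g_log_queue;

static void queued_log(const std::string & msg) {   // like a LOG(...) call that enqueues
    g_log_queue.push(msg);
}

static void direct_write(const std::string & msg) { // like an immediate console write
    std::fputs(msg.c_str(), stdout);
    std::fflush(stdout);
}

int main() {
    direct_write("\033[34m");         // color change takes effect immediately
    queued_log("reasoning text\n");   // message is only queued, not yet printed
    direct_write("\033[0m");          // color reset also takes effect immediately

    // The queue is drained "later", the way a log worker would do it, so the
    // reasoning text only appears after the color has already been reset.
    while (!g_log_queue.empty()) {
        std::fputs(g_log_queue.front().c_str(), stdout);
        g_log_queue.pop();
    }
    return 0;
}
```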
One caveat is just that --log-disable will no longer disable output, but this is easy to fix with llama-cli ... >/dev/null for folks who need to silence it.
Happy to discuss/make further adjustments. 😊
The guard against stripped reasoning is very nice - it prevents crashes with several templates!
However something is not quite right, f.ex. with Qwen3-4B-Thinking-2507 the following happens on the second prompt (after initial -p):
[...]
151644 -> '<|im_start|>'
872 -> 'user'
198 -> '
'
151645 -> '<|im_end|>'
198 -> '
'
151644 -> '<|im_start|>'
77091 -> 'assistant'
198 -> '
'
151667 -> '<think>'
198 -> '
'
, the user just sent an empty message after my previous response. Hmm, I need to figure out what they want now.
Interesting. Did you type an empty message as second input? Any chance you would provide the full command history/conversation transcript? I will work on debugging that model.
Interesting. Did you type an empty message as second input? Any chance you would provide the full command history/conversation transcript? I will work on debugging that model.
I merely gave it an initial prompt with -p, then a follow-up prompt (no, not empty of course :) ) once it was done.
Edit: it was bartowski/Qwen_Qwen3-4B-Thinking-2507-GGUF
@CISC The issue with the Qwen models was that chat_formatter was calling common_chat_parse on user messages and the content was being placed into "reasoning content", hence the empty content! Thanks for catching the issue. It should be fixed in latest when you get time to check it.
@CISC The issue with the Qwen models was that chat_formatter was calling common_chat_parse on user messages and the content was being placed into "reasoning content", hence the empty content! Thanks for catching the issue. It should be fixed in latest when you get time to check it.
Yep, works great now, though I noticed it also swallows the first token (Okay in this case) in the response (then and now), but I think that may be a separate issue (the Hermes 2 Pro one too):
n_past = 3103
Parsing input with format Hermes 2 Pro: Okay
n_remain: -3084
eval: [ 'Okay':32313 ]
n_past = 3104
Parsing input with format Hermes 2 Pro: Okay,
,n_remain: -3085
eval: [ ',':11 ]
n_past = 3105
Parsing input with format Hermes 2 Pro: Okay, the
then_remain: -3086
eval: [ ' the':279 ]
n_past = 3106
Parsing input with format Hermes 2 Pro: Okay, the user
usern_remain: -3087
eval: [ ' user':1196 ]
Edit: Yeah, it doesn't do that with --reasoning-budget 0, quite likely due to the Hermes 2 Pro reasoning format.
Interesting, I was seeing the stripped token yesterday but I am not seeing it today... Perhaps after merging master in here? In any case, I just want to double check and make sure this is intended for enabling reasoning as it is set now:
cinputs.enable_thinking =
    params.use_jinja &&
    params.reasoning_budget != 0 &&
    common_chat_templates_support_enable_thinking(chat_templates.get());
I set --reasoning-budget 0 and it outputs the reasoning as a regular assistant message. Is this expected? I am using $ ./build/bin/llama-cli.exe -m Qwen_Qwen3-4B-Thinking-2507-Q8_0.gguf -p "What is nine plus three?" --jinja -co -c 2048 --reasoning-budget 0
./build/bin/llama-cli.exe -m Qwen_Qwen3-4B-Thinking-2507-Q8_0.gguf -p "What is nine plus three?" --jinja -co -c 2048
I set --reasoning-budget 0 and it outputs the reasoning as a regular assistant message. Is this expected? I am using
$ ./build/bin/llama-cli.exe -m Qwen_Qwen3-4B-Thinking-2507-Q8_0.gguf -p "What is nine plus three?" --jinja -co -c 2048 --reasoning-budget 0
No, that should disable thinking (it did for me).
I stepped through with Qwen a bit more and found the issue with the initial token was a problem in the partial_formatter: it had stale data from the previous reasoning, and because the new reasoning started with the same "Okay, ..." it matched against the previous one. I added a clear method that is called at the start of a new round, when user input is processed in the chat_formatter, to fix the issue.
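To make the failure mode concrete, here is a toy version of it (hypothetical names, not the actual chat_formatter/partial_formatter code): a delta emitter that diffs against the previous round's text treats a shared prefix as already printed unless it is cleared between rounds.

```cpp
#include <cstdio>
#include <string>

// Toy delta emitter: prints only the part of `text` beyond the common prefix with
// what was seen before. Stale state from round 1 swallows the start of round 2
// when both rounds begin with the same "Okay, ..." unless clear() is called.
struct delta_emitter {
    std::string seen;

    void emit(const std::string & text) {
        size_t k = 0;
        while (k < text.size() && k < seen.size() && text[k] == seen[k]) {
            k++;
        }
        std::printf("%s\n", text.c_str() + k);
        seen = text;
    }

    void clear() { seen.clear(); }
};

int main() {
    delta_emitter d;
    d.emit("Okay, the user wants a haiku.");        // round 1: printed in full
    d.emit("Okay, the user asked a follow-up.");    // round 2, no clear: "Okay, the user " is swallowed
    d.clear();
    d.emit("Okay, the user asked a follow-up.");    // round 2, with clear: printed in full
    return 0;
}
```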
In addition, I added "Thinking ..." prefix and "...\n\n" suffix, but I am open to changing those. Another possibility could be something like: "[Thinking: ... ]" which seems maybe easier to see, since the models tend to output ... more frequently than square brackets.
In addition, I added "Thinking ..." prefix and "...\n\n" suffix, but I am open to changing those. Another possibility could be something like: "[Thinking: ... ]" which seems maybe easier to see, since the models tend to output ... more frequently than square brackets.
Yeah, it really needs to stand out from regular output; that's hard to accomplish though. I was toying with the idea of perhaps just a simple < before reasoning and regular output, as opposed to the user's >.
Hmm. Perhaps we just leave it as "Thinking ..." / "..." for now, as it is the most "natural language" way; I would imagine folks will use color anyhow. In the future, if there's a reason for concern, we can change it. 😉
EDIT: I will create a couple screenshots we can use for comparison.
@CISC What do you think of these? If we want something terse, maybe a specific glyph might be best to convey the meaning:
Logic/Math symbols (most thematically appropriate):
- ∴ (U+2234) - "Therefore" symbol - perfect for reasoning/conclusions
- ∵ (U+2235) - "Because" symbol - good for premises/reasoning
- ⊢ (U+22A2) - Turnstile - used in logic for "proves" or "entails"
- ⇒ (U+21D2) - Double arrow - implies/entails
General delimiters (widely compatible):
- § (U+00A7) - Section sign - traditional formal marker
- ¶ (U+00B6) - Pilcrow - paragraph/section marker
- ※ (U+203B) - Reference mark - attention/note marker
- ⁂ (U+2042) - Asterism - decorative section break
- ◆ (U+25C6) - Black diamond
- ▸ (U+25B8) - Small triangle - often used for disclosure/expansion
@CISC What do you think of these? If we want something terse, maybe a specific glyph might be best to convey the meaning:
Logic/Math symbols (most thematically appropriate):
- ∴ (U+2234) - "Therefore" symbol - perfect for reasoning/conclusions
- ∵ (U+2235) - "Because" symbol - good for premises/reasoning
- ⊢ (U+22A2) - Turnstile - used in logic for "proves" or "entails"
- ⇒ (U+21D2) - Double arrow - implies/entails
Sorry for the slow response. The double arrow is perhaps not a bad one...
@CISC No worries on the delay! Merge conflicts on llama-cli should be minimal :)
Here are a few of the screenshots. I tend to agree that the double-arrow has the right contextual meaning and sufficient visual prominence. The other symbols kind of sink into the background a bit.
Here are a few of the screenshots. I tend to agree that the double-arrow has the right contextual meaning and sufficient visual prominence. The other symbols kind of sink into the background a bit.
I think it should also be prepended to the regular output to better mark the separation, maybe even colored green to match the input prompt.
Now, the trick is, if a user redirects output to a file we probably shouldn't be messing with the output like this, but then again we can't easily restore the thinking tokens either...
Now, the trick is, if a user redirects output to a file we probably shouldn't be messing with the output like this
Hmm. I think the usual way to handle this would be to write extra delimiters to stderr, so then outputs would just be redirected in that case: llama-cli ... --single-turn 2>/dev/null > conversation.txt. Is that something we would want to support in the console as well?
EDIT: So the idea is that the user calls llama-cli with --single-turn to get one chat iteration and redirects to a file in a non-interactive session, while interactive mode keeps working as it does now. In other modes it wouldn't apply the reasoning partial formatter. It might be fine to keep the double arrow in this case too, because there is no way with the current formatting to parse the trailing end of the reasoning, so it would be read by a human (or an AI agent, I guess).
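A tiny sketch of the stderr idea (hypothetical, not what llama-cli does today): decoration goes to stderr and model text to stdout, so 2>/dev/null > conversation.txt keeps the transcript clean.

```cpp
#include <cstdio>

// Markers/decoration on stderr, actual model output on stdout, so redirecting as
// `llama-cli ... 2>/dev/null > conversation.txt` drops the markers but keeps the text.
int main() {
    std::fprintf(stderr, "⇒ ");                                        // reasoning marker
    std::fprintf(stdout, "The user asks for a haiku about llamas.\n");  // reasoning text
    std::fprintf(stderr, "\n");                                         // end-of-reasoning separator
    std::fprintf(stdout, "Soft wool in the moonlight...\n");            // regular response
    return 0;
}
```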
I think it should also be prepended to the regular output to better mark the separation
How do you mean? Please show an example.
Here is a more explicit version which would allow something to parse the reasoning after the fact, while being slightly more human-readable than the regular templates. Reading it, though, I prefer the concise double-arrow format, but if we need parsing capability or something we'd have to consider this sort of option.
And then building on it, with tool calls it would be something like [Calling tool: get_weather]. I actually have that implemented with a Tcl interpreter already (pending a much-delayed release). But this would use MCP for llama-cli instead of calling a Tcl procedure.
I think it should also be prepended to the regular output to better mark the separation
How do you mean? Please show an example.
I simply mean the following (to copy your example):
⇒ The user asks: ...
⇒ Because ...
>
And then building on it, with tool calls it would be something like [Calling tool: get_weather]. I actually have that implemented with a Tcl interpreter already (pending a much-delayed release). But this would use MCP for llama-cli instead of calling a Tcl procedure.
Yeah, at that point it would certainly make more sense to have some very explicit delimiters like that.
Okay so to summarize some of these ideas:
- Prefix the reasoning block with an "indicator" like ⇒.
- Prefix all reasoning content with ⇒.
- Wrap reasoning in "normalized" delimiters like "[Reasoning: ... ]".
- Prefix with "Thinking... " and possible suffix like "..." (which I suppose is the same as 3 but harder to read).
Of these methods, the ones that could be parsed back out if a conversation is written to a file would be (2), (3) and (4), or some variant of those. Option (1) is the most minimal for an interactive conversation, but impossible to parse from a file, and it takes a little more cognitive load to separate the reasoning from the actual response. Option (2) is probably the easiest for seeing the entire block of reasoning at a glance, but somewhat verbose; when writing to a file, the reasoning blocks could be parsed line-by-line, which would work well.
With color enabled none of this matters, so we're talking mainly about (a) running interactively without color; (b) running a --single-turn chat and sending the output to a file.
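For what it's worth, a rough sketch of how option (2) output could be parsed back out of a saved transcript, assuming every reasoning line starts with "⇒ " (illustrative only):

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Classify transcript lines: lines starting with the reasoning marker are
// reasoning, everything else is regular response text.
int main(int argc, char ** argv) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " conversation.txt\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    const std::string marker = "⇒ ";
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind(marker, 0) == 0) {
            std::cout << "[reasoning] " << line.substr(marker.size()) << "\n";
        } else {
            std::cout << "[response]  " << line << "\n";
        }
    }
    return 0;
}
```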