llama-cli: add support for reasoning
This change adds a "partial formatter" that processes partially collected messages (like the server streaming logic) in order to render reasoning content before the EOG token arrives.
In addition, the chat_add_and_format lambda has been moved into a functor, which now calls common_chat_templates_apply directly to allow more robust template-application options.
Logic has been put in place to suppress the system/prompt tags to clean up output.
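For illustration, here is a rough sketch of the idea behind the partial formatter, assuming (purely for this example) that reasoning is delimited by <think>...</think> tags; the actual change goes through the common chat-parsing machinery rather than matching tags by hand:

```cpp
// Illustrative sketch only: keep track of how much of the accumulated assistant
// text has already been printed, and render the new portion as reasoning or
// regular content as it streams in, before the EOG token arrives.
#include <cstdio>
#include <string>

struct partial_formatter_sketch {
    size_t printed      = 0;     // characters of the accumulated text already shown
    bool   in_reasoning = false;

    // `accumulated` is the full assistant output collected so far this round.
    // Tags split across chunks are ignored here to keep the sketch short.
    void update(const std::string & accumulated) {
        const std::string fresh = accumulated.substr(printed);
        printed = accumulated.size();
        for (size_t i = 0; i < fresh.size(); ) {
            if (!in_reasoning && fresh.compare(i, 7, "<think>") == 0) {
                in_reasoning = true;
                i += 7;
                std::printf("[reasoning] ");
            } else if (in_reasoning && fresh.compare(i, 8, "</think>") == 0) {
                in_reasoning = false;
                i += 8;
                std::printf("\n");
            } else {
                std::putchar(fresh[i++]);   // the real code switches console colors here
            }
        }
    }
};

int main() {
    partial_formatter_sketch fmt;
    fmt.update("<think>Okay, the user wants a haiku");   // first chunk, pre-EOG
    fmt.update("<think>Okay, the user wants a haiku about llamas.</think>Soft wool, quiet eyes...");
    return 0;
}
```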
Example output:
./build/bin/llama-cli.exe -m ./models/gpt-oss-20b-mxfp4.gguf -c 2048 -sys "You are a wizard" -p "please recite me a haiku about llamas" --jinja -co
I just updated to clean up the system/prompt tags (see description changes), but I will await feedback before changing anything else! 😊
One thing I was contemplating was splitting the display block into a separate abstraction. The display could become its own type; since more state was added here, it might be a good time to do refactors like this and encapsulate functionality incrementally.
Ack, I found an issue with the logic here. When part of the template string matches the "content" it produces a false match. For example, with the system prompt "You are a wizard" and the template applied (below), it will match against "You are ChatGPT". So I think it has to match the surrounding tokens exactly first.
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-17
Reasoning: medium
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions
You are a wizard
<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there<|end|><|start|>user<|message|>How are you?<|end|><|start|>assistant
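To make the false match concrete, here is a tiny illustration (the rendered text below is abbreviated from the template above, and the matching logic is simplified, not the actual code):

```cpp
#include <cstdio>
#include <string>

// The rendered template contains both "You are ChatGPT, ..." (from the template
// itself) and "You are a wizard" (the actual -sys content). A matcher that only
// sees a partial prefix of the system prompt anchors on the wrong occurrence.
int main() {
    const std::string rendered =
        "<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI."
        "<|end|><|start|>developer<|message|># Instructions\nYou are a wizard\n<|end|>";
    const std::string sys = "You are a wizard";

    const size_t partial = rendered.find(sys.substr(0, 8));   // "You are " -> hits "You are ChatGPT"
    const size_t exact   = rendered.find(sys);                // full content -> hits the -sys text

    std::printf("partial match at %zu, exact match at %zu\n", partial, exact);
    return 0;
}
```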
After testing a bit I found it was not reliable to recover the system prompt tokens exactly in all cases, so I opted to simply print the system prompt and user prompt content before jumping into the loop.
llama-cli exists not only for chatting, but also for testing models in more "real-life" scenarios. It is better to keep all special tags visible for testing/debugging purposes. In the case of reasoning, it should be visibly separated from the rest of the answer, as @CISC has suggested - it's hard to understand where the reasoning is in the example screenshot you've posted.
It is better to keep all special tags visible for testing/debugging purposes.
Keeping the tags would be hard; I don't think it's much of an issue as long as we have visual separation. The main improvement here is enabling --reasoning-budget.
Keeping the tags would be hard; I don't think it's much of an issue as long as we have visual separation. The main improvement here is enabling --reasoning-budget.
If that's intended with jinja, then it's fine, but I would still suggest improving it in future. So long as LLMs can still hallucinate and have mismatched templates, it's always better to double-check.
llama-cli exists not only for chatting, but also for testing models in more "real-life" scenarios.
@MaggotHATE Any chance you would provide an example of the intended testing scenario? Testing is of course a nice angle for having features in llama-cli that complement the server, which might not want those capabilities built in.
Side note: after getting this reasoning in I am going to revisit the tool-call capabilities (as this PR implements much of the required foundation). Part of my initial attempt was too complicated, especially when MCP added OAuth handshakes to the HTTP SSE transport; to me it doesn't make sense to add such complexity, and that is the realm of a scripting language.
What "take two" will have is: (1) only a single toolcall.cpp/h inside the llama-cli project; (2) only support toolcalls via the stdio transport (because there are nice local nodejs proxies and so-forth).
This will add nice testability to the toolcalls.
Any chance you would provide an example of the intended testing scenario? Testing is of course a nice angle for having features in llama-cli that complement the server, which might not want those capabilities built in.
Any long, continuous dialog with a model would provide a good understanding of whether it works correctly and generates all required special tokens; this is especially important with different sampling combinations and settings. For example, old Magistral used to have problems with its thinking tags, which should be fixed in 2509 (I have only tested it briefly, as the model works better without reasoning). Moreover, the idea of "hybrid" reasoning is still in the air, which makes differentiating and outlining the reasoning portions of generated text even more important.
I don't use Jinja, but my understanding is that it would only "render" correct combinations of tags - still, being able to actually see the entire template would be helpful for testing (maybe an arg?).
Side note: after getting this reasoning in I am going to revisit the tool-call capabilities (as this PR implements much of the required foundation). Part of my initial attempt was too complicated, especially when MCP added OAuth handshakes to the HTTP SSE transport; to me it doesn't make sense to add such complexity, and that is the realm of a scripting language.
~~If I understood you correctly, I would advise against introducing any network-related features into llama-cli, and in favor of making a separate tool instead. As of right now, it is fully private, with no way to connect to a network, which is a guarantee. Changing that would make llama-cli potentially less secure/private.~~ Ah yes, that was changed with the remote downloading of models. Alas.
@MaggotHATE The MCP stdio transport basically execs a process and opens a stdin/stdout channel to it. So it amounts to the user specifying one or more command lines to run. And if folks want to use HTTP/SSE there are "adapter" programs that can proxy the local requests/responses to HTTP/SSE (if they so desire). That means there is no networking built in, but the capability is 100% already there using some nodejs apps and so forth.
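For a rough picture of what that looks like (hypothetical server command, error handling trimmed, message framing simplified), the stdio transport on POSIX is essentially spawn-plus-pipes with newline-delimited JSON-RPC over them:

```cpp
#include <cstdio>
#include <cstring>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    int to_child[2];     // parent writes -> child's stdin
    int from_child[2];   // child's stdout -> parent reads
    if (pipe(to_child) != 0 || pipe(from_child) != 0) {
        return 1;
    }

    const pid_t pid = fork();
    if (pid == 0) {
        // child: rewire stdin/stdout to the pipes, then exec the configured MCP server
        dup2(to_child[0],   STDIN_FILENO);
        dup2(from_child[1], STDOUT_FILENO);
        close(to_child[1]);
        close(from_child[0]);
        execlp("some-mcp-server", "some-mcp-server", (char *) nullptr);  // hypothetical command
        _exit(127);
    }

    // parent: send one JSON-RPC request line, read back one response line
    close(to_child[0]);
    close(from_child[1]);
    const char * req = "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"tools/list\"}\n";
    if (write(to_child[1], req, strlen(req)) < 0) {
        return 1;
    }

    char buf[4096];
    const ssize_t n = read(from_child[0], buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        std::printf("response: %s", buf);
    }

    close(to_child[1]);
    close(from_child[0]);
    waitpid(pid, nullptr, 0);
    return 0;
}
```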
I don't use Jinja, but my understanding is that it would only "render" correct combinations of tags - still, being able to actually see the entire template would be helpful for testing (maybe an arg?).
Do you render with the legacy templates or bypass templates altogether?
@MaggotHATE The MCP stdio transport basically execs a process and opens a stdin/stdout channel to it. So it amounts to the user specifying one or more command lines to run. And if folks want to use HTTP/SSE there are "adapter" programs that can proxy the local requests/responses to HTTP/SSE (if they so desire). That means there is no networking built in, but the capability is 100% already there using some nodejs apps and so forth.
Thanks for explaining, I don't have first-hand experience with it and clearly misunderstood it. It will be interesting to have it in llama-cli as (probably) the most straightforward way to test MCP capabilities.
Do you render with the legacy templates or bypass templates altogether?
I use legacy-style templates in my own llama-cli-based program, mostly for the convenience of controlling everything from one .json config. If I remember correctly, there is a similar idea of simple "profile" files for llama.cpp, and in such a case .jinja templates would become less essential (you can store the template in that same file, along with sampling settings and model paths, for example). At the same time, ChatML, as the most popular template format, makes manual configuration almost pointless - it's too strict.
@CISC There is a race condition when the colors are changed using console::set_display(...): that routine sets the color immediately, but because llama-cli output is tightly coupled with the LOG macro, log messages queued via common_log_add(...) are only processed later on, so text can end up printed under the wrong color.
I think we need to separate the main output from the log output. Any existing call to LOG(...) should write immediately, as it should only ever go to stdout. If callers for whatever reason wanted to redirect this, it should be done explicitly on the command-line.
I fixed the issue by adding a console::write routine. Please see description for a screenshot of the new formatting with blue for the reasoning content.
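To illustrate the ordering problem (hypothetical names below, not the actual common/log.h or console API): a color escape written immediately can get separated from message text that still sits in an asynchronous log queue, whereas a direct write keeps them together.

```cpp
#include <cstdio>
#include <queue>
#include <string>

// Stand-in for the log worker's message queue; a real logger drains this from
// a separate thread, which is exactly why the ordering can go wrong.
static std::queue<std::string> g_log_queue;

static void queued_log(const std::string & msg) {   // like a LOG(...) call that enqueues
    g_log_queue.push(msg);
}

static void direct_write(const std::string & msg) { // like an immediate console write
    std::fputs(msg.c_str(), stdout);
    std::fflush(stdout);
}

int main() {
    direct_write("\033[34m");         // color change takes effect immediately
    queued_log("reasoning text\n");   // message is only queued, not yet printed
    direct_write("\033[0m");          // color reset also takes effect immediately

    // The queue is drained "later", the way a log worker would do it, so the
    // reasoning text only appears after the color has already been reset.
    while (!g_log_queue.empty()) {
        std::fputs(g_log_queue.front().c_str(), stdout);
        g_log_queue.pop();
    }
    return 0;
}
```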
One caveat is just that --log-disable will no longer disable output, but this is easy to fix with llama-cli ... >/dev/null for folks who need to silence it.
Happy to discuss/make further adjustments. 😊
The guard against stripped reasoning is very nice - it prevents crashes with several templates!
However something is not quite right, f.ex. with Qwen3-4B-Thinking-2507 the following happens on the second prompt (after initial -p):
[...]
151644 -> '<|im_start|>'
872 -> 'user'
198 -> '
'
151645 -> '<|im_end|>'
198 -> '
'
151644 -> '<|im_start|>'
77091 -> 'assistant'
198 -> '
'
151667 -> '<think>'
198 -> '
'
, the user just sent an empty message after my previous response. Hmm, I need to figure out what they want now.
Interesting. Did you type an empty message as second input? Any chance you would provide the full command history/conversation transcript? I will work on debugging that model.
Interesting. Did you type an empty message as second input? Any chance you would provide the full command history/conversation transcript? I will work on debugging that model.
I merely gave it an initial prompt with -p, then a follow-up prompt (no, not empty of course :) ) once it was done.
Edit: it was bartowski/Qwen_Qwen3-4B-Thinking-2507-GGUF
@CISC The issue with the Qwen models was that chat_formatter was calling common_chat_parse on user messages and the content was being placed into "reasoning content", hence the empty content! Thanks for catching the issue. It should be fixed in latest when you get time to check it.
@CISC The issue with the Qwen models was that chat_formatter was calling common_chat_parse on user messages and the content was being placed into "reasoning content", hence the empty content! Thanks for catching the issue. It should be fixed in latest when you get time to check it.
Yep, works great now, though I noticed it also swallows the first token (Okay in this case) in the response (then and now), but I think that may be a separate issue (the Hermes 2 Pro one too):
n_past = 3103
Parsing input with format Hermes 2 Pro: Okay
n_remain: -3084
eval: [ 'Okay':32313 ]
n_past = 3104
Parsing input with format Hermes 2 Pro: Okay,
,n_remain: -3085
eval: [ ',':11 ]
n_past = 3105
Parsing input with format Hermes 2 Pro: Okay, the
then_remain: -3086
eval: [ ' the':279 ]
n_past = 3106
Parsing input with format Hermes 2 Pro: Okay, the user
usern_remain: -3087
eval: [ ' user':1196 ]
Edit: Yeah, it doesn't do that with --reasoning-budget 0, quite likely due to the Hermes 2 Pro reasoning format.
Interesting, I was seeing the stripped token yesterday but I am not seeing it today... Perhaps after merging master in here? In any case, I just want to double check and make sure this is intended for enabling reasoning as it is set now:
cinputs.enable_thinking =
    params.use_jinja &&
    params.reasoning_budget != 0 &&
    common_chat_templates_support_enable_thinking(chat_templates.get());
I set --reasoning-budget 0 and it outputs the reasoning as a regular assistant message. Is this expected? I am using $ ./build/bin/llama-cli.exe -m Qwen_Qwen3-4B-Thinking-2507-Q8_0.gguf -p "What is nine plus three?" --jinja -co -c 2048 --reasoning-budget 0
./build/bin/llama-cli.exe -m Qwen_Qwen3-4B-Thinking-2507-Q8_0.gguf -p "What is nine plus three?" --jinja -co -c 2048
I set --reasoning-budget 0 and it outputs the reasoning as a regular assistant message. Is this expected? I am using
$ ./build/bin/llama-cli.exe -m Qwen_Qwen3-4B-Thinking-2507-Q8_0.gguf -p "What is nine plus three?" --jinja -co -c 2048 --reasoning-budget 0
No, that should disable thinking (it did for me).
I stepped through with Qwen a bit more and found the issue with the initial token was a problem in the partial_formatter: it had stale data from the previous reasoning, and because the new reasoning started with the same "Okay, ..." it matched against the previous one. I added a clear method that is called at the start of a new round, when user input is processed in the chat_formatter, to fix the issue.
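To make the failure mode concrete, here is a toy version of it (hypothetical names, not the actual chat_formatter/partial_formatter code): a delta emitter that diffs against the previous round's text treats a shared prefix as already printed unless it is cleared between rounds.

```cpp
#include <cstdio>
#include <string>

// Toy delta emitter: prints only the part of `text` beyond the common prefix with
// what was seen before. Stale state from round 1 swallows the start of round 2
// when both rounds begin with the same "Okay, ..." unless clear() is called.
struct delta_emitter {
    std::string seen;

    void emit(const std::string & text) {
        size_t k = 0;
        while (k < text.size() && k < seen.size() && text[k] == seen[k]) {
            k++;
        }
        std::printf("%s\n", text.c_str() + k);
        seen = text;
    }

    void clear() { seen.clear(); }
};

int main() {
    delta_emitter d;
    d.emit("Okay, the user wants a haiku.");        // round 1: printed in full
    d.emit("Okay, the user asked a follow-up.");    // round 2, no clear: "Okay, the user " is swallowed
    d.clear();
    d.emit("Okay, the user asked a follow-up.");    // round 2, with clear: printed in full
    return 0;
}
```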
In addition, I added "Thinking ..." prefix and "...\n\n" suffix, but I am open to changing those. Another possibility could be something like: "[Thinking: ... ]" which seems maybe easier to see, since the models tend to output ... more frequently than square brackets.
In addition, I added "Thinking ..." prefix and "...\n\n" suffix, but I am open to changing those. Another possibility could be something like: "[Thinking: ... ]" which seems maybe easier to see, since the models tend to output ... more frequently than square brackets.
Yeah, it really needs to stand out from regular output; that's hard to accomplish though. I was toying with the idea of perhaps just a simple < before reasoning and regular output, as opposed to the user's >.
Hmm. Perhaps we just leave it as "Thinking ..." / "..." for now, as it is the most "natural language" way; I would imagine folks will use color anyhow. In the future, if there's a reason for concern, we can change it. 😉
EDIT: I will create a couple screenshots we can use for comparison.
@CISC What do you think of these? If we want something terse, maybe a specific glyph might be best to convey the meaning:
Logic/Math symbols (most thematically appropriate):
- ∴ (U+2234) - "Therefore" symbol - perfect for reasoning/conclusions
- ∵ (U+2235) - "Because" symbol - good for premises/reasoning
- ⊢ (U+22A2) - Turnstile - used in logic for "proves" or "entails"
- ⇒ (U+21D2) - Double arrow - implies/entails
General delimiters (widely compatible):
- § (U+00A7) - Section sign - traditional formal marker
- ¶ (U+00B6) - Pilcrow - paragraph/section marker
- ※ (U+203B) - Reference mark - attention/note marker
- ⁂ (U+2042) - Asterism - decorative section break
- ◆ (U+25C6) - Black diamond
- ▸ (U+25B8) - Small triangle - often used for disclosure/expansion
@CISC What do you think of these? If we want something terse, maybe a specific glyph might be best to convey the meaning:
Logic/Math symbols (most thematically appropriate):
- ∴ (U+2234) - "Therefore" symbol - perfect for reasoning/conclusions
- ∵ (U+2235) - "Because" symbol - good for premises/reasoning
- ⊢ (U+22A2) - Turnstile - used in logic for "proves" or "entails"
- ⇒ (U+21D2) - Double arrow - implies/entails
Sorry for the slow response. The double arrow is perhaps not a bad one...
@CISC No worries on the delay! Merge conflicts on llama-cli should be minimal :)
Here are a few of the screenshots. I tend to agree that the double-arrow has the right contextual meaning and sufficient visual prominence. The other symbols kind of sink into the background a bit.
Here are a few of the screenshots. I tend to agree that the double-arrow has the right contextual meaning and sufficient visual prominence. The other symbols kind of sink into the background a bit.
I think it should also be prepended to the regular output to better mark the separation, maybe even colored green to match the input prompt.
Now, the trick is, if a user redirects output to a file we probably shouldn't be messing with the output like this, but then again we can't easily restore the thinking tokens either...
Now, the trick is, if a user redirects output to a file we probably shouldn't be messing with the output like this
Hmm. I think the usual way to handle this would be to write extra delimiters to stderr, so then outputs would just be redirected in that case: llama-cli ... --single-turn 2>/dev/null > conversation.txt. Is that something we would want to support in the console as well?
EDIT: So the idea is that the user calls llama-cli with --single-turn to get one chat iteration and redirects to a file in a non-interactive session, while interactive mode keeps working as it does now. In other modes it wouldn't apply the reasoning partial formatter. It might be fine to keep the double arrow in this case too, because there is no way with the current formatting to parse the trailing end of the reasoning, so it would be read by a human (or an AI agent, I guess).
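A tiny sketch of the stderr idea (hypothetical, not what llama-cli does today): decoration goes to stderr and model text to stdout, so 2>/dev/null > conversation.txt keeps the transcript clean.

```cpp
#include <cstdio>

// Markers/decoration on stderr, actual model output on stdout, so redirecting as
// `llama-cli ... 2>/dev/null > conversation.txt` drops the markers but keeps the text.
int main() {
    std::fprintf(stderr, "⇒ ");                                        // reasoning marker
    std::fprintf(stdout, "The user asks for a haiku about llamas.\n");  // reasoning text
    std::fprintf(stderr, "\n");                                         // end-of-reasoning separator
    std::fprintf(stdout, "Soft wool in the moonlight...\n");            // regular response
    return 0;
}
```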
I think it should also be prepended to the regular output to better mark the separation
How do you mean? Please show an example.
Here is a more explicit version which would allow something to parse the reasoning after the fact, while being slightly more human-readable than the regular templates. Reading it, though, I prefer the concise double-arrow format, but if we need parsing capability or something we'd have to consider this sort of option.
And then building on it, with tool calls it would be something like [Calling tool: get_weather]. I actually have that implemented with a Tcl interpreter already (pending a much-delayed release). But this would use MCP for llama-cli instead of calling a Tcl procedure.
I think it should also be prepended to the regular output to better mark the separation
How do you mean? Please show an example.
I simply mean the following (to copy your example):
⇒ The user asks: ...
⇒ Because ...
>
And then building on it, with tool calls it would be something like [Calling tool: get_weather]. I actually have that implemented with a Tcl interpreter already (pending a much-delayed release). But this would use MCP for llama-cli instead of calling a Tcl procedure.
Yeah, at that point it would certainly make more sense to have some very explicit delimiters like that.
Okay so to summarize some of these ideas:
- Prefix the reasoning block with an "indicator" like ⇒.
- Prefix all reasoning content with ⇒.
- Wrap reasoning in "normalized" delimiters like "[Reasoning: ... ]".
- Prefix with "Thinking... " and possible suffix like "..." (which I suppose is the same as 3 but harder to read).
Of these methods, the ones that could be parsed back out if a conversation is written to a file would be (2), (3) and (4), or some variant of those. Option (1) is the most minimal for an interactive conversation, but impossible to parse from a file, and it takes a little more cognitive load to separate the reasoning from the actual response. Option (2) is probably the easiest for seeing the entire block of reasoning at a glance, but somewhat verbose; when writing to a file, the reasoning blocks could be parsed line-by-line, which would work well.
With color enabled none of this matters, so we're talking mainly about (a) running interactively without color; (b) running a --single-turn chat and sending the output to a file.
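For what it's worth, a rough sketch of how option (2) output could be parsed back out of a saved transcript, assuming every reasoning line starts with "⇒ " (illustrative only):

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Classify transcript lines: lines starting with the reasoning marker are
// reasoning, everything else is regular response text.
int main(int argc, char ** argv) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " conversation.txt\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    const std::string marker = "⇒ ";
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind(marker, 0) == 0) {
            std::cout << "[reasoning] " << line.substr(marker.size()) << "\n";
        } else {
            std::cout << "[response]  " << line << "\n";
        }
    }
    return 0;
}
```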