
RFE: Add --reasoning-budget flag to control thinking in reasoning models

Open csoriano2718 opened this issue 2 months ago • 6 comments

Summary

Request to expose llama.cpp's --reasoning-budget flag in ramalama serve to properly control reasoning/thinking behavior in models like DeepSeek-R1.

Background

  • llama.cpp added the --reasoning-budget flag (PR #13771) to address issues where reasoning models continue generating thinking tokens even when disabled
  • The flag supports -1 (unrestricted, the default) and 0 (disable thinking completely); a direct llama-server example follows this list
  • This flag is more effective than the older --thinking flag or enable_thinking: false API parameter
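
For illustration, a minimal sketch of driving the flag on llama-server directly (the model path is a placeholder):

# disable thinking completely
llama-server --model /path/to/model.gguf --reasoning-budget 0
# unrestricted thinking (the default)
llama-server --model /path/to/model.gguf --reasoning-budget -1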

Current Situation

  • Ramalama 0.13.0 currently exposes a --thinking THINKING flag
  • The underlying llama-server in the container does support --reasoning-budget (verified with llama-server --help)
  • However, --thinking 0 does not effectively prevent DeepSeek-R1 from generating reasoning tokens
  • Result: Users cannot disable thinking even when explicitly requested, wasting inference time

Test Case

# Current behavior with --thinking 0
$ ramalama serve --port 8080 --thinking 0 ollama://library/deepseek-r1:latest
# Query: "What is 2+2?"
# Result: Still generates 200+ reasoning_content chunks before answering

The logs show hundreds of reasoning_content chunks being emitted despite --thinking 0.
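
For reference, a streaming request like the one below (sent to llama-server's OpenAI-compatible chat completions endpoint; the prompt and model alias are illustrative) makes the reasoning_content chunks visible in the response:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "library/deepseek-r1", "stream": true, "messages": [{"role": "user", "content": "What is 2+2?"}]}'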

Proposed Solution

Add a --reasoning-budget flag to ramalama serve that passes through to llama-server:

ramalama serve --port 8080 --reasoning-budget 0 ollama://library/deepseek-r1:latest

Alternative: Update the existing --thinking flag to internally use --reasoning-budget instead of the legacy parameter.

Benefits

  • Users can properly control reasoning model behavior
  • Aligns with upstream llama.cpp best practices
  • Fixes known limitation with DeepSeek-R1 and similar reasoning models
  • Improves inference efficiency when thinking is not desired

References

  • llama.cpp issues: #13160, #13189, #15401
  • llama.cpp PR: #13771
  • llama.cpp commit: e121edc

Environment

  • Ramalama: 0.13.0-1.fc42
  • Fedora: 42
  • llama-server version in container: b52edd2

csoriano2718 avatar Nov 11 '25 17:11 csoriano2718

@csoriano2718 in the meantime ramalama serve --runtime-args='--reasoning-budget 0' ... should work
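
For example, with the model from the report, the full command would look something like (illustrative):

ramalama serve --port 8080 --runtime-args='--reasoning-budget 0' ollama://library/deepseek-r1:latest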

olliewalsh avatar Nov 12 '25 10:11 olliewalsh

@rhatdan @engelmi in general I think we need to make this more generic. The CLI will get very noisy if we try to replicate every llama.cpp arg in the ramalama command line parser, and then every vLLM arg, and then every.... I think instead we should make this data-driven via the inference spec yaml.
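
As a purely hypothetical sketch (not the current inference-spec schema; the option and key names are invented for illustration), such a data-driven mapping might look roughly like:

# hypothetical inference-spec snippet, for illustration only
options:
  - name: thinking            # ramalama CLI option
    engine_arg: --reasoning-budget
    value_map:
      "0": "0"                # disable thinking
      "1": "-1"               # unrestricted (engine default)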

olliewalsh avatar Nov 12 '25 10:11 olliewalsh

Just took a closer look...

Alternative: Update the existing --thinking flag to internally use --reasoning-budget instead of the legacy parameter.

That is the current behaviour: https://github.com/containers/ramalama/blob/main/inference-spec/engines/llama.cpp.yaml#L35

Can confirm this by running ramalama --debug serve --thinking 0 ollama://library/deepseek-r1:latest which will log the llama-server command line:

llama-server --host 0.0.0.0 --port 8080 --model /mnt/models/deepseek-r1 --chat-template-file /mnt/models/chat_template_extracted --jinja --no-warmup --reasoning-budget 0 --alias library/deepseek-r1 --temp 0.8 --cache-reuse 256 -v -ngl 999 --threads 12 --log-colors on

olliewalsh avatar Nov 12 '25 12:11 olliewalsh

I agree, we want to make this more generic, and I would rather manipulate the existing --thinking option to do the new behavior rather than add a new option.

rhatdan avatar Nov 12 '25 14:11 rhatdan

and I would rather manipulate the existing --thinking option to do the new behavior rather than add a new option.

See my most recent comment: --thinking already maps to this llama-server arg. I'll need to take a closer look at llama.cpp; perhaps --reasoning-budget doesn't work correctly with this model.

olliewalsh avatar Nov 12 '25 14:11 olliewalsh

@olliewalsh thanks for the suggestion on --runtime-args. I've read in the MR that some models might still not behave as expected even when forcing a budget or limiting thinking, which might explain that.

I wonder if this issue should be closed then, since it seems to work as expected from Ramalama's perspective.

csoriano2718 avatar Nov 14 '25 09:11 csoriano2718