RFE: Add --reasoning-budget flag to control thinking in reasoning models
Summary
Request to expose llama.cpp's `--reasoning-budget` flag in `ramalama serve` to properly control reasoning/thinking behavior in models such as DeepSeek-R1.
Background
- llama.cpp added the `--reasoning-budget` flag (PR #13771) to address issues where reasoning models continue generating thinking tokens even when disabled
- The flag supports `-1` (unrestricted, the default) and `0` (disable thinking completely); see the sketch after this list
- This flag is more effective than the older `--thinking` flag or the `enable_thinking: false` API parameter
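For illustration only, a sketch of the two supported values passed directly to llama-server; the model path (`./model.gguf`) is a placeholder, not the invocation ramalama actually generates:
```
llama-server -m ./model.gguf --reasoning-budget -1   # unrestricted thinking (the default)
llama-server -m ./model.gguf --reasoning-budget 0    # disable thinking completely
```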
Current Situation
- Ramalama 0.13.0 currently exposes a `--thinking THINKING` flag
- The underlying llama-server in the container does support `--reasoning-budget` (verified with `llama-server --help`; a verification sketch follows this list)
- However, `--thinking 0` does not effectively prevent DeepSeek-R1 from generating reasoning tokens
- Result: users cannot disable thinking even when explicitly requested, wasting inference time
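One way to run that verification yourself, assuming the server runs in a podman container (the container name below is a placeholder):
```
podman exec <container> llama-server --help | grep -- '--reasoning-budget'
```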
Test Case
```
# Current behavior with --thinking 0
$ ramalama serve --port 8080 --thinking 0 ollama://library/deepseek-r1:latest
# Query: "What is 2+2?"
# Result: still generates 200+ reasoning_content chunks before answering
```
The logs show hundreds of `reasoning_content` chunks being emitted despite `--thinking 0`; a sketch for reproducing the count follows.
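This count can be reproduced against the running server, assuming the OpenAI-compatible `/v1/chat/completions` endpoint on port 8080 (the model name in the request body is a placeholder):
```
# Count streamed SSE chunks whose delta carries reasoning_content
curl -sN http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-r1","stream":true,"messages":[{"role":"user","content":"What is 2+2?"}]}' \
  | grep -c 'reasoning_content'
```
With thinking disabled correctly, this count should be 0.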
Proposed Solution
Add a `--reasoning-budget` flag to `ramalama serve` that passes through to llama-server:
```
ramalama serve --port 8080 --reasoning-budget 0 ollama://library/deepseek-r1:latest
```
Alternative: update the existing `--thinking` flag to internally use `--reasoning-budget` instead of the legacy parameter.
Benefits
- Users can properly control reasoning model behavior
- Aligns with upstream llama.cpp best practices
- Fixes known limitation with DeepSeek-R1 and similar reasoning models
- Improves inference efficiency when thinking is not desired
References
- llama.cpp `--reasoning-budget` PR #13771: https://github.com/ggml-org/llama.cpp/pull/13771
Environment
- Ramalama: 0.13.0-1.fc42
- Fedora: 42
- llama-server version in container: b52edd2
@csoriano2718 in the meantime `ramalama serve --runtime-args='--reasoning-budget 0' ...` should work
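Spelled out against the test case above, that interim workaround would be:
```
# Pass the llama-server flag through verbatim until a dedicated option exists
ramalama serve --port 8080 --runtime-args='--reasoning-budget 0' ollama://library/deepseek-r1:latest
```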
@rhatdan @engelmi in general I think we need to make this more generic. The CLI will get very noisy if we try to replicate every llama.cpp arg in the ramalama command-line parser, and then every vLLM arg, and then every.... I think instead we should make this data-driven via the inference spec yaml.
Just took a closer look...
> Alternative: update the existing `--thinking` flag to internally use `--reasoning-budget` instead of the legacy parameter.
That is the current behaviour: https://github.com/containers/ramalama/blob/main/inference-spec/engines/llama.cpp.yaml#L35
Can confirm this by running `ramalama --debug serve --thinking 0 ollama://library/deepseek-r1:latest`, which will log the llama-server command line:
```
llama-server --host 0.0.0.0 --port 8080 --model /mnt/models/deepseek-r1 --chat-template-file /mnt/models/chat_template_extracted --jinja --no-warmup --reasoning-budget 0 --alias library/deepseek-r1 --temp 0.8 --cache-reuse 256 -v -ngl 999 --threads 12 --log-colors on
```
I agree, we want to make this more generic, and I would rather manipulate the existing `--thinking` option to do the new behavior rather than add a new option.
> and I would rather manipulate the existing `--thinking` option to do the new behavior rather than add a new option.
See my most recent comment: `--thinking` already maps to this llama-server arg. I'll need to take a closer look at llama.cpp; perhaps `--reasoning-budget` doesn't work correctly with this model.
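One way to isolate whether this is a ramalama or a llama.cpp problem is to rerun the logged command against upstream llama-server directly. A sketch with placeholder paths:
```
# Flags mirror the logged command above; the model path is a placeholder
llama-server --model ./deepseek-r1.gguf --jinja --reasoning-budget 0 --port 8081
# then repeat the streaming query against port 8081 and count reasoning_content chunks
```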
@olliewalsh thanks for the suggestion on `--runtime-args`. I've read in the MR that some models might still not behave as expected even when forcing a budget or limiting thinking, which might explain that.
I wonder if this issue should be closed then; it seems to work as expected from Ramalama's perspective.