
RFE: Add --reasoning-budget flag to control thinking in reasoning models

Open csoriano2718 opened this issue 2 months ago • 6 comments

Summary

Request to expose llama.cpp's --reasoning-budget flag in ramalama serve to properly control reasoning/thinking behavior in models like DeepSeek-R1.

Background

  • llama.cpp added the --reasoning-budget flag (PR #13771) to address issues where reasoning models continue generating thinking tokens even when disabled
  • The flag supports -1 (unrestricted, the default) and 0 (disable thinking completely); a direct llama-server example follows this list
  • This flag is more effective than the older --thinking flag or enable_thinking: false API parameter
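
For illustration, a minimal sketch of driving the flag on llama-server directly (the model path is a placeholder):

# disable thinking completely
llama-server --model /path/to/model.gguf --reasoning-budget 0
# unrestricted thinking (the default)
llama-server --model /path/to/model.gguf --reasoning-budget -1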

Current Situation

  • Ramalama 0.13.0 currently exposes a --thinking THINKING flag
  • The underlying llama-server in the container does support --reasoning-budget (verified with llama-server --help)
  • However, --thinking 0 does not effectively prevent DeepSeek-R1 from generating reasoning tokens
  • Result: Users cannot disable thinking even when explicitly requested, wasting inference time

Test Case

# Current behavior with --thinking 0
$ ramalama serve --port 8080 --thinking 0 ollama://library/deepseek-r1:latest
# Query: "What is 2+2?"
# Result: Still generates 200+ reasoning_content chunks before answering

The logs show hundreds of reasoning_content chunks being emitted despite --thinking 0.
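
For reference, a streaming request like the one below (sent to llama-server's OpenAI-compatible chat completions endpoint; the prompt and model alias are illustrative) makes the reasoning_content chunks visible in the response:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "library/deepseek-r1", "stream": true, "messages": [{"role": "user", "content": "What is 2+2?"}]}'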

Proposed Solution

Add a --reasoning-budget flag to ramalama serve that passes through to llama-server:

ramalama serve --port 8080 --reasoning-budget 0 ollama://library/deepseek-r1:latest

Alternative: Update the existing --thinking flag to internally use --reasoning-budget instead of the legacy parameter.

Benefits

  • Users can properly control reasoning model behavior
  • Aligns with upstream llama.cpp best practices
  • Fixes known limitation with DeepSeek-R1 and similar reasoning models
  • Improves inference efficiency when thinking is not desired

References

  • llama.cpp issues: #13160, #13189, #15401
  • llama.cpp PR: #13771
  • llama.cpp commit: e121edc

Environment

  • Ramalama: 0.13.0-1.fc42
  • Fedora: 42
  • llama-server version in container: b52edd2

csoriano2718 avatar Nov 11 '25 17:11 csoriano2718

@csoriano2718 in the meantime ramalama serve --runtime-args='--reasoning-budget 0' ... should work
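
For example, with the model from the report, the full command would look something like (illustrative):

ramalama serve --port 8080 --runtime-args='--reasoning-budget 0' ollama://library/deepseek-r1:latest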

olliewalsh avatar Nov 12 '25 10:11 olliewalsh

@rhatdan @engelmi in general I think we need to make this more generic. The CLI will get very noisy if we try to replicate every llama.cpp arg in the ramalama command line parser, and then every vLLM arg, and then every.... I think instead we should make this data-driven via the inference spec yaml.
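
As a purely hypothetical sketch (not the current inference-spec schema; the option and key names are invented for illustration), such a data-driven mapping might look roughly like:

# hypothetical inference-spec snippet, for illustration only
options:
  - name: thinking            # ramalama CLI option
    engine_arg: --reasoning-budget
    value_map:
      "0": "0"                # disable thinking
      "1": "-1"               # unrestricted (engine default)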

olliewalsh avatar Nov 12 '25 10:11 olliewalsh

Just took a closer look...

Alternative: Update the existing --thinking flag to internally use --reasoning-budget instead of the legacy parameter.

That is the current behaviour: https://github.com/containers/ramalama/blob/main/inference-spec/engines/llama.cpp.yaml#L35

Can confirm this by running ramalama --debug serve --thinking 0 ollama://library/deepseek-r1:latest which will log the llama-server command line:

llama-server --host 0.0.0.0 --port 8080 --model /mnt/models/deepseek-r1 --chat-template-file /mnt/models/chat_template_extracted --jinja --no-warmup --reasoning-budget 0 --alias library/deepseek-r1 --temp 0.8 --cache-reuse 256 -v -ngl 999 --threads 12 --log-colors on

olliewalsh avatar Nov 12 '25 12:11 olliewalsh

I agree, we want to make this more generic, and I would rather manipulate the existing --thinking option to do the new behavior rather than add a new option.

rhatdan avatar Nov 12 '25 14:11 rhatdan

and I would rather manipulate the existing --thinking option to do the new behavior rather than add a new option.

See my most recent comment: --thinking already maps to this llama-server arg. I'll need to take a closer look at llama.cpp; perhaps --reasoning-budget doesn't work correctly with this model.

olliewalsh avatar Nov 12 '25 14:11 olliewalsh

@olliewalsh thanks for the suggestion on --runtime-args. I've read in the MR that some models might still not behave as expected even when forcing a budget or limiting thinking, which might explain that.

I wonder if this issue should be closed then, since it seems to work as expected from Ramalama's perspective.

csoriano2718 avatar Nov 14 '25 09:11 csoriano2718