
Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client

Open matteoserva opened this issue 7 months ago • 8 comments

This PR implements handling of additional jinja parameters, used for example to set enable_thinking in Qwen3 models.

The official template is still only partially compatible with llama.cpp's jinja implementation, so I modified it to use only supported features. It's here: ~~https://pastebin.com/16ZpCLHk~~ https://pastebin.com/GGuTbFRc It should be loaded with llama-server --jinja --chat-template-file {template_file}
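For illustration, a minimal launch sketch (the model path and the template filename below are placeholders, not part of this PR; adjust them to your setup):

# Save the modified template linked above as qwen3_template.txt, then:
llama-server \
  --model ./models/Qwen3-8B-Q4_K_M.gguf \
  --jinja \
  --chat-template-file qwen3_template.txt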

It fixes https://github.com/ggml-org/llama.cpp/issues/13160 and https://github.com/ggml-org/llama.cpp/issues/13189

Test it with:

  • enable_thinking=false. Expected: {"prompt":"\n<|im_start|>user\nGive me a short introduction to large language models.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"}
curl http://localhost:8080/apply-template -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5,
  "chat_template_kwargs": {"enable_thinking": false}
}'
  • enable_thinking=true
curl http://localhost:8080/apply-template -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5,
  "chat_template_kwargs": {"enable_thinking": true}
}'
  • enable_thinking undefined
curl http://localhost:8080/apply-template -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5
}'
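The same per-request field is meant to work on the regular OpenAI-compatible endpoint as well; this is a hedged sketch based on the test requests above (the model name is just a label for llama-server):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "chat_template_kwargs": {"enable_thinking": false}
}'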

matteoserva avatar Apr 29 '25 18:04 matteoserva

can you add chat_template_kwargs to cli argument as well?

rhjdvsgsgks avatar Apr 29 '25 20:04 rhjdvsgsgks

can you add chat_template_kwargs to cli argument as well?

I added it. I tested it with the updated command (you might want to check the escaping of the double quotes): --chat_template_kwargs "{\"enable_thinking\":false}" --jinja --chat-template-file qwen/qwen3_template.txt
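A sketch of the same invocation using single quotes, which avoids escaping the inner double quotes (note that the flag is spelled with both dashes and underscores in this thread, so check llama-server --help on your build for the exact name):

llama-server --jinja \
  --chat-template-file qwen/qwen3_template.txt \
  --chat-template-kwargs '{"enable_thinking": false}'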

matteoserva avatar Apr 30 '25 07:04 matteoserva

Very useful for Qwen3 series. +1 for this feature!

neolee avatar May 01 '25 03:05 neolee

It does not work with the --chat_template_kwargs option from the CLI: error: invalid argument: --chat-template-kwargs. Built from: git clone --depth=1 -b enable_thinking https://github.com/matteoserva/llama.cpp

xiaomi102 avatar May 08 '25 23:05 xiaomi102

@ggerganov is there any reason why this PR has not been accepted and merged yet?

celsowm avatar May 09 '25 00:05 celsowm

Cannot work with the --chat_template_kwargs option from CLI: error: invalid argument: --chat-template-kwargs

This PR is implemented only for llama-server and its webui.

llama-cli has unresolved bugs that prevent me from enabling this feature.

matteoserva avatar May 09 '25 09:05 matteoserva

Cannot work with the --chat_template_kwargs option from CLI: error: invalid argument: --chat-template-kwargs

This PR is implemented only for llama-server and its webui.

llama-cli has unresolved bugs that prevent me from enabling this feature.

Hope you'll integrate it for the CLI environment soon, thanks!

xiaomi102 avatar May 09 '25 09:05 xiaomi102

Hope you'll integrate it for the CLI environment soon, thanks!

I'll open a new PR soon for llama-cli. The code is ready but it's blocked by https://github.com/ggml-org/llama.cpp/issues/13402 and https://github.com/ggml-org/llama.cpp/issues/13404

matteoserva avatar May 09 '25 13:05 matteoserva

It would be nice to have an enable_thinking checkbox or something like that in the llama.cpp webui too

celsowm avatar May 15 '25 15:05 celsowm

@ggerganov is there any reason why this PR has not been accepted and merged yet?

@celsowm Lack of eyes on this area would be my guess.

With 438 open PRs (many obsolete), I've kind of come to accept I'll need to pull in some PRs of interest to me when building.

strawberrymelonpanda avatar May 15 '25 16:05 strawberrymelonpanda

@ggerganov is there any reason why this PR has not been accepted and merged yet?

@celsowm Lack of eyes on this area would be my guess.

With 438 open PRs (many obsolete), I've kind of come to accept I'll need to pull in some PRs of interest to me when building.

vLLM and SGLang had this feature on the day Qwen3 was released. Meanwhile, many useful enhancement and fix PRs become obsolete just because of merge delays here in the llama.cpp community. Really sad about that.

neolee avatar May 16 '25 04:05 neolee

This is so necessary when dealing with Qwen3! Can't wait to see this merged and be able to use the latest version with this <3

Neath avatar May 16 '25 20:05 Neath

FYI now that https://github.com/ggml-org/llama.cpp/pull/13573 is merged, the official template should work as expected and there's no need to use the modified one. We're one step closer to having proper support for Qwen3. One remaining thing is the correct handling of the <think>...</think> part in previous assistant messages.
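In other words, loading should now reduce to the template embedded in the GGUF (a hedged sketch; the model path is a placeholder):

llama-server -m ./models/Qwen3-8B-Q4_K_M.gguf --jinja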

taha-yassine avatar May 16 '25 21:05 taha-yassine

  • While being able to pass kwargs will be very useful in general (🎉), I think for the particular case of enable_thinking we will probably want to special case it, since there's a few thinking models around, some of which force the thinking tag open (a common --disable-thinking flag could close their tags, and set the enable_thinking: false for Qwen3).

Why not accept this PR first to get the general "pass kwargs to chat template" feature, then implement the enable_thinking flag on top of it?

neolee avatar May 17 '25 02:05 neolee

@ochafik

* While being able to pass kwargs will be very useful in general (🎉), I think for the particular case of enable_thinking we will probably want to special case it, since there's a few thinking models around, some of which force the thinking tag open (a common --disable-thinking flag could close their tags, and set the `enable_thinking: false` for Qwen3). [...]

I agree with @neolee. He expressed exactly my opinion on this.


* We'll want to pass the params even when there are tools (right now only set up in common_chat_params_init_without_tools). Ideally after the diffs PR goes in 😅

I also added support for this in common_chat_params_init_generic() in the latest commit. I don't know if it makes sense for the other specialized init functions.

matteoserva avatar May 17 '25 16:05 matteoserva

  • While being able to pass kwargs will be very useful in general (🎉), I think for the particular case of enable_thinking we will probably want to special case it, since there's a few thinking models around, some of which force the thinking tag open (a common --disable-thinking flag could close their tags, and set the enable_thinking: false for Qwen3).

Why not accept this PR first to get the general "pass kwargs to chat template" feature, then implement the enable_thinking flag on top of it?

@neolee I think disabling thoughts should be handled more manually; the enable_thinking extra arg works for Qwen3 only. I sent out https://github.com/ggml-org/llama.cpp/pull/13771 for review to handle the other supported thinking models.

I also added support for this in common_chat_params_init_generic() in the latest commit. I don't know if it makes sense for the other specialized init functions.

@matteoserva Thanks! I think each and every code path should pass the extra context variables if we want to support generic use cases (although as I just noted in a review comment, we should be wary of requiring users to set custom key-values for features that would be better handled in a generic & unified way across templates & models; ideally most users won't need to inspect any templates, and the ones who do can trivially create their own template overrides anyway).

ochafik avatar May 25 '25 08:05 ochafik

Just to make it explicit, the added value of this PR is that:

  • You can set variables per request instead of restarting the server with a different template
  • Compatibility with the vLLM API, so the same code works for both engines (see the example below)
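As an illustration of the vLLM compatibility point, the requests shown earlier can be sent unchanged to a vLLM OpenAI-compatible server (hedged sketch; port 8000 is vLLM's usual default and the model name must match whatever model that server loaded):

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "chat_template_kwargs": {"enable_thinking": false}
}'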

matteoserva avatar May 25 '25 08:05 matteoserva

Thanks for this PR @matteoserva, I'm cherry-picking it for my own use :)

aviallon avatar May 26 '25 09:05 aviallon

I applied the requested changes, but the latest rebase has been rough. Please review more carefully to make sure I didn't make any mistakes (sorry).

matteoserva avatar May 27 '25 09:05 matteoserva

We do need this PR, maintainers please review this.

qinxuye avatar May 29 '25 12:05 qinxuye

It looks like it's working well, you guys are so cool

exxocism avatar Jun 03 '25 12:06 exxocism

@ngxson shouldn't this be merged?

aviallon avatar Jun 05 '25 17:06 aviallon

Thanks to all contributors, looking forward to it

zghnwsq avatar Jun 06 '25 09:06 zghnwsq

I think this PR can also be applied to deepseek-r1, if we append "{% if ns.is_last_user %}{% if enable_thinking is defined and enable_thinking is false %}{{'<think>\n\n</think>\n\n'}}{% endif %}{% endif %}" to the DeepSeek template

squirrelfish avatar Jun 06 '25 09:06 squirrelfish

I think this PR can also be applied to deepseek-r1, if we append "{% if ns.is_last_user %}{% if enable_thinking is defined and enable_thinking is false %}{{'<think>\n\n</think>\n\n'}}{% endif %}{% endif %}" to the DeepSeek template

This is not needed, as there is a new reasoning effort param already.

aviallon avatar Jun 06 '25 09:06 aviallon

I think this PR can also be applied to deepseek-r1, if we append "{% if ns.is_last_user %}{% if enable_thinking is defined and enable_thinking is false %}{{'<think>\n\n</think>\n\n'}}{% endif %}{% endif %}" to the DeepSeek template

This is not needed, as there is a new reasoning effort param already.

The new reasoning param requires restarting the server to change the value. With this PR you can set it per request.

There is an ongoing effort to implement this: https://github.com/ggml-org/llama.cpp/issues/13272

matteoserva avatar Jun 06 '25 10:06 matteoserva

I've built a server version of this branch and I'm missing the KV cache % usage from the /metrics endpoint. Does anyone know why this is happening?

Flags used for build:

# GGML_NATIVE=OFF: don't auto-detect CPU
CXXFLAGS="-march=core2 -mtune=generic" cmake .. \
  -DLLAMA_BUILD_SERVER=ON \
  -DGGML_VULKAN=ON \
  -DGGML_NATIVE=OFF \
  -DGGML_AVX=OFF -DGGML_AVX2=OFF \
  -DGGML_AVX512=OFF -DGGML_AVX_VNNI=OFF \
  -DGGML_FMA=OFF -DGGML_F16C=OFF \
  -DGGML_AMX_TILE=OFF -DGGML_AMX_INT8=OFF -DGGML_AMX_BF16=OFF \
  -DGGML_SSE42=ON \
  -DCMAKE_BUILD_TYPE=Release

Sample output of /metrics endpoint:

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 143
# HELP llamacpp:prompt_seconds_total Prompt process time
# TYPE llamacpp:prompt_seconds_total counter
llamacpp:prompt_seconds_total 8.168
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 673
# HELP llamacpp:tokens_predicted_seconds_total Predict process time
# TYPE llamacpp:tokens_predicted_seconds_total counter
llamacpp:tokens_predicted_seconds_total 29.567
# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total 673
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode 1
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 17.5073
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 22.7619
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0

rasbid avatar Jun 09 '25 14:06 rasbid

I've built a server version of this branch and I'm missing the KV cache % usage from the /metrics endpoint. Does anyone know why this is happening?

It may be unrelated to this branch and more related to changes in master (the KV cache refactor). Can you reproduce it with master?

aviallon avatar Jun 09 '25 23:06 aviallon

@rasbid These metrics have been removed recently - see the changelog: https://github.com/ggml-org/llama.cpp/issues/9291

ggerganov avatar Jun 10 '25 08:06 ggerganov