support qwen3 /think & /no_think & enable_thinking parameter
- Add support for parsing the "/think" and "/no_think" commands, with "/no_think" mode as the default.
- When the model is not told to think, append "<think>\n\n</think>\n\n" to the prompt to skip thinking.
- Support setting the thinking mode for the qwen3 model through the "enable_thinking" parameter or the "/think" command in the chat_completions_v1 API.
Related issue: https://github.com/InternLM/lmdeploy/issues/3511
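For illustration, the soft switch can also be exercised inline in the message content; a minimal sketch against a running API server (the port and model name here are placeholders, not from this PR):
curl http://localhost:5656/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-8B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models. /no_think"}
],
"max_tokens": 1024
}'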
Hi @BUJIDAOVS, thank you very much for your contribution to LMDeploy. There are some linting errors. Please fix them as follows:
pip install pre-commit==3.8.0 # make sure the python version < 3.11
cd lmdeploy # the root directory of lmdeploy repo
pre-commit install
pre-commit run --all-files
Thanks again for your dedicated contributions.
As I was testing the functionality, how are we expected to use this feature? Currently, I use the following commands after launching the API server.
- Disable thinking
curl http://localhost:5656/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/nvme4/huggingface_hub/hub/models--Qwen--Qwen3-8B/snapshots/a80f5e57cce20e57b65145f4213844dec1a80834",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 1024,
"enable_thinking": false
}'
- Enable thinking
curl http://localhost:5656/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "models--Qwen--Qwen3-8B/snapshots/a80f5e57cce20e57b65145f4213844dec1a80834",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 1024,
"enable_thinking": true
}'
With "enable_thinking": false, the output contents still have the thinking process. Is there anything wrong with my test commands?
Specify "--chat-template qwen3" when starting the service. In this template, the default model is in No-Think Mode. Add "enable_thinking": true switches to Think mode.
