support qwen3 /think & /no_think & enable_thinking parameter
- Add support for parsing the "/think" and "/no_think" commands, with "/no_think" mode as the default.
- When the model is not told to think, append "<think>\n\n</think>\n\n" to the prompt to skip thinking.
- Support setting the thinking mode for the qwen3 model through the "enable_thinking" parameter or the "/think" command in the chat_completions_v1 API.
Related issue: https://github.com/InternLM/lmdeploy/issues/3511
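For illustration, the soft switch can also be exercised inline in the message content; a minimal sketch against a running API server (the port and model name here are placeholders, not from this PR):
curl http://localhost:5656/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-8B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models. /no_think"}
],
"max_tokens": 1024
}'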
Hi @BUJIDAOVS, thank you very much for your contribution to LMDeploy. There are some linting errors. Please fix them as follows:
pip install pre-commit==3.8.0 # make sure the python version < 3.11
cd lmdeploy # the root directory of lmdeploy repo
pre-commit install
pre-commit run --all-files
Thanks again for your dedicated contributions.
As I was testing the functionality, how are we expected to use this feature? Currently, I use the following commands after launching the API server.
- Disable thinking
curl http://localhost:5656/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/nvme4/huggingface_hub/hub/models--Qwen--Qwen3-8B/snapshots/a80f5e57cce20e57b65145f4213844dec1a80834",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 1024,
"enable_thinking": false
}'
- Enable thinking
curl http://localhost:5656/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "models--Qwen--Qwen3-8B/snapshots/a80f5e57cce20e57b65145f4213844dec1a80834",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 1024,
"enable_thinking": true
}'
With "enable_thinking": false, the output contents still have the thinking process. Is there anything wrong with my test commands?
Specify "--chat-template qwen3" when starting the service. In this template, the default model is in No-Think Mode. Add "enable_thinking": true switches to Think mode.
