feat: Add chat template content like `<think>` to response
## Motivation

Currently, the chat template adds prefixes (e.g., `<think>` for DeepSeek models) to assistant messages, but these prefixes aren't included in the API response. This causes confusion for users. This PR ensures all chat template prefixes are properly included in responses.
## Modifications

Modified the OpenAI API's chat completion endpoint to properly include chat template prefixes in responses. This handles all cases, including forced outputs like `<think>` and wrapped tokens like `<think></think>`.
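The change described above could be sketched roughly as follows. This is an illustrative snippet, not SGLang's actual internals; the function name and call site are assumptions:

```python
from typing import Optional


def include_template_prefix(generated_text: str, forced_prefix: Optional[str]) -> str:
    """Return the response content with the chat-template prefix restored.

    If the chat template force-opens a block (e.g. "<think>\n" in the
    DeepSeek-R1 template), the model never emits that prefix itself, so we
    prepend it before building the API response. The prefix is skipped if
    the model's output already begins with it.
    """
    if forced_prefix and not generated_text.startswith(forced_prefix):
        return forced_prefix + generated_text
    return generated_text
```

For example, `include_template_prefix("Okay, the user is asking...", "<think>\n")` would yield a response content that starts with `<think>\n`, matching what vLLM returns by default.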
## Checklist
- [ ] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
@shuaills @sleepcoo @minleminzui Nice PR. When will it be ready to merge?
@zhaochenyang20 This PR requires approval from all of these people: @CatherineSue, @ispobock and @sleepcoo. Our previous changes to the oai adapter affected their internal services, so they need to confirm that there are no issues with any future changes to the oai adapter.
> If want to force the prefix
FYR, the default response from vLLM is:

```json
{"id":"chatcmpl-7cf0143a0d374aad8bdc3360c48619a3","object":"chat.completion","created":1747105380,"model":"DeepSeek-R1","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, the user is asking who won the World Series in 2020. Let me recall... T
```
I'm not very familiar with the context, but performing two separate tokenization steps feels like a workaround rather than a robust fix. I'm also concerned about whether this affects models other than DeepSeek.
The intention of this PR is to handle cases where the chat template appends additional content after the assistant's message, ensuring all such content is properly output. It should affect all models.

You're right that two separate tokenization steps can introduce differences, due to the nature of tokenization. I will modify this part.
Thanks.
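To illustrate the concern about two separate tokenization steps: most subword tokenizers are greedy, so tokenizing a prefix and a completion separately can produce different tokens than tokenizing the joined string. The toy longest-match tokenizer below (not a real model's vocabulary) shows the effect:

```python
# Toy greedy longest-match tokenizer. The vocabulary is invented purely to
# demonstrate that tokenize(a) + tokenize(b) != tokenize(a + b) in general.
VOCAB = ["<think>", "<think>\n", "\nOk", "Ok", "\n"]


def tokenize(text: str) -> list:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        match = max(
            (t for t in VOCAB if text.startswith(t, i)),
            key=len,
            default=text[i],  # fall back to a single character
        )
        tokens.append(match)
        i += len(match)
    return tokens


# Tokenized separately, the prefix and the completion split one way...
separate = tokenize("<think>") + tokenize("\nOk")   # ["<think>", "\nOk"]
# ...but tokenized together, the greedy matcher merges across the boundary.
joined = tokenize("<think>" + "\nOk")               # ["<think>\n", "Ok"]
```

Here `separate != joined`, which is why stitching together two independently tokenized segments is not guaranteed to reproduce what the model would see for the full string.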
If we want to force the prefix to be generated, is it more elegant to set a chat template? I personally think that is better than implementing it in code here, because some services that use similar models have already made compatible adaptations. After this PR is introduced, many services will probably need to make adjustments.
I agree. This API change may require additional adaptations in user service code. Is this behavior common in the standard OpenAI API? If not, we don't need to enable it by default.
> is it more elegant to set a chat template?
@lambert0312 I also agree, it is indeed necessary to consider compatibility. Changing the code this way is not as convenient as applying a custom chat template. In fact, all that needs to be done is to remove `<think>\n` from the R1 chat template, without needing to modify the code.
We may be able to offer a way to enable a custom chat template with `--chat-template deepseek-r1`; by default the behavior remains unchanged. I believe many online services already using SGLang default to a custom chat template anyway. wdyt?
> In fact, all that needs to be done is to remove `<think>\n` from the r1 chat template, without needing to modify the code.
yes
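For clarity, the template-side fix discussed above would amount to something like the following Jinja fragment (illustrative only; the real DeepSeek-R1 template and its special tokens may differ):

```jinja
{# Before: the generation prompt force-opens the reasoning block,
   so "<think>\n" never appears in the model's own output. #}
{% if add_generation_prompt %}<|Assistant|><think>\n{% endif %}

{# After: drop the forced "<think>\n" so the model emits it itself
   and it appears in the API response with no server-code changes. #}
{% if add_generation_prompt %}<|Assistant|>{% endif %}
```

The modified template could then be supplied at launch via `--chat-template`, leaving the default behavior untouched for existing services.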