[Bug]: `v0.8.5`: Special tokens (`<think>`, `</think>`) are split during streaming with Qwen3-FP8
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
When serving the Qwen3-FP8 model using vLLM v0.8.5 with streaming output enabled, special tokens like `<think>` and `</think>` are split across multiple streamed chunks instead of being emitted as whole tokens.
This issue occurs both when running the server with the basic command:

```
vllm serve Qwen/Qwen3-8B-FP8
```

and when running it with the reasoning parser enabled:

```
vllm serve Qwen/Qwen3-8B-FP8 --enable-reasoning-parser --reasoning-parser deepseek_r1
```
Steps to Reproduce:
- Install vLLM v0.8.5.
- Start the vLLM server using either of the commands mentioned above.
- Send a request to the server's generation endpoint (e.g., `/v1/chat/completions` or `/generate`) with `stream=True`. Use a prompt that is likely to cause the Qwen/Qwen3-8B model to output `<think>...</think>` blocks. A minimal client sketch follows this list.
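For reference, here is a minimal streaming client sketch, assuming the server above is listening on the default `http://localhost:8000` and the `openai` Python package is installed:

```python
from openai import OpenAI

# Reproduction client sketch: stream a chat completion from the local vLLM
# server and print each delta as it arrives, so any split of <think> across
# chunk boundaries is visible.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen3-8B-FP8",
    messages=[{"role": "user", "content": "Briefly explain why the sky is blue."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(repr(delta))  # each streamed chunk on its own line
```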
Expected Behavior:
Special tokens like `<think>` and `</think>` should each arrive intact within a single streamed chunk, never split across chunk boundaries.
Example of expected chunk sequence:
```
Chunk 1: <think>
Chunk 2: \n
Chunk 3: Okay
Chunk 4: \n
Chunk 5: </think>
Chunk 6: ...final answer...
```
Actual Behavior:
The special tokens are split across chunk boundaries.
Example of actual chunk sequence observed:

```
Chunk 1: <th
Chunk 2: i
Chunk 3: nk>\n
Chunk 4: Okay
Chunk 5: \n<
Chunk 6: /
Chunk 7: t
Chunk 8: hin
Chunk 9: k
Chunk 10: >\n
Chunk 11: ...final answer...
```
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
You should change the argument.
```diff
- --enable-reasoning-parser
+ --enable-reasoning
```
`enable-reasoning-parser` seems mandatory on 0.8.5; I get a `TypeError`: `enable-reasoning` requires `reasoning-parser`.
You can try it: `--enable-reasoning --reasoning-parser deepseek_r1`
@csy1204 @cyr1x @chaunceyjiang That's not the issue.
Seems like `<think>` is not a special token in the vocab.
FWIW, I'm having the exact same issue, with same vllm start params. Issue does not seem present in other reasoning models, though, so I wonder if it's specific to Qwen3 FP8 variants.
https://huggingface.co/Qwen/Qwen3-8B-FP8/blob/main/tokenizer_config.json#L197
It has the special token in the tokenizer config json. Weird.
I think it is caused by the model rather than our implementation. We get the generated text token by token; we expect the whole `<think>` to come back in one generation step, not a single character at a time from the model.
@gaocegege That's strange. I checked the model's configuration values as well, but I couldn't find anything unusual. I'm curious about how this issue occurred.
Indeed. I also want to know. Interesting.
Maybe we could give it a try with Hugging Face Transformers first.
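Something like this sketch (assuming `transformers` is installed) would show whether the tags encode to single ids in the Qwen/Qwen3-8B-FP8 tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-FP8")

for tag in ["<think>", "</think>"]:
    ids = tok.encode(tag, add_special_tokens=False)
    # A single id means the tag is one token in the vocab, so the splitting
    # would have to happen later (e.g. in the streaming/detokenize path).
    print(tag, ids, tok.convert_ids_to_tokens(ids))
```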
@gaocegege Turning on the `--tokenizer-mode slow` option resolved some of the token issues, but `trim()` was still necessary.
https://github.com/huggingface/chat-ui/issues/1807#issuecomment-2841759886
In my opinion, there seems to be an issue with the detokenizer or related logic.
@gaocegege I tested with v0.8.5.post1 and it seems like the issue has been resolved, but there's no related information in the update logs... Why is that?
@chaunceyjiang Do you have any idea about it?
Hey @calycekr since you closed as complete: is this confirmed fixed? If so can you point to the commit that fixed it? TIA.
@rdodev Yes, the issue is confirmed fixed in v0.8.5.post1. I'm not sure which exact commit resolved it, as it doesn't appear in the release notes. It may have been fixed indirectly through dependency updates.
Let me know if you'd like me to look into it further!
The root cause of why this occurs (the actual chunk sequence observed: `<th`, `i`, `nk>\n`, `Okay`, `\n<`, `/`, `t`, `hin`, `k`, `>\n`, `...final answer...`) is that vLLM truncates the text of every chunk when checking whether it has encountered a "stop str". Here is the source code, `vllm/sampling_params.py`:

```python
class SamplingParams(...):
    """Sampling parameters for text generation.
    ...
    stop: list of strings that stop the generation when they are
        generated. The returned output will not contain the stop strings.
    stop_token_ids: list of tokens that stop the generation when they are
        generated. The returned output will contain the stop tokens unless
        the stop tokens are special tokens.
    """
    ...
    # The below fields are not supposed to be used as an input.
    # They are set in post_init.
    output_text_buffer_length: int = 0
    ...
    # Number of characters to hold back for stop string evaluation
    # until sequence is finished.
    if self.stop and not self.include_stop_str_in_output:
        self.output_text_buffer_length = max(len(s) for s in self.stop) - 1
    ...
```
`/opt/miniconda/lib/python3.12/site-packages/vllm/entrypoints/openai/protocol.py`:

```python
class ChatCompletionRequest(OpenAIBaseModel):
    ...
    stop: Optional[Union[str, list[str]]] = []
```

The `stop` field you pass through the request will affect the `output_text_buffer_length` in `SamplingParams`.
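As a standalone illustration of that post-init rule (the stop string here is hypothetical, not from vLLM):

```python
# A request-level stop string makes vLLM hold back characters from every
# streamed chunk, per the SamplingParams excerpt above.
stop = ["</answer>"]
include_stop_str_in_output = False

output_text_buffer_length = 0
if stop and not include_stop_str_in_output:
    output_text_buffer_length = max(len(s) for s in stop) - 1

print(output_text_buffer_length)  # 8 -> the last 8 characters are withheld per chunk
```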
`vllm/outputs.py`:

```python
class RequestOutput:
    ...
    @classmethod
    def from_seq_group(...):
        ...
        text_buffer_length = sampling_params.output_text_buffer_length
        delta = sampling_params.output_kind == RequestOutputKind.DELTA

        outputs = []
        include_prompt = True
        # num_cached_tokens should be the same for all the sequences
        num_cached_tokens = None
        for i, seq in enumerate(top_n_seqs):
            output_text = seq.get_output_text_to_return(
                text_buffer_length, delta)
```

`vllm/sequence.py`:

```python
class Sequence:
    ...
    def get_output_text_to_return(self, buffer_length: int,
                                  delta: bool) -> str:
        """If delta is True, only new text since the last call to
        this method is returned"""
        # We return the full output text if the sequence is finished.
        truncate = buffer_length and not self.is_finished()
        if not delta:
            return self.output_text[:-buffer_length] if truncate else (
                self.output_text)
        length = len(self.output_text)
        if truncate:
            length -= buffer_length
        last_offset = self._last_output_text_offset
        if last_offset < length:
            self._last_output_text_offset = length
            return self.output_text[last_offset:length]
        return ""
```
The `output_text_buffer_length` causes text to be buffered or truncated in every chunk, because vLLM needs to cache text so that the stop string can be kept out of the returned output (the default when `include_stop_str_in_output` is not set to true).
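To see how that buffering produces the fragmented `<think>` chunks, here is a minimal standalone simulation (not vLLM itself) that re-implements the delta path of `get_output_text_to_return` quoted above; the 7-character buffer corresponds to an 8-character stop string:

```python
class FakeSequence:
    def __init__(self):
        self.output_text = ""
        self._last_output_text_offset = 0

    def is_finished(self):
        return False  # pretend generation is still running

    def get_output_text_to_return(self, buffer_length, delta):
        # Same logic as the vLLM excerpt above, delta path only.
        truncate = buffer_length and not self.is_finished()
        length = len(self.output_text)
        if truncate:
            length -= buffer_length
        last_offset = self._last_output_text_offset
        if last_offset < length:
            self._last_output_text_offset = length
            return self.output_text[last_offset:length]
        return ""

seq = FakeSequence()
buffer_length = 7  # an 8-character stop string minus one
for piece in ["<think>", "\n", "Okay", "\n", "</think>", "\n...final answer..."]:
    seq.output_text += piece  # what the detokenizer appended this step
    print(repr(seq.get_output_text_to_return(buffer_length, delta=True)))
# Prints '', '<', 'thin', 'k', '>\nOkay\n<', '/think>\n...final an':
# the <think> tags reach the stream consumer in fragments.
```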
Quick fix: use `stop_token_ids` instead of the `stop` field in `SamplingParams` or `ChatCompletionRequest` when using the reasoning or tool-call features. Stop-string checking happens post-detokenization, while `stop_token_ids` checking happens before detokenization. The reasoning parser and tool-call parser run post-detokenization, so they may receive truncated text in every chunk.
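For instance, a hedged sketch of the workaround with the OpenAI client (it assumes vLLM's extra `stop_token_ids` sampling parameter is passed via `extra_body`, and that 151645 is Qwen's usual `<|im_end|>` id; verify it against the model's tokenizer config):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen3-8B-FP8",
    messages=[{"role": "user", "content": "Think step by step: 17 * 23 = ?"}],
    stream=True,
    # stop_token_ids is checked on token ids before detokenization, so no
    # characters are held back from the streamed text. The id 151645
    # (<|im_end|>) is an assumption -- check tokenizer_config.json.
    extra_body={"stop_token_ids": [151645]},
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(repr(delta))
```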