
[Bug]: `v0.8.5`: Special tokens (`<think>`, `</think>`) are split during streaming with Qwen3-FP8


Your current environment

(Output of `python collect_env.py` not provided.)

🐛 Describe the bug

When serving the Qwen3-FP8 model with vLLM v0.8.5 and streaming enabled, special tokens like `<think>` and `</think>` (used to delimit reasoning steps) are not treated as atomic units. Instead, they are often split across multiple streamed chunks.

This issue occurs both with the basic serve command and with the reasoning parser enabled:

vllm serve Qwen/Qwen3-8B-FP8
vllm serve Qwen/Qwen3-8B-FP8 --enable-reasoning-parser --reasoning-parser deepseek_r1

Steps to Reproduce:

  1. Install vLLM v0.8.5.
  2. Start the vLLM server using either of the commands mentioned above.
  3. Send a request to the server's generation endpoint (e.g., `/v1/chat/completions` or `/generate`) with `stream=True`, using a prompt that is likely to make the Qwen/Qwen3-8B model emit `<think>...</think>` blocks (a minimal client sketch is shown after this list).
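
For reference, a minimal streaming client for step 3 might look like the following sketch. The base URL, port, placeholder API key, and prompt are assumptions; the model name matches the serve command above.

```python
# Minimal streaming client sketch (assumes the vLLM server above is listening
# on localhost:8000 with the OpenAI-compatible API).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen3-8B-FP8",
    messages=[{"role": "user", "content": "Explain why the sky is blue."}],
    stream=True,
)

for chunk in stream:
    # Print each streamed delta on its own line to inspect how the
    # <think> / </think> markers are split across chunk boundaries.
    if chunk.choices and chunk.choices[0].delta.content:
        print(repr(chunk.choices[0].delta.content))
```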

Expected Behavior:

Special tokens like `<think>` and `</think>` should be treated as atomic units by the tokenizer/detokenizer during streaming. Each special token should be contained entirely within a single streamed chunk.

Example of expected chunk sequence: Chunk 1: `<think>`, Chunk 2: `\n`, Chunk 3: `Okay`, Chunk 4: `\n`, Chunk 5: `</think>`, Chunk 6: `...final answer...`

Actual Behavior:

The special tokens are split across chunk boundaries.

Example of actual chunk sequence observed: Chunk 1: `<th`, Chunk 2: `i`, Chunk 3: `nk>\n`, Chunk 4: `Okay`, Chunk 5: `\n<`, Chunk 6: `/`, Chunk 7: `t`, Chunk 8: `hin`, Chunk 9: `k`, Chunk 10: `>\n`, Chunk 11: `...final answer...`

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

calycekr avatar Apr 30 '25 10:04 calycekr

You should change the argument.

- --enable-reasoning-parser
+ --enable-reasoning

csy1204 avatar Apr 30 '25 13:04 csy1204

`--enable-reasoning-parser` seems mandatory on 0.8.5; I get a `TypeError: enable-reasoning requires reasoning-parser`.

cyr1x avatar Apr 30 '25 14:04 cyr1x

You can try `--enable-reasoning --reasoning-parser deepseek_r1`.

chaunceyjiang avatar Apr 30 '25 14:04 chaunceyjiang

@csy1204 @cyr1x @chaunceyjiang That's not the issue.

calycekr avatar May 02 '25 01:05 calycekr

Seems like `<think>` is not a special token in the vocab.

gaocegege avatar May 02 '25 03:05 gaocegege

FWIW, I'm having the exact same issue, with same vllm start params. Issue does not seem present in other reasoning models, though, so I wonder if it's specific to Qwen3 FP8 variants.

rdodev avatar May 04 '25 17:05 rdodev

https://huggingface.co/Qwen/Qwen3-8B-FP8/blob/main/tokenizer_config.json#L197

It has the special token in the tokenizer config json. Weird.

gaocegege avatar May 07 '25 01:05 gaocegege

I think it is caused by the model rather than our implementation. We get the generated text token by token, and we expect the whole `<think>` in one generation step instead of a single character from the model.

gaocegege avatar May 07 '25 01:05 gaocegege

@gaocegege That's strange. I checked the model's configuration values as well, but I couldn't find anything unusual. I'm curious about how this issue occurred.

calycekr avatar May 07 '25 02:05 calycekr

Indeed. I also want to know. Interesting.

gaocegege avatar May 07 '25 02:05 gaocegege

Maybe we could give it a try with huggingface transformers first.
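
One quick check with Hugging Face Transformers (a sketch; it only inspects the tokenizer and assumes the Hub repo is reachable) is to see whether the tags map to a single token id:

```python
# Check whether <think> / </think> are single tokens in the Qwen3-FP8 tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-FP8")

for s in ("<think>", "</think>"):
    ids = tok.encode(s, add_special_tokens=False)
    # A single id means the tag is an atomic token in the vocab;
    # multiple ids mean it is tokenized as ordinary text.
    print(s, ids, tok.convert_ids_to_tokens(ids))
```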

gaocegege avatar May 07 '25 02:05 gaocegege

@gaocegege Turning on the slow tokenizer option (`--tokenizer-mode slow`) resolved some of the token issues, but `trim()` was still necessary. https://github.com/huggingface/chat-ui/issues/1807#issuecomment-2841759886

In my opinion, there seems to be an issue with the detokenizer or related logic.
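
For anyone hitting this before a fix lands, a possible client-side workaround (a sketch only, not part of vLLM or chat-ui) is to accumulate the streamed deltas and split on the tags once the full text is available:

```python
# Client-side workaround sketch: accumulate streamed deltas so the
# <think>...</think> block can be separated reliably even if the tags were
# broken across chunk boundaries.
import re

def split_reasoning(deltas):
    """Join streamed deltas, then separate reasoning from the final answer."""
    text = "".join(deltas)
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = text[match.end():].strip()
        return reasoning, answer
    return "", text.strip()

# Tags split across chunks exactly as reported in this issue.
chunks = ["<th", "i", "nk>\n", "Okay", "\n<", "/", "t", "hin", "k", ">\n", "final answer"]
print(split_reasoning(chunks))  # ('Okay', 'final answer')
```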

calycekr avatar May 07 '25 03:05 calycekr

@gaocegege I tested with v0.8.5.post1 and it seems like the issue has been resolved, but there's no related information in the update logs... Why is that?

calycekr avatar May 07 '25 04:05 calycekr

@chaunceyjiang Do you have any idea about it?

gaocegege avatar May 07 '25 07:05 gaocegege

Hey @calycekr since you closed as complete: is this confirmed fixed? If so can you point to the commit that fixed it? TIA.

rdodev avatar May 09 '25 11:05 rdodev

@rdodev Yes, the issue is confirmed fixed in v0.8.5.post1. I'm not sure which exact commit resolved it, as it doesn't appear in the release notes. It may have been fixed indirectly through dependency updates.

Let me know if you'd like me to look into it further!

calycekr avatar May 12 '25 05:05 calycekr

The root cause of the actual chunk sequence observed above (Chunk 1: `<th`, Chunk 2: `i`, Chunk 3: `nk>\n`, Chunk 4: `Okay`, ...) is that vLLM truncates the text of every chunk when checking whether it contains a stop string. Here is the relevant source code.

`vllm/sampling_params.py`:

```python
class SamplingParams(...):
    """Sampling parameters for text generation.

    stop: list of strings that stop the generation when they are generated.
        The returned output will not contain the stop strings.
    stop_token_ids: list of tokens that stop the generation when they are
        generated. The returned output will contain the stop tokens unless
        the stop tokens are special tokens.
    """

    # The below fields are not supposed to be used as an input.
    # They are set in post_init.
    output_text_buffer_length: int = 0

    ...

    # Number of characters to hold back for stop string evaluation
    # until sequence is finished.
    if self.stop and not self.include_stop_str_in_output:
        self.output_text_buffer_length = max(len(s) for s in self.stop) - 1
```

`/opt/miniconda/lib/python3.12/site-packages/vllm/entrypoints/openai/protocol.py`:

```python
class ChatCompletionRequest(OpenAIBaseModel):
    ...
    stop: Optional[Union[str, list[str]]] = []
```

The `stop` field you pass in the request determines `output_text_buffer_length` in `SamplingParams`.

`vllm/outputs.py`:

```python
class RequestOutput:
    ...
    @classmethod
    def from_seq_group(...):
        ...
        text_buffer_length = sampling_params.output_text_buffer_length
        delta = sampling_params.output_kind == RequestOutputKind.DELTA

        outputs = []
        include_prompt = True
        # num_cached_tokens should be the same for all the sequences
        num_cached_tokens = None
        for i, seq in enumerate(top_n_seqs):
            output_text = seq.get_output_text_to_return(
                text_buffer_length, delta)
```

`vllm/sequence.py`:

```python
class Sequence:
    def get_output_text_to_return(self, buffer_length: int,
                                  delta: bool) -> str:
        """If delta is True, only new text since the last call to this
        method is returned."""

        # We return the full output text if the sequence is finished.
        truncate = buffer_length and not self.is_finished()
        if not delta:
            return self.output_text[:-buffer_length] if truncate else (
                self.output_text)
        length = len(self.output_text)
        if truncate:
            length -= buffer_length
        last_offset = self._last_output_text_offset
        if last_offset < length:
            self._last_output_text_offset = length
            return self.output_text[last_offset:length]
        return ""
```

A non-zero `output_text_buffer_length` results in text being buffered, i.e. truncated from every streamed chunk, because vLLM has to hold back enough characters to strip a stop string from the output when `include_stop_str_in_output` is not set.
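
To illustrate the effect, here is a standalone sketch of the holdback logic described above. These are hypothetical classes, not the real vLLM code, and the choice of `stop=["</think>"]` is only an assumption for the sake of the example:

```python
# Sketch: with stop=["</think>"], output_text_buffer_length = len("</think>") - 1 = 7,
# so up to 7 trailing characters are withheld from every delta until the
# sequence finishes, which is enough to split the tags across chunks.
class FakeSequence:
    def __init__(self) -> None:
        self.output_text = ""
        self._last_offset = 0

    def append(self, piece: str) -> None:
        self.output_text += piece

    def delta(self, buffer_length: int, finished: bool) -> str:
        # Mirrors the delta branch of Sequence.get_output_text_to_return.
        length = len(self.output_text)
        if buffer_length and not finished:
            length -= buffer_length
        if self._last_offset < length:
            start, self._last_offset = self._last_offset, length
            return self.output_text[start:length]
        return ""


seq = FakeSequence()
buffer_length = len("</think>") - 1  # 7 characters held back per chunk

for piece in ["<think>", "\n", "Okay", "\n", "</think>", "\nfinal answer"]:
    seq.append(piece)
    print(repr(seq.delta(buffer_length, finished=False)))
print(repr(seq.delta(buffer_length, finished=True)))  # flush the held-back tail
```

Running this prints deltas such as `'<'`, `'thin'`, `'k'`, `'>\nOkay\n<'`, `'/think>\nfinal'`, reproducing the kind of splitting reported above even though the model emitted `<think>` and `</think>` as whole pieces.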

Quick fix: use `stop_token_ids` instead of the `stop` field in `SamplingParams` or `ChatCompletionRequest` when using the reasoning or tool-call features. Stop-string checking happens after detokenization, while `stop_token_ids` checking happens before detokenization. The reasoning parser and tool-call parser also run after detokenization, so they can receive the truncated text of every chunk.
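
A sketch of applying this quick fix from the OpenAI-compatible client. The server URL and the choice of `<|im_end|>` as the stop token are assumptions for illustration; the `stop_token_ids` request field follows the `SamplingParams`/`ChatCompletionRequest` fields quoted above.

```python
# Pass stop_token_ids instead of a stop string so no text has to be buffered
# for stop-string matching before the reasoning/tool-call parsers run.
from openai import OpenAI
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-FP8")
stop_id = tok.convert_tokens_to_ids("<|im_end|>")  # look up, don't hard-code

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="Qwen/Qwen3-8B-FP8",
    messages=[{"role": "user", "content": "Explain why the sky is blue."}],
    stream=True,
    # No `stop` strings -> output_text_buffer_length stays 0, so chunks are
    # not truncated before the reasoning parser sees them.
    extra_body={"stop_token_ids": [stop_id]},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(repr(chunk.choices[0].delta.content))
```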

12lalala avatar Nov 11 '25 08:11 12lalala