[Bug]: Multiple inconsistencies wrt BOS injection and BOS duplication
Your current environment
0.6.3.post1
4 🐛 generation scenarios
There are at least 4 generation use cases in vLLM:
1. offline `generate`
2. offline `chat`
3. online completion (similar to 1 but online and a totally different implementation)
4. online chat completion (similar to 2 but online and a totally different implementation)
It's up to the user whether they handle the chat template themselves and use (1) or (3), or let vLLM do the chat template handling, in which case it's (2) or (4).
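For reference, a minimal sketch of what the four paths look like from the user's side (the model name, server URL and prompts are placeholders, not taken from the report):

```python
# Sketch of the four generation paths (model name, server URL and prompts are placeholders).
from openai import OpenAI
from vllm import LLM, SamplingParams

model = "meta-llama/Meta-Llama-3-8B-Instruct"
params = SamplingParams(max_tokens=32)
messages = [{"role": "user", "content": "Today is"}]

# (1) offline generate - the user supplies the final prompt string
llm = LLM(model=model)
llm.generate("Today is", params)

# (2) offline chat - vLLM applies the chat template
llm.chat(messages, params)

# (3) online completion - the user supplies the final prompt string
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
client.completions.create(model=model, prompt="Today is", max_tokens=32)

# (4) online chat completion - the server applies the chat template
client.chat.completions.create(model=model, messages=messages, max_tokens=32)
```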
Summary of BOS injection/duplication
I have traced all 4 APIs wrt BOS injection and here is what I see (0.6.3.post1):
- offline `generate` - the client sorts out the chat template - BOS is always forced - so it generates 2 BOS tokens if the prompt already has one - so the user has to send a prompt w/o `<|begin_of_text|>`
- offline `chat` - BOS is still always forced - so it generates 2 BOS tokens if the template already has one - this is a BUG and can't be overcome by the user, other than by passing a custom chat template with `<|begin_of_text|>` manually removed.
- `client.completions` - the client sorts out the chat template - BOS is always forced - so it generates 2 BOS tokens if the prompt already has one - so the user has to send a prompt w/o `<|begin_of_text|>`
- `client.chat.completions` - the chat template is applied on the server side: here the BOS isn't added twice - if the template contains `<|begin_of_text|>` it encodes it properly - ending up with a single BOS
Expectations and bugs
So for (1) and (3) one could say it's the user's responsibility to strip any BOS tokens from the prompt, since a normal prompt is expected here (normal == pure text w/o any special tokens, as in "Today is").
(2) is clearly a bug and it's inconsistent with (4). With meta-llama/Meta-Llama-3-8B-Instruct you would see this logged with (2): {'prompt_token_ids': [128000, 128000, 128006, 9125, 128007, ... where 128000 is the BOS token.
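To see where the second BOS comes from in (2) without vLLM in the loop, here is a small sketch using the HF tokenizer directly (assuming the stock Llama-3 chat template, which already emits `<|begin_of_text|>`):

```python
# Sketch: the chat template already emits <|begin_of_text|>, and encoding the
# templated string with add_special_tokens=True prepends another BOS (128000).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "user", "content": "Today is"}]

templated = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(templated.startswith("<|begin_of_text|>"))              # True - BOS is in the template
print(tok(templated, add_special_tokens=True).input_ids[:2])  # [128000, 128000] - duplicated
print(tok(templated, add_special_tokens=False).input_ids[:2]) # [128000, ...]    - single BOS
```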
(4) used to have this problem but has been fixed in https://github.com/vllm-project/vllm/pull/4688
Analysis process
The online API already logs the token ids it's about to feed to the model so that was easy. The offline API doesn't do it - so I had to add:
```diff
diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py
index 61c21887..71b82baf 100644
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -808,6 +808,9 @@ class LLMEngine:
             lora_request=lora_request,
             prompt_adapter_request=prompt_adapter_request,
         )
+
+        print(preprocessed_inputs)
+
         processed_inputs = self.input_processor(preprocessed_inputs)
 
         # This is a bit of a hack - copy the mm_processor_kwargs that were
```
Request: would it be possible to codify the above diff, so that the user could debug the offline scenario in the same way the online scenario currently logs:
```
INFO 10-18 17:53:34 logger.py:37] Received request cmpl-da6d014eca6e48acb5abb2d2cae39182-0:
prompt: '<|begin_of_text|><|start_header_id|>..
prompt_token_ids: [128000, 128000, 128006, 9125, 128007...
```
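As a stopgap, the ids can also be inspected offline without patching the engine, via the RequestOutput objects returned by LLM.generate / LLM.chat (sketch; the model name is a placeholder):

```python
# Sketch: RequestOutput.prompt_token_ids shows the ids that were fed to the model.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
outputs = llm.chat([{"role": "user", "content": "Today is"}],
                   SamplingParams(max_tokens=8))
print(outputs[0].prompt_token_ids[:4])  # starts with 128000, 128000 -> duplicated BOS
```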
Needed documentation
Wrt (1) and (3), I'd imagine vLLM should have clear documentation of when it forcefully adds BOS. Ideally the prompt doc needs to say that the prompt must not include tokenizer.bos_token (e.g. `<|begin_of_text|>` in many tokenizers).
Reproduction
To reproduce I was just using your examples:
- https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference.html
- https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference_chat.html
etc., but prepended the existing prompt with `<|begin_of_text|>` in the non-chat examples to test.
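A condensed version of that reproduction (sketch; the model name is a placeholder for whichever model the docs example uses):

```python
# Sketch: prepend <|begin_of_text|> to a plain prompt in the offline generate case.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
out = llm.generate("<|begin_of_text|>Today is", SamplingParams(max_tokens=8))
print(out[0].prompt_token_ids[:2])  # [128000, 128000] -> BOS duplicated
```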
Thank you!
Thanks again @stas00 for this detailed analysis! You are clearly correct re the (2) bug.
I will add that there is a per-request add_special_tokens parameter that can be used with both (3) and (4) which will control whether the BOS token is added. As you found, this defaults to False for the chat case (since it's included in the templates) and True for non-chat (completion) case. But you can override in either case if needed.
Also worth mentioning that whether a BOS token is added automatically depends on the particular model (tokenizer). But it's the case for e.g. llama and mistral tokenizers.
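A quick sketch of how to check a given tokenizer's behaviour (the two model names are just examples of the families mentioned above):

```python
# Sketch: does this tokenizer prepend BOS on a plain-text encode by default?
from transformers import AutoTokenizer

for name in ("meta-llama/Meta-Llama-3-8B-Instruct",
             "mistralai/Mistral-7B-Instruct-v0.3"):
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok("Today is").input_ids
    print(name, ids[0] == tok.bos_token_id)  # True for both of these families
```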
> I will add that there is a per-request add_special_tokens parameter that can be used with both (3) and (4) which will control whether the BOS token is added
I actually was trying to test it out already, but OpenAI's client.chat.completions and client.completions don't have this kwarg - where/how do you pass add_special_tokens? I saw that you extended the OpenAI API, but I didn't find documentation on how to use it with the OpenAI client.
@stas00 you can pass any of the "non-standard" API params in a dict via the extra_body kwarg of the openai client sdk. Ref here.
> Also worth mentioning that whether a BOS token is added automatically depends on the particular model (tokenizer). But it's the case for e.g. llama and mistral tokenizers.
Once the dust settles on this Issue, let's document the various ways BOS is handled, including your note above?
Ah, thank you so much, Nick; for some reason I was drawing a blank on finding how to use it.
Now I was able to tell vLLM not to add BOS in the online completion case with:
```python
completion = client.completions.create(
    model=model,
    prompt=prompt,
    [...]
    extra_body=dict(add_special_tokens=False),
)
```
super!
Should the LLM API be consistent with the client API and offer add_special_tokens=False? I guess it'd belong in SamplingParams rather than LLM.
SamplingParams already has skip_special_tokens, but that refers to the output, so it'd be confusing to add add_special_tokens since it would be unclear whether it's input- or output-related.
It might require disambiguation, e.g.:
- `output_skip_special_tokens`
- `input_add_special_tokens`
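In the meantime, one offline workaround (a sketch, not an official recommendation; the TokensPrompt import path may vary between versions) is to tokenize the prompt yourself with add_special_tokens=False and pass the ids directly, which bypasses the forced BOS:

```python
# Sketch: pre-tokenize without special tokens so vLLM doesn't prepend a second BOS.
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
tok = llm.get_tokenizer()

prompt = "<|begin_of_text|>Today is"  # already contains BOS
ids = tok(prompt, add_special_tokens=False).input_ids
out = llm.generate(TokensPrompt(prompt_token_ids=ids), SamplingParams(max_tokens=8))
print(out[0].prompt_token_ids[:2])  # a single 128000 this time
```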
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
any updates on (2)?
https://github.com/vllm-project/vllm/pull/16081 is hopefully a fix for the (2) inconsistency.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
This issue also impacts the TRL library: https://github.com/huggingface/open-r1/issues/695