
[Bug]: Multiple inconsistencies wrt BOS injection and BOS duplication


Your current environment

0.6.3.post1

🐛 4 generation scenarios

There are at least 4 generation use cases in vLLM:

  1. offline generate
  2. offline chat
  3. online completion (similar to 1 but online and a totally different implementation)
  4. online chat completion (similar to 2 but online and a totally different implementation)

It's up to the user whether they want to handle the chat template themselves and then use (1) or (3) or let vllm do the chat template handling and then it's (2) or (4).
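For concreteness, here is a minimal sketch of all four paths side by side (the model name and server URL are placeholders, and the online calls assume a vLLM OpenAI-compatible server is already running):

    from openai import OpenAI
    from vllm import LLM, SamplingParams

    MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
    BASE_URL = "http://localhost:8000/v1"          # placeholder vLLM server endpoint

    params = SamplingParams(max_tokens=32)
    messages = [{"role": "user", "content": "Today is"}]

    # (1) offline generate - the caller handles any chat templating themselves
    llm = LLM(model=MODEL)
    out1 = llm.generate(["Today is"], params)

    # (2) offline chat - vLLM applies the chat template
    out2 = llm.chat(messages, params)

    # (3) online completion - same idea as (1), via the OpenAI-compatible server
    client = OpenAI(base_url=BASE_URL, api_key="EMPTY")
    out3 = client.completions.create(model=MODEL, prompt="Today is", max_tokens=32)

    # (4) online chat completion - same idea as (2), the server applies the chat template
    out4 = client.chat.completions.create(model=MODEL, messages=messages, max_tokens=32)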

Summary of BOS injection/duplication

I have traced all 4 APIs wrt BOS injection and here is what I see (0.6.3.post1):

  1. offline generate - the user sorts out the chat template - BOS is always forced - so 2 BOS tokens end up in the input if the prompt already has one - the user therefore has to send a prompt w/o <|begin_of_text|>
  2. offline chat - BOS is still always forced - so 2 BOS tokens end up in the input if the template already adds one - this is a BUG and can't be worked around by the user, other than by passing a custom chat template with <|begin_of_text|> manually removed.
  3. client.completions - the user sorts out the chat template - BOS is always forced - so 2 BOS tokens end up in the input if the prompt already has one - the user therefore has to send a prompt w/o <|begin_of_text|>
  4. client.chat.completions - the chat template is applied on the server side: here BOS isn't added twice - if the template contains <|begin_of_text|> it is encoded properly - ending up with a single BOS

Expectations and bugs

So for (1) and (3) one could say it's the user's responsibility to strip any BOS tokens from the prompt, since a normal prompt is expected here (normal == pure text w/o any special tokens, as in "Today is").

(2) is clearly a bug and it's inconsistent with (4). With meta-llama/Meta-Llama-3-8B-Instruct you would see this logged with (2): {'prompt_token_ids': [128000, 128000, 128006, 9125, 128007, ...], ...}, where 128000 is the BOS token.
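The duplication in (2) is easy to see outside of vLLM as well. A small sketch using the HF tokenizer directly (assuming access to the gated Llama-3 checkpoint), mimicking what path (2) effectively does - applying the chat template to a string and then encoding it with special tokens added:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    messages = [{"role": "user", "content": "Today is"}]

    # The chat template already emits <|begin_of_text|> at the start of the string.
    templated = tok.apply_chat_template(messages, tokenize=False)

    # Re-encoding that string with add_special_tokens=True prepends a second BOS.
    print(tok(templated, add_special_tokens=True).input_ids[:3])   # expected: [128000, 128000, ...]
    print(tok(templated, add_special_tokens=False).input_ids[:3])  # expected: [128000, ...]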

(4) used to have this problem but has been fixed in https://github.com/vllm-project/vllm/pull/4688

Analysis process

The online API already logs the token ids it's about to feed to the model, so that was easy. The offline API doesn't, so I had to add:

diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py
index 61c21887..71b82baf 100644
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -808,6 +808,9 @@ class LLMEngine:
             lora_request=lora_request,
             prompt_adapter_request=prompt_adapter_request,
         )
+
+        print(preprocessed_inputs)
+
         processed_inputs = self.input_processor(preprocessed_inputs)

         # This is a bit of a hack - copy the mm_processor_kwargs that were

Request: is it possible to codify the above diff, so that the user can debug the offline scenario the same way the online scenario is currently logged:

INFO 10-18 17:53:34 logger.py:37] Received request cmpl-da6d014eca6e48acb5abb2d2cae39182-0: 
prompt: '<|begin_of_text|><|start_header_id|>..
prompt_token_ids: [128000, 128000, 128006, 9125, 128007...
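Until something like the diff above is codified, one workaround (a sketch, not an official debugging facility) is to read the token ids back off the RequestOutput objects that LLM.generate()/LLM.chat() return:

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model
    outputs = llm.generate(["<|begin_of_text|>Today is"], SamplingParams(max_tokens=8))

    for out in outputs:
        # prompt_token_ids reflects what was actually fed to the model, so a
        # duplicated BOS shows up as two leading 128000s.
        print(out.prompt, out.prompt_token_ids[:5])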

Needed documentation

Wrt (1) and (3), I'd imagine vLLM should clearly document when it forcibly adds BOS. Ideally the prompt documentation should state that the prompt must not include tokenizer.bos_token (e.g. <|begin_of_text|> in many tokenizers).
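As an interim measure, a small defensive sketch for (1) and (3): strip a leading BOS string from the prompt before handing it to vLLM (strip_leading_bos is a hypothetical helper, not a vLLM API):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder

    def strip_leading_bos(prompt: str) -> str:
        # tok.bos_token is e.g. "<|begin_of_text|>" for Llama-3 tokenizers
        if tok.bos_token and prompt.startswith(tok.bos_token):
            return prompt[len(tok.bos_token):]
        return prompt

    print(strip_leading_bos("<|begin_of_text|>Today is"))  # -> "Today is"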

Reproduction

To reproduce I was just using your examples:

  1. https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference.html
  2. https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference_chat.html etc., but I prepended <|begin_of_text|> to the existing prompt in the non-chat examples to test.

Thank you!

stas00 avatar Oct 18 '24 21:10 stas00

Thanks again @stas00 for this detailed analysis! You are clearly correct re the (2) bug.

I will add that there is a per-request add_special_tokens parameter that can be used with both (3) and (4) which will control whether the BOS token is added. As you found, this defaults to False for the chat case (since it's included in the templates) and True for the non-chat (completion) case. But you can override it in either case if needed.

Also worth mentioning that whether a BOS token is added automatically depends on the particular model (tokenizer). But it's the case for e.g. llama and mistral tokenizers.

njhill avatar Oct 18 '24 22:10 njhill

I will add that there is a per-request add_special_tokens parameter that can be used with both (3) and (4) which will control whether the BOS token is added

I was actually already trying to test it out, but the OpenAI client's client.chat.completions and client.completions don't have this kwarg - where/how do you pass add_special_tokens? I saw that you extended the OpenAI API, but I didn't find documentation on how to use it with the OpenAI client.

stas00 avatar Oct 18 '24 22:10 stas00

@stas00 you can pass any of the "non-standard" API params in a dict via the extra_body kwarg of the OpenAI client SDK. Ref here.

njhill avatar Oct 18 '24 22:10 njhill

Also worth mentioning that whether a BOS token is added automatically depends on the particular model (tokenizer). But it's the case for e.g. llama and mistral tokenizers.

Once the dust settles on this Issue, let's document the various ways BOS is handled, including your note above?

stas00 avatar Oct 18 '24 22:10 stas00

Ah, thank you so much Nick, for some reason I was drawing a blank on how to use it.

Now I was able to tell vLLM not to add BOS in the online completion case (3) with:

    completion = client.completions.create(
        model=model,
        prompt=prompt,
[...]
        extra_body=dict(add_special_tokens=False),
    )

super!

stas00 avatar Oct 18 '24 22:10 stas00

Should the LLM API be consistent with the client API and offer add_special_tokens=False? I guess it'd belong in SamplingParams and not LLM.

SamplingParams already has skip_special_tokens - but that refers to the output, so adding add_special_tokens would be confusing since it'd be unclear whether each is input- or output-related (see the sketch after the list below).

It might require disambiguation with something like:

  • output_skip_special_tokens
  • input_add_special_tokens
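To illustrate the collision: today skip_special_tokens lives in SamplingParams and only affects detokenization of the output, while add_special_tokens is a per-request input-side knob on the online APIs (passed via extra_body, as above). A quick sketch of the existing output-side knob:

    from vllm import SamplingParams

    # output side: keep special tokens (e.g. <|eot_id|>) in the generated text
    params = SamplingParams(max_tokens=32, skip_special_tokens=False)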

stas00 avatar Oct 18 '24 22:10 stas00

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Jan 17 '25 01:01 github-actions[bot]

any updates on (2)?

KADCA21 avatar Jan 22 '25 12:01 KADCA21

https://github.com/vllm-project/vllm/pull/16081 is hopefully a fix for the (2) inconsistency.

njhill avatar Apr 05 '25 01:04 njhill

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Jul 04 '25 02:07 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Aug 04 '25 02:08 github-actions[bot]

This issue also impacts the trl library: https://github.com/huggingface/open-r1/issues/695

wenquanlu avatar Aug 06 '25 05:08 wenquanlu