
Add option to completion API to truncate prompt tokens

Open · tdoublep opened this pull request 1 year ago

Hey vLLM team - thanks for the awesome project.

This PR adds an additional option to the OpenAI completion API that allows one to truncate the prompt of an individual request to a given number of input tokens. We find this feature extremely useful for benchmarking and performance evaluation.

Without this option, if we want precise control over the number of input tokens, we need to implement tokenization on the client side (e.g., in our load-test environment), which introduces a bunch of dependencies. In this sense, we can live without this feature, but it is super convenient to be able to do truncation on the server side.

I have tried to keep the changes to a minimum, but if there is interest I have also implemented this for the AsyncLLMEngine and LLMEngine.
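
For illustration, a client request using such an option could look roughly like this. This is a minimal sketch only: the parameter name truncate_prompt_tokens, its placement in extra_body, the server URL, and the model name are all assumptions for this example, not necessarily the final API.

# Sketch assuming a vLLM OpenAI-compatible server at localhost:8000 and a
# prompt-truncation parameter named truncate_prompt_tokens (assumed name).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="facebook/opt-125m",  # example model name
    prompt="A very long benchmark prompt ...",
    max_tokens=16,
    extra_body={"truncate_prompt_tokens": 128},  # keep at most 128 prompt tokens
)
print(response.choices[0].text)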

tdoublep avatar Mar 01 '24 14:03 tdoublep

CI errors do not look related to these changes

tdoublep avatar Mar 26 '24 15:03 tdoublep

Apologies @tdoublep - it seems I submitted a simple comment as a review. That was not my intention.

I think this is great and would be very excited to have this merged, as we have also had to implement tokenization client-side to prevent vLLM from rejecting requests that are too long.

My previous comment is about the truncation side: for various reasons/formats we'd want to be able to trim from either the left or the right, and since it is already a parameter that has to be set, it may make sense to add both together.

Thanks!

diego898 avatar Mar 29 '24 14:03 diego898

> My previous comment is about the truncation side: for various reasons/formats we'd want to be able to trim from either the left or the right, and since it is already a parameter that has to be set, it may make sense to add both together.

If this is a valid use case, maybe let's support it in this PR as well?

simon-mo avatar Mar 29 '24 16:03 simon-mo

For reference, see the HF tokenizer docs on truncation_side. This PR hard-codes it to "left".
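
For context, left-side truncation with an HF tokenizer looks roughly like this (illustrative only; the model name is arbitrary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # arbitrary example model
tokenizer.truncation_side = "left"  # drop tokens from the start, keep the end of the prompt
input_ids = tokenizer("a very long prompt ...", truncation=True, max_length=128)["input_ids"]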

diego898 avatar Mar 29 '24 16:03 diego898

@diego898 @simon-mo I made the truncation_side configurable via a new option in the ModelConfig. Please take a look and let me know what you think.

tdoublep avatar Apr 03 '24 10:04 tdoublep

> My previous comment is about the truncation side: for various reasons/formats we'd want to be able to trim from either the left or the right, and since it is already a parameter that has to be set, it may make sense to add both together.

@diego898 @simon-mo could you give an example of a case where it would make sense to truncate on the right for autoregressive text generation? I can't think of one but could be a failure of imagination.

I don't think the fact that HF tokenizers support this is in itself a good reason to support it here; they are much more general and are used in many other contexts, such as data prep and training, and with different kinds of models.

njhill avatar Apr 03 '24 14:04 njhill

@njhill - great point. I guess I was thinking it would depend on how someone structures their context window. For example, for RAG:

|--[----chat history----]-[System prompt]--[----------------Context docs------------------]-[current question]-|

You may only want to trim from the left, but if instead you did:

|[System prompt]--[----------------Context docs------------------]-[current question]---[----chat history----]-|

You may want to trim from the right? Or

|[System prompt]-[----chat history----]---[current question]--[----------------Context docs------------------]|

again from the right?

I'm not really suggesting any of these layouts are better/worse. Several wouldn't even make sense... Just relaying what I was thinking when suggesting to make it configurable.

But, if vLLM decides left-side truncation is the norm/default, that would also be fine and end-users can structure their context windows accordingly!
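
For concreteness, the two sides differ only in which end of the token sequence is kept (a minimal sketch, independent of how vLLM actually implements it):

prompt_ids = list(range(10))  # stand-in for a tokenized prompt
max_len = 4

left_truncated = prompt_ids[-max_len:]   # drop from the left, keep the most recent tokens
right_truncated = prompt_ids[:max_len]   # drop from the right, keep the beginning

print(left_truncated)   # [6, 7, 8, 9]
print(right_truncated)  # [0, 1, 2, 3]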

diego898 avatar Apr 03 '24 16:04 diego898

Thanks @diego898. TBH I don't think it would make sense to use this truncate option at all in conjunction with a system prompt. Some other form of truncation would need to be used, e.g. when applying the chat template, to retain the system prompt and exclude the beginning of the user prompt or conversation instead.
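
A rough sketch of that kind of message-level truncation (a hypothetical helper, not part of this PR; count_tokens is a user-supplied function):

def truncate_history(messages, budget, count_tokens):
    # Keep system messages, and drop the oldest user/assistant messages until the
    # total token count fits the budget (or only system messages remain).
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(count_tokens(m["content"]) for m in system + rest) > budget:
        rest.pop(0)  # oldest non-system message goes first
    return system + rest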

And I don't think that right-truncation would be sensible either for any of the examples you gave. In a chat the last thing you would want to exclude is the most recent messages, and if you truncated in the middle of some arbitrary context docs, the LM would just attempt to continue writing that particular doc.

I would still vote for keeping this left-truncated only. If a concrete need arose in future it would still be easy to add the option to configure the side.

njhill avatar Apr 04 '24 02:04 njhill

@simon-mo Do you agree with @njhill that it makes sense to hard-code truncation_side to left? If so, I will revert this branch to the earlier commit.

tdoublep avatar Apr 04 '24 13:04 tdoublep

I apologize for the confusion my request caused! I 100% agree with you @njhill - that is in fact what we do - truncate the chat history and not either side.

I stretched my brain to try and describe a situation where right truncation may make sense, but didn't convince even myself!

I confess, my initial request was based solely on the fact that HF has it configurable.

I apologize @tdoublep for the extra work this caused you!

diego898 avatar Apr 04 '24 19:04 diego898

I trust @njhill to decide and merge.

simon-mo avatar Apr 04 '24 22:04 simon-mo

@njhill I've reverted back to the version with the truncation side fixed to left, and resolved some minor conflicts with changes on the main branch. I think it is ready to merge.

tdoublep avatar Apr 05 '24 08:04 tdoublep

IDK why ruff is failing - the formatting checks are passing for me locally:

$ bash format.sh 
vLLM yapf: Done
vLLM codespell: Done
vLLM ruff:
vLLM isort: Done

tdoublep avatar Apr 05 '24 08:04 tdoublep

Why not also add it for the Chat Completion API?

satpalsr avatar Apr 09 '24 04:04 satpalsr