[Core] Move detokenization to front-end process
This PR takes detokenization off the critical model path, moving it into the per-request API server logic when the OpenAI serving front-end is used. This should benefit throughput in general, and should have an even bigger impact now that the front-end runs in a separate process, since detokenization will no longer interfere with the model loop via the GIL.
Modifications
- Abstracted the existing incremental decoding logic in `detokenizer.py` to decouple it from the `Sequence` type that's internal to the scheduler.
- An `IncrementalDetokenizer` class encapsulates the state required for incremental detokenization and is still used in the engine by default (`sampling_params.detokenize = True`).
- OpenAI server code modified to pass `detokenize=False` on all requests and to detokenize the request outputs itself:
  - `IncrementalDecoder` is used in the streaming cases.
  - In the non-streaming cases, regular `tokenizer.decode()` is used, which should be much faster since it's typically a single call into the Rust fast-tokenizer implementation. (A rough sketch of the two decode paths follows below.)
- Note this doesn't currently affect/improve the `LLM`-based usage, so it won't show up in benchmarks that use `LLM.generate`.
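For illustration only, here is a minimal sketch of the two decode paths described above. This is not the PR's actual `IncrementalDetokenizer`/`IncrementalDecoder` code; it uses a Hugging Face tokenizer as a stand-in to show why streaming needs per-request state while the non-streaming path can be a single `tokenizer.decode()` call.

```python
# Rough sketch only -- not the PR's actual IncrementalDetokenizer API.
from transformers import AutoTokenizer


class SimpleIncrementalDetokenizer:
    """Keeps just enough state to emit only the newly decoded text per step."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.token_ids: list[int] = []
        self.prefix_text = ""  # text already emitted to the client

    def step(self, new_token_id: int) -> str:
        self.token_ids.append(new_token_id)
        # Re-decode the full sequence; the real implementation tracks prefix
        # offsets to avoid O(n^2) work and to handle multi-token characters.
        full_text = self.tokenizer.decode(self.token_ids, skip_special_tokens=True)
        delta = full_text[len(self.prefix_text):]
        self.prefix_text = full_text
        return delta


tok = AutoTokenizer.from_pretrained("gpt2")

# Streaming: decode incrementally as token ids arrive from the engine.
detok = SimpleIncrementalDetokenizer(tok)
for tid in tok.encode("Hello world, this is a test"):
    print(detok.step(tid), end="", flush=True)
print()

# Non-streaming: one decode() call over all generated ids is sufficient.
all_ids = tok.encode("Hello world, this is a test")
print(tok.decode(all_ids, skip_special_tokens=True))
```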
Still todo:
- Requests with stop strings: detokenization still happens in the engine for these for now. The stop-string evaluation can also be moved to the front-end; that can be handled in a follow-on PR (see the sketch after this list).
- Minor adjustments will be needed to ensure that this works in conjunction with #7381 (depending on which is merged first)
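As a hedged sketch of that follow-on idea only: the front-end would scan the detokenized text for stop strings and abort the request itself. The helper and loop shape below are hypothetical, not existing vLLM OpenAI-server internals.

```python
# Hypothetical sketch of front-end stop-string handling (follow-on PR idea).
from typing import List, Optional


def truncate_at_stop(text: str, stop_strings: List[str]) -> Optional[str]:
    """Return text truncated at the earliest stop string, or None if no stop was hit."""
    earliest = None
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1 and (earliest is None or idx < earliest):
            earliest = idx
    return text[:earliest] if earliest is not None else None


# Per-request streaming loop in the front-end (schematic, names hypothetical):
#
#   text_so_far += incremental_detokenizer.step(new_token_id)
#   truncated = truncate_at_stop(text_so_far, request.stop)
#   if truncated is not None:
#       abort_request(request_id)  # an extra generated token is harmless
#       return truncated
```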
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment `/ready` on the PR
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
QQ: How does this work with #6913?
This change is independent of / complementary to #6913. This one is minimally invasive: the only change in the engine is to adjust the place where detokenization is already done to use the abstracted/shared logic. Otherwise, all it's doing is changing the OpenAI front-end to pass the already-existing `detokenize=False` option to tell the engine not to detokenize.
The front-end (in a separate process) does the detokenization instead as it gets the tokens back. Stop-string evaluation can similarly be handled this way: the front-end can just abort the request (and it doesn't really matter if the engine has already generated an extra token).
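To make the shape of this concrete, here is an illustrative fragment. The engine-client side is shown schematically in comments; `engine_client` and `get_tokenizer` are stand-ins, not the exact vLLM OpenAI-server internals.

```python
# Illustrative only: the front-end passes the existing detokenize=False option
# and decodes the returned token ids itself, off the model's critical path.
from vllm import SamplingParams

params = SamplingParams(max_tokens=64, detokenize=False)  # engine returns token ids only

# tokenizer = get_tokenizer(model_name)                       # stand-in name
# async for output in engine_client.generate(prompt, params, request_id):
#     ids = output.outputs[0].token_ids                       # no .text from the engine
#     text = tokenizer.decode(ids, skip_special_tokens=True)  # done in the front-end process
```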
I think #6913 should still be beneficial for removing all the other output processing from the critical model loop.
@WoosukKwon basically this just avoids the engine handling any text at all, so that it only deals in token ids, and we handle the text part outside.
Thanks for the great work @njhill!
Is this still being actively worked on?
@vrdn-23 this became less important after the "async output processing" changes, since detok is now overlapped with the model forward pass in the engine. However we're doing something very similar in the "v1" re-architecture currently underway.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @njhill.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Now obsolete - all of these ideas have been incorporated into vLLM V1.