[Core] Move detokenization to front-end process
This PR takes detokenization off the critical model path, moving it into the per-request API server logic when the OpenAI serving front-end is used. This should benefit throughput in general, and should have an even bigger impact now that the front-end runs in a separate process, since detokenization will no longer interfere with the model loop via the GIL.
Modifications
- Abstracted the existing incremental decoding logic in `detokenizer.py` to decouple it from the `Sequence` type that's internal to the scheduler.
- An `IncrementalDetokenizer` class encapsulates the state required for incremental detokenization and is still used in the engine by default (`sampling_params.detokenize = True`).
- OpenAI server code modified to pass `detokenize=False` on all requests and to detokenize the request outputs itself:
  - `IncrementalDecoder` is used in the streaming cases.
  - In the non-streaming cases, regular `tokenizer.decode()` is used, which should be much faster since it's typically a single call into the Rust fast-tokenizer implementation. (A rough sketch of the two decode paths follows below.)
- Note this doesn't currently affect/improve the `LLM`-based usage, so it won't show up in benchmarks that use `LLM.generate`.
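For illustration only, here is a minimal sketch of the two decode paths described above. This is not the PR's actual `IncrementalDetokenizer`/`IncrementalDecoder` code; it uses a Hugging Face tokenizer as a stand-in to show why streaming needs per-request state while the non-streaming path can be a single `tokenizer.decode()` call.

```python
# Rough sketch only -- not the PR's actual IncrementalDetokenizer API.
from transformers import AutoTokenizer


class SimpleIncrementalDetokenizer:
    """Keeps just enough state to emit only the newly decoded text per step."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.token_ids: list[int] = []
        self.prefix_text = ""  # text already emitted to the client

    def step(self, new_token_id: int) -> str:
        self.token_ids.append(new_token_id)
        # Re-decode the full sequence; the real implementation tracks prefix
        # offsets to avoid O(n^2) work and to handle multi-token characters.
        full_text = self.tokenizer.decode(self.token_ids, skip_special_tokens=True)
        delta = full_text[len(self.prefix_text):]
        self.prefix_text = full_text
        return delta


tok = AutoTokenizer.from_pretrained("gpt2")

# Streaming: decode incrementally as token ids arrive from the engine.
detok = SimpleIncrementalDetokenizer(tok)
for tid in tok.encode("Hello world, this is a test"):
    print(detok.step(tid), end="", flush=True)
print()

# Non-streaming: one decode() call over all generated ids is sufficient.
all_ids = tok.encode("Hello world, this is a test")
print(tok.decode(all_ids, skip_special_tokens=True))
```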
Still todo:
- Requests with stop strings: detokenization still happens in the engine for these for now. The stop-string evaluation can also be moved to the front-end; that can be handled in a follow-on PR (see the sketch after this list).
- Minor adjustments will be needed to ensure that this works in conjunction with #7381 (depending on which is merged first)
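As a hedged sketch of that follow-on idea only: the front-end would scan the detokenized text for stop strings and abort the request itself. The helper and loop shape below are hypothetical, not existing vLLM OpenAI-server internals.

```python
# Hypothetical sketch of front-end stop-string handling (follow-on PR idea).
from typing import List, Optional


def truncate_at_stop(text: str, stop_strings: List[str]) -> Optional[str]:
    """Return text truncated at the earliest stop string, or None if no stop was hit."""
    earliest = None
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1 and (earliest is None or idx < earliest):
            earliest = idx
    return text[:earliest] if earliest is not None else None


# Per-request streaming loop in the front-end (schematic, names hypothetical):
#
#   text_so_far += incremental_detokenizer.step(new_token_id)
#   truncated = truncate_at_stop(text_so_far, request.stop)
#   if truncated is not None:
#       abort_request(request_id)  # an extra generated token is harmless
#       return truncated
```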
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment `/ready` on the PR
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
QQ: How does this work with #6913?
This change is independent of / complementary to #6913. This one is minimally invasive: the only change in the engine is to adjust the place where detokenization is already done to use the abstracted/shared logic. Otherwise, all it's doing is changing the OpenAI front-end to pass the already-existing `detokenize=False` option to tell the engine not to detokenize.
The front-end (in a separate process) does the detokenization instead as it gets the tokens back. Stop-string evaluation can similarly be handled this way: the front-end can just abort the request (and it doesn't really matter if the engine has already generated an extra token).
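To make the shape of this concrete, here is an illustrative fragment. The engine-client side is shown schematically in comments; `engine_client` and `get_tokenizer` are stand-ins, not the exact vLLM OpenAI-server internals.

```python
# Illustrative only: the front-end passes the existing detokenize=False option
# and decodes the returned token ids itself, off the model's critical path.
from vllm import SamplingParams

params = SamplingParams(max_tokens=64, detokenize=False)  # engine returns token ids only

# tokenizer = get_tokenizer(model_name)                       # stand-in name
# async for output in engine_client.generate(prompt, params, request_id):
#     ids = output.outputs[0].token_ids                       # no .text from the engine
#     text = tokenizer.decode(ids, skip_special_tokens=True)  # done in the front-end process
```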
I think #6913 should still be beneficial for removing all the other output processing from the critical model loop.
@WoosukKwon basically this just avoids the engine handling any text at all, so that it only deals in token ids, and we handle the text part outside.
Thanks for the great work @njhill!
Is this still being actively worked on?
@vrdn-23 this became less important after the "async output processing" changes, since detok is now overlapped with the model forward pass in the engine. However we're doing something very similar in the "v1" re-architecture currently underway.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @njhill.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Now obsolete - all of these ideas have been incorporated into vLLM V1.