[Core][VLM] Add support for placeholder token content hashes
This adds (currently unused) support for prefix caching of blocks that contain placeholder tokens whose embedding vectors will be replaced downstream, as is the case with multi-modal placeholders.
With this change, BlockManager v2 et al. now pass a TokenIds type around instead of List[int] to represent token ids. This new type can also contain TokenRangeAnnotations which capture the contents that will ultimately replace the placeholder tokens. These annotations are set on LLMInputs, and responsibility for computing the hash is delegated elsewhere. For multi-modal models, the multi-modal content can be hashed in the model's input processor.
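To illustrate the idea, here is a minimal sketch of how a block hash could take annotated placeholder ranges into account. It is not the code from this PR: the field names (`content_hash`, `content_offset`, `token_start`, `token_end`), the `block_hash` helper, and the use of SHA-256 are all assumptions made for illustration. Only the type names `TokenIds` and `TokenRangeAnnotation` come from the description above.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TokenRangeAnnotation:
    """Hypothetical shape: describes what will replace a run of placeholders."""
    content_hash: str    # hash of the replacing content (e.g. image bytes)
    content_offset: int  # offset into that content where this range begins
    token_start: int     # first annotated token index (inclusive)
    token_end: int       # last annotated token index (exclusive)


@dataclass(frozen=True)
class TokenIds:
    """Hypothetical shape: token ids plus annotations over placeholder ranges."""
    token_ids: Tuple[int, ...]
    annotations: Tuple[TokenRangeAnnotation, ...] = ()


def block_hash(prev_hash: Optional[str], tokens: TokenIds,
               start: int, end: int) -> str:
    """Hash the block [start, end), chained to the previous block's hash.

    Annotations are clipped to the block so that two blocks with identical
    placeholder token ids but different replacement content hash differently.
    """
    clipped = []
    for ann in tokens.annotations:
        lo = max(ann.token_start, start)
        hi = min(ann.token_end, end)
        if lo < hi:  # the annotation overlaps this block
            clipped.append((
                ann.content_hash,
                ann.content_offset + (lo - ann.token_start),
                lo - start,  # position relative to the block
                hi - start,
            ))
    payload = repr((prev_hash, tokens.token_ids[start:end], tuple(clipped)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same token ids, different (made-up) image hashes on the placeholder range.
ids = (1, 2, 3, 4, 9, 9, 9, 9)
prompt_a = TokenIds(ids, (TokenRangeAnnotation("img-a", 0, 4, 8),))
prompt_b = TokenIds(ids, (TokenRangeAnnotation("img-b", 0, 4, 8),))

text_a = block_hash(None, prompt_a, 0, 4)
text_b = block_hash(None, prompt_b, 0, 4)
image_a = block_hash(text_a, prompt_a, 4, 8)
image_b = block_hash(text_b, prompt_b, 4, 8)
```

In this sketch the text-only prefix block hashes identically for both prompts, so it can be shared, while the placeholder block only matches when the annotated content hash matches too.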
Combined with #8346 (+ some glue to map multimodal content + PlaceholderRanges to TokenRangeAnnotations) this allows multimodal models (that use the precise embedding merge added in #8346) to work with prefix caching.
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add ready label to the PR
- Enable auto-merge.
🚀
Sorry for the delay - I was busy with the Pixtral release last week but will review this PR this week!
This pull request has merge conflicts that must be resolved before it can be merged. @petersalas please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Hi, thanks for the great work! I was wondering if there’s any update on its status or an estimated timeline for its review/merge?
@cooleel We decided to work on adding prefix caching for multimodal models on V1 instead, since there are some fundamental changes to how the cache manager is designed. Stay tuned and feel free to check our multimodality roadmap at #4194!
Closing as superseded by #11187