Improved prefix caching for paged attention
Summary
- fix
LogicalTokenBlock::pop_tokenso block size stays constant - re-enable paged prefix caching now that it works
Testing
-
cargo test -p mistralrs-core --no-run(fails: extern location for darling_core does not exist)
https://chatgpt.com/codex/tasks/task_e_6841cd341b30832290902dc473c3c3f4
Summary by CodeRabbit
-
Bug Fixes
- Improved internal token management to ensure more reliable token removal and memory handling.
-
New Features
- Enabled functional caching for sequences backed by the block engine, ensuring proper cache creation and reference count management.
- Enhanced block caching and matching logic for better handling of token blocks and offsets in paged-attention sequences.
[!IMPORTANT]
Review skipped
Draft detected.
Please check the settings in the CodeRabbit UI or the
.coderabbit.yamlfile in this repository. To trigger a single review, invoke the@coderabbitai reviewcommand.You can disable this status message by setting the
reviews.review_statustofalsein the CodeRabbit configuration file.
"""
Walkthrough
Internal logic for the pop_token method in the LogicalTokenBlock struct was changed to clear tokens without shrinking the tokens vector in two modules. Additionally, the add_sequence method in PrefixCacheManagerV2 was updated to fully implement block-engine-backed caching, replacing previously commented-out code with active logic for cache creation, reference counting, and improved handling of logical token blocks and token offsets.
Changes
| File(s) | Change Summary |
|---|---|
| mistralrs-core/src/dummy_paged_attention/block_engine.rs, mistralrs-core/src/paged_attention/block_engine.rs |
Modified LogicalTokenBlock::pop_token to clear the last token by setting it to zero instead of removing it from the vector. |
| mistralrs-core/src/prefix_cacher.rs | Activated and refined block caching logic in PrefixCacheManagerV2::add_sequence; enhanced search_for_matching_cache to manage logical blocks and token offsets for paged-attention caches. |
Sequence Diagram(s)
sequenceDiagram
participant User
participant PrefixCacheManagerV2
participant BlockEngine
participant BlockCaches
User->>PrefixCacheManagerV2: add_sequence(sequence)
PrefixCacheManagerV2->>BlockEngine: get_mut_arcmutex()
PrefixCacheManagerV2->>BlockEngine: access block table for sequence_id
BlockEngine-->>PrefixCacheManagerV2: block_table or panic if not found
PrefixCacheManagerV2->>PrefixCacheManagerV2: hash logical token blocks
PrefixCacheManagerV2->>BlockCaches: check if hash exists
alt Not cached
PrefixCacheManagerV2->>BlockEngine: increment ref count for each block
PrefixCacheManagerV2->>BlockCaches: insert new BlockCacheElement
end
Possibly related PRs
-
EricLBuehler/mistral.rs#1359: Refactors
PrefixCacheManagerV2cache data structures and search approach, affecting the same module and cache management logic as this PR.
Poem
A hop and a skip through the cache we go,
Clearing tokens, not shrinking the row.
Blocks are counted, caches are made,
No more commented code in the shade!
With every pop, the logic stays neat—
A rabbit’s work is always discrete.
🐇✨ """
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
🪧 Tips
Chat
There are 3 ways to chat with CodeRabbit:
- Review comments: Directly reply to a review comment made by CodeRabbit. Example:
-
I pushed a fix in commit <commit_id>, please review it. -
Explain this complex logic. -
Open a follow-up GitHub issue for this discussion.
-
- Files and specific lines of code (under the "Files changed" tab): Tag
@coderabbitaiin a new review comment at the desired location with your query. Examples:-
@coderabbitai explain this code block. -
@coderabbitai modularize this function.
-
- PR comments: Tag
@coderabbitaiin a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:-
@coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase. -
@coderabbitai read src/utils.ts and explain its main purpose. -
@coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format. -
@coderabbitai help me debug CodeRabbit configuration file.
-
Support
Need help? Create a ticket on our support page for assistance with any issues or questions.
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.
CodeRabbit Commands (Invoked using PR comments)
-
@coderabbitai pauseto pause the reviews on a PR. -
@coderabbitai resumeto resume the paused reviews. -
@coderabbitai reviewto trigger an incremental review. This is useful when automatic reviews are disabled for the repository. -
@coderabbitai full reviewto do a full review from scratch and review all the files again. -
@coderabbitai summaryto regenerate the summary of the PR. -
@coderabbitai generate docstringsto generate docstrings for this PR. -
@coderabbitai generate sequence diagramto generate a sequence diagram of the changes in this PR. -
@coderabbitai resolveresolve all the CodeRabbit review comments. -
@coderabbitai configurationto show the current CodeRabbit configuration for the repository. -
@coderabbitai helpto get help.
Other keywords and placeholders
- Add
@coderabbitai ignoreanywhere in the PR description to prevent this PR from being reviewed. - Add
@coderabbitai summaryto generate the high-level summary at a specific location in the PR description. - Add
@coderabbitaianywhere in the PR title to generate the title automatically.
CodeRabbit Configuration File (.coderabbit.yaml)
- You can programmatically configure CodeRabbit by adding a
.coderabbit.yamlfile to the root of your repository. - Please see the configuration documentation for more information.
- If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation:
# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
Documentation and Community
- Visit our Documentation for detailed information on how to use CodeRabbit.
- Join our Discord Community to get help, request features, and share feedback.
- Follow us on X/Twitter for updates and announcements.
Code Metrics Report
=============================================================================== Language Files Lines Code Comments Blanks =============================================================================== C Header 3 62 53 0 9 CSS 1 473 408 14 51 Dockerfile 1 42 23 10 9 HTML 1 73 61 4 8 JavaScript 7 1248 936 174 138 JSON 14 123 122 0 1 Makefile 1 6 5 0 1 Python 87 4097 3457 161 479 Shell 1 63 26 18 19 Plain Text 3 3723 0 2413 1310 TOML 21 695 634 10 51 YAML 2 21 19 2 0 ------------------------------------------------------------------------------- Jupyter Notebooks 3 0 0 0 0 |- Markdown 2 77 32 31 14 |- Python 2 205 178 1 26 (Total) 282 210 32 40 ------------------------------------------------------------------------------- Markdown 60 5211 0 3984 1227 |- BASH 11 123 117 2 4 |- JSON 2 42 42 0 0 |- Python 7 121 109 0 12 |- Rust 22 757 634 1 122 |- TOML 2 75 63 0 12 (Total) 6329 965 3987 1377 ------------------------------------------------------------------------------- Rust 376 132481 117880 2913 11688 |- Markdown 175 3002 29 2662 311 (Total) 135483 117909 5575 11999 =============================================================================== Total 581 148318 123624 9703 14991 ===============================================================================
@EricLBuehler - i was hoping this might be the silver bullet to the OOMs i've been seeing but unfortunately i was able to reproduce without paged attention at all yesterday. That said, having our own MMU approximation with ref counts and awareness of utilization should serve to prevent us from calling for memory that isn't there in many cases (asking for a 16g alloc when we've used up 24 on a 32g card will not fly). Any particular blockers on implementation or did the logic not pan out as you'd hoped?