
[RFC]: Pinned Caching with Automatic Prefix Caching (Related to Anthropic Prompt Caching API)

[Open] llsj14 opened this issue 5 months ago · 10 comments

Motivation.

  • When using automatic prefix caching, which manages blocks in an LRU (Least Recently Used) manner, it would be useful to add a pinned caching feature, where blocks are retained until a time-to-live (TTL) expires or a fixed expiration time is reached.
  • The Anthropic API supports prompt caching with a TTL that refreshes whenever a prompt and its corresponding blocks are reused. This is not currently possible in vLLM, since prefix caching evicts blocks solely in LRU order.
  • Adding pinned caching would give users finer control over cache retention. I am considering TTL, fixed expiration times, and manual expiration as expiration policies for pinned caching.

Proposed Change.

  • Managing pinned caching at the block level can be complex; managing it at the sequence level should suffice. Therefore, a PinnedCachingManager, placed directly under the Scheduler, will manage pinned sequences (see the sketch after this list).
  • To reduce implementation complexity, pinned caching will be supported only in GPU memory: pinned blocks will not be swapped to CPU memory, and pinning will be restricted to the prefill stage so that swapping is never required.
  • Expiration logic will include TTL (Anthropic-style), fixed-time, and manual expiration options. These will be implemented as functions taking arguments, so other expiration strategies can be added later.
  • Manual expiration will also be useful, since users may want to explicitly expire pinned sequences and their associated blocks.
  • I added a pinned caching option to the sampling parameters and reused an existing API. An open question is whether to add dedicated APIs for adding, expiring, and retrieving information about pinned sequences (see the usage sketch below).
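
To make the proposal concrete, here is a minimal sketch of how the three expiration policies could fit into a sequence-level manager. All names below (`PinnedEntry`, `PinnedCachingManager`, `pin`, `touch`, `expire`, `is_pinned`) are hypothetical illustrations, not existing vLLM internals; the draft PR may structure this differently.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PinnedEntry:
    """Bookkeeping for one pinned sequence (names are hypothetical)."""
    seq_id: int
    ttl: Optional[float] = None          # seconds since last use (Anthropic-style)
    expires_at: Optional[float] = None   # monotonic-clock deadline (fixed-time policy)
    last_used: float = field(default_factory=time.monotonic)

    def is_expired(self, now: float) -> bool:
        if self.ttl is not None and now - self.last_used > self.ttl:
            return True
        if self.expires_at is not None and now >= self.expires_at:
            return True
        return False


class PinnedCachingManager:
    """Sequence-level pin registry that the Scheduler could consult
    before the LRU evictor frees a sequence's blocks."""

    def __init__(self) -> None:
        self._entries: Dict[int, PinnedEntry] = {}

    def pin(self, seq_id: int, ttl: Optional[float] = None,
            expires_at: Optional[float] = None) -> None:
        self._entries[seq_id] = PinnedEntry(seq_id, ttl, expires_at)

    def touch(self, seq_id: int) -> None:
        """Refresh the TTL whenever the pinned prefix is reused."""
        entry = self._entries.get(seq_id)
        if entry is not None:
            entry.last_used = time.monotonic()

    def expire(self, seq_id: int) -> None:
        """Manual expiration requested explicitly by the user."""
        self._entries.pop(seq_id, None)

    def is_pinned(self, seq_id: int) -> bool:
        """Eviction guard: expired pins are dropped lazily on lookup."""
        entry = self._entries.get(seq_id)
        if entry is None:
            return False
        if entry.is_expired(time.monotonic()):
            del self._entries[seq_id]
            return False
        return True
```

The design point this sketch tries to capture is that the evictor only needs a single `is_pinned` check, so the existing LRU path stays untouched when no pins exist.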
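
And a sketch of how a request might opt in through the sampling parameters. `pinned_cache_ttl` is a placeholder for whatever field the PR actually adds; the rest uses existing vLLM entry points.

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching is an existing engine flag; the pinned-cache
# field below is hypothetical and left commented out so this runs today.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

params = SamplingParams(max_tokens=64)
# params.pinned_cache_ttl = 300.0  # e.g. pin the prompt's blocks for 5 minutes

outputs = llm.generate(["A long shared system prompt ..."], params)
```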

Feedback Period.

2 Weeks, 9/11-9/25

CC List.

cc. @alexm-neuralmagic @robertgshaw2-neuralmagic @Yard1 @cadedaniel @youkaichao

Any Other Things.

I have drafted code implementing these features in https://github.com/vllm-project/vllm/pull/8334 and hope to refine it through discussion here.

Before submitting a new issue...

  • [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

llsj14 · Sep 10, 2024