[RFC]: Offload KV cache to CPU in V1

Motivation.

Offloading device KV cache to the CPU can be worthwhile when the cost of re-computation outweighs the transfer overhead, saving precious GPU cycles. This is especially useful in cases such as long, multi-turn conversations. Additionally, hardware improvements such as NVIDIA C2C greatly accelerate CPU-GPU communication, making offloading even more compelling.

Proposed Change.

Design

The design space for KV cache offloading is broad. As an initial goal, we propose focusing primarily on offloading to the CPU. While we aim to keep the interface and implementation extensible—enabling future support for offloading to other media such as disk or remote storage—those targets are out of scope for this RFC.

A key design consideration is determining when to swap KV cache blocks out to the CPU and when to swap them back into the device.

For swap-out, the earliest opportunity is immediately after a KV cache block is generated, while the latest is just before it is evicted from the device. For swap-in, the earliest timing can be guided by a prefetching policy, while the latest is just before the next forward() call.

In this RFC, we propose a lazy swap-in/swap-out approach that runs after each scheduling step. Optimizations such as eager eviction, prefetching, or even layer-wise transfers can be added independently in the future to improve performance.

Specifically, during each scheduling step, the KV cache manager will accumulate swap-in and swap-out decisions for each request, and generate a swap plan at the end of the step. This swap plan becomes part of the scheduler output and is executed by the model runner prior to the model forward.
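As a rough illustration of this flow, the swap plan could simply be a container of block-copy instructions that the model runner drains before the forward pass. The names below (SwapPlan, swap_in, swap_out, execute_swap_plan) are hypothetical and only sketch the shape of the data, not actual vLLM code:

```python
from dataclasses import dataclass, field


@dataclass
class SwapPlan:
    """Hypothetical per-step swap plan emitted by the KV cache manager."""

    # (cpu_block_id, gpu_block_id) pairs to copy CPU -> GPU before forward().
    swap_in: list[tuple[int, int]] = field(default_factory=list)
    # (gpu_block_id, cpu_block_id) pairs to copy GPU -> CPU.
    swap_out: list[tuple[int, int]] = field(default_factory=list)


def execute_swap_plan(plan: SwapPlan, gpu_kv, cpu_kv) -> None:
    """Model-runner side: apply the plan before running the model forward."""
    for cpu_id, gpu_id in plan.swap_in:
        gpu_kv[gpu_id].copy_(cpu_kv[cpu_id], non_blocking=True)
    for gpu_id, cpu_id in plan.swap_out:
        cpu_kv[cpu_id].copy_(gpu_kv[gpu_id], non_blocking=True)
```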

Interface

The KV cache manager will continue to manage all the metadata through roughly the same API, with minor changes (a sketch follows the list below):

  • get_computed_blocks: Additionally returns the set of KV blocks currently cached on the CPU.
  • allocate_slots: Additionally allocates new device blocks to receive the CPU blocks scheduled for swap-in (i.e. the cache-hit CPU blocks returned by get_computed_blocks).
  • end_schedule_step: A new hook called at the end of a scheduling step; it saves the full "swap plan" to the scheduler output. This simplifies the code by avoiding the need to thread scheduler state through the KV cache manager internals.
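A minimal sketch of what the amended manager interface could look like; signatures are simplified and illustrative, and Request, KVCacheBlock, and SwapPlan are placeholders for the real types rather than actual vLLM code:

```python
from typing import Optional


class KVCacheManager:
    def get_computed_blocks(
        self, request: "Request"
    ) -> tuple[list["KVCacheBlock"], list["KVCacheBlock"]]:
        """Return (device_hit_blocks, cpu_hit_blocks) for the request prefix."""
        ...

    def allocate_slots(
        self,
        request: "Request",
        num_tokens: int,
        cpu_hit_blocks: Optional[list["KVCacheBlock"]] = None,
    ) -> Optional[list["KVCacheBlock"]]:
        """Allocate device blocks for new tokens and, additionally, device
        blocks that will receive the CPU blocks scheduled for swap-in."""
        ...

    def end_schedule_step(self) -> "SwapPlan":
        """New hook: called at the end of a scheduling step to hand the
        accumulated swap plan to the scheduler output."""
        ...
```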

BlockPool

Inside the KV cache manager, we refactor BlockPool into an abstract base class and derive tier-specific implementations from it.

Abstract Base Class: BlockPool has the following methods:

  • get_num_free_blocks()
  • get_usage()
  • get_cached_block(self, block_hash: BlockHashType) -> Optional[KVCacheBlock]
  • get_new_blocks(self, num_blocks: int) -> list[KVCacheBlock]
  • _maybe_evict_cached_block(self, block: KVCacheBlock) -> bool
    • To support eviction to next-tier storage, this can take another block pool that shelters the evicted block.
  • (new) _maybe_shelter_evicted_block
    • Optionally used by a lower-tier block pool to shelter blocks evicted from an upper tier.

  • GpuBlockPool(BlockPool): contains the current logic used for managing GPU memory.
  • CpuBlockPool(BlockPool): a new implementation that manages CPU-side KV cache blocks.
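A minimal sketch of how that hierarchy could look; method bodies are elided, and the `next_tier` argument and exact signatures are assumptions made for illustration rather than the actual implementation:

```python
from abc import ABC, abstractmethod
from typing import Optional


class BlockPool(ABC):
    """Hypothetical abstract base class for tier-specific block pools."""

    @abstractmethod
    def get_num_free_blocks(self) -> int: ...

    @abstractmethod
    def get_usage(self) -> float: ...

    @abstractmethod
    def get_cached_block(self, block_hash: "BlockHashType") -> Optional["KVCacheBlock"]: ...

    @abstractmethod
    def get_new_blocks(self, num_blocks: int) -> list["KVCacheBlock"]: ...

    def _maybe_evict_cached_block(
        self, block: "KVCacheBlock", next_tier: Optional["BlockPool"] = None
    ) -> bool:
        """Evict a cached block; if a lower-tier pool is given, offer the
        evicted block to it for sheltering."""
        if next_tier is not None:
            next_tier._maybe_shelter_evicted_block(block)
        return True

    def _maybe_shelter_evicted_block(self, block: "KVCacheBlock") -> bool:
        """Optionally keep a block evicted from the tier above (default: drop)."""
        return False


class GpuBlockPool(BlockPool):
    """Would carry over today's GPU block-management logic."""


class CpuBlockPool(BlockPool):
    """New pool for CPU-side KV cache blocks."""
```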

User-Facing Configuration

We can repurpose the existing --swap-space flag (previously unused in V1) to control the number of CPU cache blocks. However, the current default of 4GB may need to be re-evaluated.
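For example, assuming the flag keeps its existing semantics (CPU swap space in GiB) and is simply repurposed to size the CPU KV cache, enabling a larger cache might look like the snippet below; the exact knob and default are still open questions, so treat this as illustrative only:

```python
from vllm import LLM

# Illustrative only: reuse the existing swap_space / --swap-space knob
# (CPU memory in GiB) to size the CPU-side KV cache.
llm = LLM(model="facebook/opt-125m", swap_space=16)
```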

Performance

In the initial version, we will try to hide the transfer latency with asynchronous transfers (i.e. pinned memory, cudaMemcpyAsync, and/or separate streams). For the CPU eviction policy, we will start with round-robin for simplicity; LRU will be added next.
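A minimal PyTorch sketch of that mechanism, using a pinned host buffer, a dedicated copy stream, and an event so the device-to-host copy can overlap with compute; tensor shapes and names are illustrative, not the actual KV cache layout:

```python
import torch

kv_block_shape = (2, 16, 8, 128)  # illustrative (K/V, tokens, heads, head_dim)
gpu_block = torch.randn(kv_block_shape, device="cuda")
cpu_block = torch.empty(kv_block_shape, pin_memory=True)  # pinned host memory

copy_stream = torch.cuda.Stream()
copy_done = torch.cuda.Event()

# Swap-out: issue the async D2H copy on a side stream so it can overlap
# with the model forward running on the default stream.
with torch.cuda.stream(copy_stream):
    cpu_block.copy_(gpu_block, non_blocking=True)
    copy_done.record(copy_stream)

# ... model forward on the default stream ...

# Before freeing or reusing the GPU block, make sure the copy has completed.
copy_done.synchronize()
```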

In the future, we could employ more sophisticated techniques such as prefetching, eager swap-out, and layer-wise transfer (the implementation can probably be shared with disaggregation) to further hide the transfer latency.

Plan

  1. Refactor BlockPool
  2. Add CpuBlockPool and the rest of the KV cache manager logic to generate a "swap plan" for each scheduler step
  3. Add the remaining logic that consumes the "swap plan" and executes the swaps
  4. Benchmark and apply low-hanging-fruit optimizations

Feedback Period.

one week

CC List.

@comaniac @WoosukKwon @zhuohan123 @simon-mo

Any Other Things.

An initial prototype is implemented in #13377

mengzhu28 (Apr 07 '25)

Thanks for the RFC. In general I agree that KV cache offloading could be useful in certain scenarios, and the proposed approach looks reasonable to me. Meanwhile, please add the following discussions to the RFC, as they are important and we should keep a record for future reference.

  1. Could you clarify and analyze why we should not integrate existing solutions such as LMCache for KV cache offloading? Also, how could this RFC potentially be made compatible with LMCache?
  2. How does this RFC work with the hybrid memory allocator?

For the proposed design, one lesson we learned in V0 is that introducing KV cache offloading (i.e., a CPU allocator) can make the KV cache manager super complex. Thus, I hope to hide the complexity of KV cache offloading and reuse the current code as much as possible. For example, the related changes should be well isolated in separate functions and data structures. We could even consider having another KVCacheManager for the CPU and let the current KVCacheManager interact with it if possible, so that we could potentially reuse much of the logic in the block pool, etc.

Accordingly, I'm wondering whether we could still have one BlockPool implementation for different kinds of blocks. After all, the block pool data structure doesn't really care which device these blocks are on.

Also cc @WoosukKwon @heheda12345

comaniac (Apr 07 '25)

Thanks for the RFC. I think we can keep a lightweight CPU KV cache offloading implementation inside vLLM and rely on projects like LMCache for more advanced KV cache management. Some questions from my side:

  1. Can KV cache offloading share the same interface as disaggregated prefill? Basically, both offloading and PD need to send the KV cache somewhere outside the current GPU and fetch it back from outside the GPU, so it seems possible.
  2. What is the relationship between the CPU memory pool and the GPU memory pool? Should the GPU pool be a subset of the CPU pool, or do the GPU and CPU pools hold distinct sets of KV cache?
  3. What will the software architecture be if we want to enable more optimizations, e.g., prefetching or layer-wise offloading? Prefetching may also require changes to the scheduler.

heheda12345 (Apr 08 '25)

Thanks for the comments @comaniac @heheda12345! Re: LMCache. Yeah, I think a lightweight solution where users can just flip a flag to use CPU memory without any other dependency would be very handy. That being said, it is possible to reuse some of the same vLLM hooks/interfaces in both cases (see the disaggregation-related discussion below).

Re: hybrid allocator. It should work with the hybrid allocator. Some small changes to handle different KV cache shapes might be required, e.g. in the CPU KV tensor initialization code.

Re: KV cache manager complexity and sharing an interface with disaggregation. Yes, that's a great point. As mentioned, features like layer-wise transfer/recv are highly relevant to both offloading and disaggregation. I just discovered #15960. There are actually quite a few design commonalities: both accumulate swap/send/recv operations during a scheduler step and have the workers execute them. I think it might make sense to build offloading on top of the KV connector (i.e. as a special type of connector). (cc @ApostaC since you authored the KV connector. Context: we are thinking of building CPU KV offload on top of the KV connector.)

Re: the relationship between the CPU and GPU pools. In this design, there is no subset relationship: a block can reside on both CPU and GPU (in the case of eager offload), only on the CPU (in the case of lazy eviction), or, of course, only on the GPU.

mengzhu28 (Apr 11 '25)

@mengzhu28 Thanks for the RFC and the prototype PR.

The following are my suggestions:

  • Do an abstraction first and make the default config behavior clear and simple.
  • As we tested and made some improvements based on your prototype PR, I found that the lazy mode and the eager mode for swap-out eventually end up being the same. For production use cases, the GPU cache hit rate is actually extremely low because the inference engine is very busy, so the CPU blocks design is a necessary part.
  • This CPU offload feature should coexist with LMCache: this can be the built-in feature, and LMCache can be the advanced offloading option.

In any case, offloading is a necessary feature for V1.

maobaolong (Apr 12 '25)

@chunxiaozheng how about your idea?

maobaolong (Apr 12 '25)

I think so. In addition, I have some questions:

  • The eviction algorithm for CPU blocks currently uses FIFO, which is basically unusable in production; it could be made consistent with the GPU block policy (see the sketch below).
  • Add some metrics for CPU offloading.
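For reference, a minimal LRU structure over CPU block IDs could look like the sketch below; this is purely illustrative and not vLLM's internal implementation:

```python
from collections import OrderedDict
from typing import Optional


class LRUCpuBlockCache:
    """Tiny illustrative LRU cache keyed by CPU block id."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks: "OrderedDict[int, object]" = OrderedDict()

    def touch(self, block_id: int) -> None:
        """Mark a block as recently used."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)

    def put(self, block_id: int, block: object) -> Optional[int]:
        """Insert a block; return the id of the evicted block, if any."""
        self.blocks[block_id] = block
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            evicted_id, _ = self.blocks.popitem(last=False)  # least recently used
            return evicted_id
        return None
```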

In fact, we have already implemented the above features in our internal version and will propose some PRs separately.

chunxiaozheng (Apr 14 '25)

Hi @mengzhu28, what do you think of the ideas above?

chunxiaozheng (Apr 16 '25)

  • the eviction algorithm for CPU blocks currently uses FIFO, which is basically unusable in production, this can be consistent with GPU blocks
  • add some metrics for CPU offloading

Glad to hear that you are already testing the PR in your environment! Both improvements make sense. The FIFO policy was a placeholder; you should definitely customize it depending on your workload.

mengzhu28 (Apr 16 '25)

I see. I think the current priority is to abstract the interfaces, e.g. extracting AbstractBlockPool and AbstractKVCacheManager from BlockPool and KVCacheManager, so that we can implement CpuOffloadingBlockPool and CpuOffloadingKVCacheManager on top of them. This makes extension convenient and minimizes intrusion into the existing code.

chunxiaozheng (Apr 17 '25)

@chunxiaozheng

Sounds good, it would be nice to add an abstraction over the existing code first, just like the V0 CPU offload implementation.

maobaolong (Apr 17 '25)

Hi @mengzhu28, thanks for working on this. I am curious whether you plan to finalize the change and the associated PR anytime soon.

Besides, regarding making CPU offloading work with vLLM V1 disagg (https://github.com/vllm-project/vllm/pull/15960): if we build CPU offloading on top of the KV connector, then at any given time we could only use the KV connector either for CPU offloading or to run disagg, unless we implement some super KV connector that could handle both CPU offloading and disagg remote communication.

I think it's still beneficial to have a "native" CPU offloading feature within vLLM so that people can use any KV connector for their disagg use case while still having CPU offloading enabled.

liuzijing2014 (May 02 '25)

Hey @liuzijing2014, yes, we do plan to support CPU offloading as a native core feature without any other dependencies. We are actively working on this; please stay tuned.

mengzhu28 (May 03 '25)

@mengzhu28 Thanks for your work on this feature. Are you still working on it?

maobaolong (Jun 05 '25)

ping @mengzhu28

maobaolong (Jul 10 '25)

@maobaolong I think there is another ongoing effort for CPU offloading: #19854

ApostaC (Jul 10 '25)

@ApostaC Thanks for the information

maobaolong (Jul 10 '25)

Closing as superseded by https://github.com/vllm-project/vllm/issues/22605

ywang96 (Oct 07 '25)