Host/CPU memory usage for prefix cache
Feature request
Automatically use CPU memory for the KV cache when VRAM is insufficient.
Motivation
Prefix cache hit/miss ratios have a huge impact on performance. On a 48GB GPU serving a ~70B model to several concurrent users, VRAM is generally too small to hold a prefix cache large enough to be useful, especially for long-context applications.
Current SOTA techniques use CPU host RAM (and even SSD storage) to ensure many (long) prefixes can be stored. This opens up new use cases in VRAM-constrained environments.
Research:
To solve this problem, we developed an inter-turn caching system. For every prefilled prefix and generated message, we cache the KV values on host memory and retrieve them for future queries. Similar to RadixAttention (Zheng et al., 2023), we organize cached KV tensors in an LRU cache with a tree structure. The cached KV values are indexed by a rolling hash of prefix tokens. For each new query, a rolling hash is calculated for each prefix of the context, and the cache is retrieved for the longest match. This allows reusing the cache even for partially matched messages.
- https://research.character.ai/optimizing-inference/
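To make that lookup concrete, here is a minimal Python sketch of a rolling-hash prefix index. It is purely illustrative: the class name `PrefixKVCache`, the hash constants, and the flat hash map (instead of the tree structure the quote describes) are all assumptions, not the Character.AI implementation.

```python
from collections import OrderedDict

MOD = (1 << 61) - 1   # large prime modulus for the rolling hash (assumption)
BASE = 1_000_003      # hash base (assumption)


class PrefixKVCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        # Maps rolling hash of a token prefix -> opaque handle to host-memory
        # KV tensors, kept in least-recently-used order.
        self._store: OrderedDict[int, object] = OrderedDict()

    @staticmethod
    def _prefix_hashes(tokens: list[int]):
        """Yield (prefix_length, rolling_hash) for every prefix of `tokens`."""
        h = 0
        for i, tok in enumerate(tokens):
            h = (h * BASE + tok) % MOD
            yield i + 1, h

    def insert(self, tokens: list[int], kv_handle: object) -> None:
        """Cache the KV handle for the full token sequence, evicting LRU entries."""
        h = 0
        for tok in tokens:
            h = (h * BASE + tok) % MOD
        self._store[h] = kv_handle
        self._store.move_to_end(h)
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)  # drop the least-recently-used entry

    def longest_match(self, tokens: list[int]) -> tuple[int, object | None]:
        """Return (matched_length, kv_handle) for the longest cached prefix."""
        best_len, best_kv = 0, None
        for length, h in self._prefix_hashes(tokens):
            if h in self._store:
                self._store.move_to_end(h)  # mark as recently used
                best_len, best_kv = length, self._store[h]
        return best_len, best_kv
```

With this, a new query hashes each prefix of its context, reuses the KV handle of the longest cached match, and only the unmatched tail needs to be prefilled.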
CacheGen [SIGCOMM'24]: efficiently encodes KV caches into bitstreams and stores them on disk. This allows an unlimited amount of KV cache to be stored on cheap disks and shared across multiple vLLM instances. It is in contrast with vLLM, which stores KV caches only in one LLM instance's internal GPU and CPU memory. This capability is pressingly needed in multi-round chat apps that use many vLLM instances to serve many users.
- https://github.com/LMCache/LMCache
- https://x.com/lmcache/status/1836136395477520507
- https://lmcache.github.io/2024-09-17-release/
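The feature requested here is essentially the first tier of that storage hierarchy: spill cold KV blocks from VRAM to host RAM and promote them back on a hit. A minimal PyTorch sketch, with hypothetical class and method names; this is not vLLM's or LMCache's actual code, and CacheGen additionally compresses KV caches into bitstreams before they ever reach disk.

```python
import torch

class TwoTierKVStore:
    def __init__(self, device: str = "cuda"):
        self.device = device
        self.gpu: dict[int, torch.Tensor] = {}   # hot tier, keyed by prefix hash
        self.host: dict[int, torch.Tensor] = {}  # cold tier in pinned CPU RAM

    def spill(self, key: int) -> None:
        """Move a KV block from VRAM to pinned host memory."""
        kv = self.gpu.pop(key)
        host_kv = torch.empty(kv.shape, dtype=kv.dtype, device="cpu", pin_memory=True)
        host_kv.copy_(kv, non_blocking=True)  # async device-to-host copy over PCIe
        self.host[key] = host_kv

    def fetch(self, key: int) -> torch.Tensor | None:
        """Return the KV block on-device, promoting from host memory if needed."""
        if key in self.gpu:
            return self.gpu[key]
        if key in self.host:
            kv = self.host.pop(key).to(self.device, non_blocking=True)
            self.gpu[key] = kv
            return kv
        return None  # full miss: the caller has to recompute the prefill
```

A production version would also pick which blocks to spill (e.g. LRU), bound host memory usage, and synchronize CUDA streams before the restored KV is read.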
Your contribution
This issue.
We're not entirely sure this is really the way to go.
Typical deployments have multiple replicas. With a CPU/disk KV cache you need sticky sessions if you don't want to duplicate the KV cache n times across n replicas (and with uniform random routing you face roughly an (n - 1) / n cache-miss probability anyway).
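To spell out that probability claim: if a user's KV cache lives on exactly one of n replicas and the load balancer routes uniformly at random, each request misses with probability (n - 1) / n. A toy check, under exactly those assumptions:

```python
import random

def simulated_miss_rate(n_replicas: int, n_requests: int = 100_000) -> float:
    cache_replica = 0  # the one replica holding this user's KV cache
    misses = sum(
        1 for _ in range(n_requests)
        if random.randrange(n_replicas) != cache_replica
    )
    return misses / n_requests

for n in (2, 4, 8):
    print(f"n={n}: simulated {simulated_miss_rate(n):.3f}, expected {(n - 1) / n:.3f}")
```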
In real settings with multiple replicas, a centralized KV cache might be more attractive. The problem is that the sheer size of the KV cache would put a lot of strain on the network.
If you're using sticky sessions, a CPU/disk KV cache makes more sense, but it opens its own can of worms.
All in all, it might happen, but this is not our priority at the moment; we're focusing on making everything faster by doing cleverer compute first. Once that's done, we'll have better metrics on cache hit/miss rates and the time it actually takes to recompute long sequences. With that in mind, it'll be much easier to assess a correct caching solution.
Gotcha, makes sense.
For reference, I use sticky sessions, and it's not much of a can of worms in my case: I just keep track of how utilized each machine currently is and preferentially route users to the machine they were last on, as long as that machine is not too utilized. It seems to work great.
I fail to route to the last-used machine about 2% of the time because it is too utilized, which just means the user has to wait for a full prefill on roughly 1 in every 50 requests.
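For illustration, that policy might look roughly like the sketch below; all names and the 0.9 utilization cutoff are hypothetical stand-ins, not the actual code.

```python
from dataclasses import dataclass, field

@dataclass
class StickyRouter:
    utilization: dict[str, float]  # machine id -> current load in [0, 1]
    last_machine: dict[str, str] = field(default_factory=dict)  # user id -> machine id
    max_load: float = 0.9          # hypothetical "too utilized" cutoff

    def route(self, user_id: str) -> str:
        # Prefer the machine that already holds this user's KV cache.
        sticky = self.last_machine.get(user_id)
        if sticky is not None and self.utilization[sticky] < self.max_load:
            return sticky
        # Fallback (the ~2% case above): least-loaded machine, full prefill cost.
        machine = min(self.utilization, key=self.utilization.get)
        self.last_machine[user_id] = machine
        return machine
```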
In my case I wouldn't be able to have a centralized KV cache, because my machines are spread all over the place (entirely different datacenters) and the network between them is almost certainly far too slow.
I'm anticipating more "streaming modalities" (like speech-to-speech chat and the recent Counter-Strike model), which will almost certainly mean my infrastructure needs to be built around the idea of sticky sessions.