Hierarchical Caching for SGLang
Motivation
While RadixTree-based context caching provides significant performance benefits, these gains are not always fully realized. A key bottleneck is the capacity limit of GPU memory. Currently, SGLang stores historical KV caches exclusively in GPU memory; whenever more memory is required for batch execution, existing caches are discarded.
To address this issue, we propose a hierarchical caching mechanism for LLM serving, treating GPU memory as an L1 cache, host memory as an L2 cache, and disk as an L3 cache (future). This PR introduces such a mechanism in SGLang through a separate host memory pool that backs up KV caches, allowing them to be reloaded into GPU memory when needed.
Modifications
- A HiRadixCache that extends RadixCache with host memory addresses and synchronization mechanisms (see the sketch after this list).
- A host memory pool that synchronizes with the device memory pool of KV caches.
- A memory controller that implements efficient data transfer between host and device, and handles various cache write policies for hierarchical caching.
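For intuition, here is a minimal, purely illustrative sketch of how these pieces could fit together; the class and method names are simplified placeholders rather than the actual SGLang implementation, and real slot management, eviction, and synchronization are omitted.

```python
# Illustrative sketch only (simplified names, not the real SGLang classes):
# each radix-tree node can hold a device location, a host backup, or both, and
# a small controller copies KV data between the GPU pool and a pinned host pool.
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

import torch


@dataclass
class HiRadixNode:
    device_indices: Optional[torch.Tensor] = None    # slots in the GPU KV pool
    host_indices: Optional[Tuple[int, int]] = None    # (start, length) in host pool
    children: Dict[int, "HiRadixNode"] = field(default_factory=dict)


class HostKVPool:
    """Pinned host memory that mirrors the layout of the device KV pool."""

    def __init__(self, num_slots: int, slot_numel: int, dtype=torch.float16):
        self.buffer = torch.empty(num_slots, slot_numel, dtype=dtype,
                                  pin_memory=True)


class CacheController:
    """Write-through policy: back up KV data to host on a side stream."""

    def __init__(self, host_pool: HostKVPool):
        self.host_pool = host_pool
        self.stream = torch.cuda.Stream()   # overlap copies with decoding

    def backup(self, node: HiRadixNode, device_kv: torch.Tensor,
               host_start: int) -> None:
        n = device_kv.shape[0]
        with torch.cuda.stream(self.stream):
            # Async D2H copy into a pinned slice (assumes device_kv stays valid).
            self.host_pool.buffer[host_start:host_start + n].copy_(
                device_kv, non_blocking=True)
        node.host_indices = (host_start, n)   # node now survives GPU eviction

    def load(self, node: HiRadixNode, device_kv: torch.Tensor) -> None:
        # Reload a previously evicted prefix from host back into GPU memory.
        start, n = node.host_indices
        with torch.cuda.stream(self.stream):
            device_kv.copy_(self.host_pool.buffer[start:start + n],
                            non_blocking=True)
        torch.cuda.current_stream().wait_stream(self.stream)
```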
Todo:
- Update benchmark results.
- Remove deprecated design and implementation.
Checklist
- [ ] Format your code according to the Contributor Guide.
- [ ] Add unit tests as outlined in the Contributor Guide.
- [ ] Update documentation as needed, including docstrings or example tutorials.
It's amazing! Happy new year!
While collecting performance numbers, I am breaking this PR into multiple small ones for easier reviewing (WIP): https://github.com/sgl-project/sglang/pull/2771 https://github.com/sgl-project/sglang/pull/2804 https://github.com/sgl-project/sglang/pull/2941 https://github.com/sgl-project/sglang/pull/2942 https://github.com/sgl-project/sglang/pull/3171 ...
Hi,
Thanks for the great work. I am leaving a comment because my team is working on something similar.
As I understand your PR, your hierarchical radix caching treats a single sequence as the unit of offloading, is that correct? That is, it offloads entire sequences, except the one currently in progress, at the sequence level.
In contrast, our work-in-progress PR works closely with our block-sparse attention mechanism: pruned KV pages are dynamically offloaded during decoding and the necessary KV pages are dynamically fetched from the CPU, using CUDA UVM. In other words, our offloading is done at the KV page-level.
We are almost done implementing dynamic KV cache offloading during decoding and chunked prefill for our hierarchically pruned attention (HiP attention). Our HiP attention mechanism can serve sequences longer than the pre-trained model's limit without significant throughput or performance degradation, while providing nearly linear attention complexity in prefill and decoding and training-free context extension. I have not yet pushed the updated library for the HiP attention kernels, but it is almost done. Hopefully, we will send a new PR within the next month.
Currently, we don't support NVMe-level offloading because we rely on CUDA UVM for CPU-GPU communication inside the attention kernel.
I think combining our offloading-aware attention mechanism with this PR could produce a strong synergy. I wonder if we can integrate the two in the near future.
Thanks. Heejun and @mujjingun
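To make the granularity distinction above concrete, here is a toy sketch; all names are hypothetical and neither function reflects the real SGLang or HiP code paths.

```python
# Toy illustration of the two offload granularities discussed above; the names
# are hypothetical and neither function mirrors an actual implementation.
from typing import Dict, List, Set


def offload_sequence_level(seq_pages: Dict[str, List[int]],
                           active: Set[str]) -> List[int]:
    """Offload every KV page of every sequence that is not currently running."""
    return [page for seq_id, pages in seq_pages.items()
            if seq_id not in active
            for page in pages]


def offload_page_level(seq_pages: Dict[str, List[int]],
                       hot_pages: Set[int]) -> List[int]:
    """Offload only pages the sparse attention pattern did not select,
    even within the sequence currently being decoded."""
    return [page for pages in seq_pages.values()
            for page in pages
            if page not in hot_pages]
```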
Thank you @gmlwns2000 for the great work. HiP attention looks promising for long context, and I am looking forward to combining our efforts in optimizing data transfer between host and device. Please let me know if you need any help contributing it to SGLang : )
Great work. However, UVM will only provide an extension to CPU memory, and going to disk will cause significant performance regressions. The right design for KV cache offloading is explicit management.
Pruning is a different concept and should not be attached to this PR; it should be treated as a separate technique.
Please understand that pruning concerns how the KV cache is organized and stored, not how it is managed. This PR targets the management side of the KV cache!
@zhyncs @ByronHsu Hey fellas - now you know what I was up to :P
@msharmavikram Thanks for advice!
I was planning to open a new PR (adding HiP attention, supporting training-free context extension, and supporting UVM KV cache offloading for decoding), but before trying to integrate Hierarchical Caching with my method, I wanted to ask the author of this PR whether that idea is good to go.
I think hierarchical caching and UVM caching should be integrated because GPU memory is severely limited in many long-context use cases (consumer GPUs such as the 4090 only have 24GB, which is enough for roughly 64-128K tokens, but I want to handle around 1M tokens to match Gemini). We can extend the length of a single sequence with UVM, but then we run out of CPU and GPU memory and cannot make proper use of radix caching. That is why I am looking into this PR.
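As a rough sanity check on those numbers (assuming a hypothetical 8B-class GQA model in fp16; exact figures vary by model):

```python
# Back-of-the-envelope KV cache sizing for an assumed 8B-class GQA model:
# 32 layers, 8 KV heads, head_dim 128, fp16. Real configurations differ.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
print(kv_bytes_per_token // 1024)              # 128 KiB per token

print(64_000 * kv_bytes_per_token / 2**30)     # ~7.8 GiB of KV for 64K tokens
print(1_000_000 * kv_bytes_per_token / 2**30)  # ~122 GiB of KV for 1M tokens
# With ~16 GB of fp16 weights, a 24 GB card fits roughly 64K tokens of KV cache,
# while 1M tokens far exceeds GPU memory and has to spill to host (or disk).
```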
However, I am currently working on other things (paper writing), so my new PR is getting delayed. I am sorry about that.
In addition to this, @xiezhq-hermann, I have concerns about the license of my HiP attention in my future PR. Can you check the new https://github.com/sgl-project/sglang/discussions/3042 discussion I just made?
Hierarchical caching and UVM caching are not the same. Hierarchical caching can use UVM caching as a mechanism or can do without UVM. What I am trying to say is - hierarchical caching is a superset and achieving it can be done by many mechanisms like UVM. This is why I strongly recommended to do a separate PR that extends this work such that both mechanisms are supported (with UVM and without).
Now I understand that hierarchical caching aims to be a more general framework than I thought. I will keep watching this PR, and I will implement hierarchical caching for my attention mechanism in the future by following the implementation proposed here.
Thanks!
I think UVM relies on page faults to fetch data, which has much higher overhead than writing a cache controller and can cause thrashing. You can indeed use cudaMemPrefetchAsync and cudaMemAdvise, but PyTorch does not support them (https://github.com/pytorch/pytorch/pull/106200), probably for the above reasons.
@Edenzzzz Yes, I used cudaMemAdvise to make the pages stay mostly on the CPU, so if my understanding is correct, the latency should be stable on the CPU side. However, I am not sure whether that speed is sufficient, because I have never measured the CPU read latency after a few iterations of decoding requests. I think your concern is quite important, and I will check this issue when I start integrating the UVM cache into Hierarchical Caching as a cache layer.
Thanks for the comment.
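For reference, the two calls mentioned above can be driven from Python via the CUDA runtime directly; a minimal sketch, assuming libcudart is on the loader path and a GPU is present (error checking omitted):

```python
# Sketch of the UVM calls discussed above, invoked through ctypes because
# PyTorch does not expose them. Assumes libcudart.so is loadable as-is.
import ctypes

cudart = ctypes.CDLL("libcudart.so")

cudaMemAttachGlobal = 1                    # flag for cudaMallocManaged
cudaMemAdviseSetPreferredLocation = 3      # CUDA runtime enum value
cudaCpuDeviceId = -1                       # "the CPU" as a device id

nbytes = 1 << 30                           # 1 GiB of managed (UVM) memory
ptr = ctypes.c_void_p()
cudart.cudaMallocManaged(ctypes.byref(ptr), ctypes.c_size_t(nbytes),
                         cudaMemAttachGlobal)

# Keep the pages resident on the CPU by default, as described above.
cudart.cudaMemAdvise(ptr, ctypes.c_size_t(nbytes),
                     cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId)

# Explicitly prefetch a hot region to GPU 0 before a kernel touches it,
# instead of paying the on-demand page-fault cost.
hot_bytes = 64 << 20
cudart.cudaMemPrefetchAsync(ptr, ctypes.c_size_t(hot_bytes), 0, None)

cudart.cudaDeviceSynchronize()
cudart.cudaFree(ptr)
```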
After code cleaning and a basic performance benchmark, this PR is ready to merge. You can add the --enable-hierarchical-cache option when starting an SGLang server to turn on this feature. It will remain under active development in the coming months, and your feedback is greatly welcomed : )
The following is a throughput vs. median TTFT curve that demonstrates the benefit of hierarchical caching on a synthetic multi-turn benchmark; you can reproduce it with Qwen/Qwen2.5-14B-Instruct on an A100-80G GPU as explained here:
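For anyone trying this out, a minimal way to enable the flag; the CLI option is the one introduced in this PR, while passing it as an Engine keyword argument is my assumption based on how other server arguments are exposed, so double-check against the docs.

```python
# Hedged sketch: enabling hierarchical caching from the offline Engine API.
# The keyword form of the flag is assumed; the CLI flag itself is from this PR.
import sglang as sgl

llm = sgl.Engine(
    model_path="Qwen/Qwen2.5-14B-Instruct",
    enable_hierarchical_cache=True,   # assumed kwarg form of --enable-hierarchical-cache
)
print(llm.generate("The capital of France is", {"max_new_tokens": 8}))

# Equivalent server launch from the command line:
#   python -m sglang.launch_server --model-path Qwen/Qwen2.5-14B-Instruct \
#       --enable-hierarchical-cache
```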
DeepSeek MLA is not supported yet, and an error will be reported when starting the model:
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1849, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 305, in __init__
HiRadixCache(
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/hiradix_cache.py", line 26, in __init__
self.token_to_kv_pool_host = MLATokenToKVPoolHost(token_to_kv_pool)
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 461, in __init__
self.head_num = device_pool.head_num
AttributeError: 'MLATokenToKVPool' object has no attribute 'head_num'
Thank you @lambert0312 for pointing this out. Yes, this feature is still at an early stage and currently only supports MHA- and GQA-style memory pools. I will keep you posted once MLA is supported, which should be soon. For further questions about this feature, feel free to reach out to me on the SGLang Slack for a quicker reply.
Thanks @xiezhq-hermann
@lambert0312 just FYI, there is a PR from the community supporting MLA with hierarchical caching, which will be merged soon but feel free to check it out: https://github.com/sgl-project/sglang/pull/4009
@xiezhq-hermann Thanks, but I've encountered a problem. I just experimented with https://github.com/sgl-project/sglang/pull/4009 and found that there is indeed a concurrency problem when TP>1: the program enters a locked state. Please follow up. Thank you!
Besides --enable-hierarchical-cache, do we also need to set cpu_offload_gb?
Right now it allocates a host memory pool that is 4 times the size of the device memory pool by default, so there is no need to set anything else; more options will be added.
Hi, I'm wondering - when are you planning to support the L3 cache? I think it's reasonable to support pluggable L3 caches, which would encourage storage providers to implement their own L3 caches according to their product features. What you need to do is define a set of KV cache APIs for getting/putting/evicting KV cache chunks/items and provide a demo implementation using something like a local SSD.
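Something along these lines is presumably what you have in mind; a hypothetical sketch only, none of these names exist in SGLang today.

```python
# Hypothetical pluggable L3 backend interface, as suggested above; the names
# are illustrative and do not exist in SGLang.
import os
from abc import ABC, abstractmethod
from typing import Optional

import torch


class L3KVStorage(ABC):
    """Backend-agnostic store for KV cache chunks keyed by a prefix hash."""

    @abstractmethod
    def put(self, key: str, kv_chunk: torch.Tensor) -> None:
        """Persist one KV chunk (e.g. to local SSD or a remote object store)."""

    @abstractmethod
    def get(self, key: str) -> Optional[torch.Tensor]:
        """Return the chunk if present, else None."""

    @abstractmethod
    def evict(self, key: str) -> None:
        """Drop a chunk when the backend's capacity policy demands it."""


class LocalSSDStorage(L3KVStorage):
    """Demo backend: one file per chunk under a local directory."""

    def __init__(self, root: str = "/tmp/sglang_l3"):
        os.makedirs(root, exist_ok=True)
        self.root = root

    def _path(self, key: str) -> str:
        return os.path.join(self.root, f"{key}.pt")

    def put(self, key: str, kv_chunk: torch.Tensor) -> None:
        torch.save(kv_chunk.cpu(), self._path(key))

    def get(self, key: str) -> Optional[torch.Tensor]:
        path = self._path(key)
        return torch.load(path) if os.path.exists(path) else None

    def evict(self, key: str) -> None:
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            pass
```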
This is in the works, @wangyibin-gh!
When do you expect this feature to be merged? And by the way, is there any documentation about it, especially w.r.t. the APIs?
