Hierarchical Caching for SGLang
Motivation
While RadixTree-based context caching provides significant performance benefits, these gains are not always fully realized. A key bottleneck is the capacity limit of GPU memory. Currently, SGLang stores historical KV caches exclusively in GPU memory; whenever more memory is required for batch execution, existing caches are discarded.
To address this issue, we propose a hierarchical caching mechanism for LLM serving, treating GPU memory as an L1 cache, host memory as an L2 cache, and disk as an L3 cache (future). This PR introduces such a mechanism in SGLang through a separate host memory pool that backs up KV caches, allowing them to be reloaded into GPU memory when needed.
Modifications
- A HiRadixCache that extends RadixCache with host memory addresses and synchronization mechanisms (see the sketch after this list).
- A host memory pool that synchronizes with the device memory pool of KV caches.
- A memory controller that implements efficient data transfer between host and device, and handles various cache write policies for hierarchical caching.
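For intuition, here is a minimal, purely illustrative sketch of how these pieces could fit together; the class and method names are simplified placeholders rather than the actual SGLang implementation, and real slot management, eviction, and synchronization are omitted.

```python
# Illustrative sketch only (simplified names, not the real SGLang classes):
# each radix-tree node can hold a device location, a host backup, or both, and
# a small controller copies KV data between the GPU pool and a pinned host pool.
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

import torch


@dataclass
class HiRadixNode:
    device_indices: Optional[torch.Tensor] = None    # slots in the GPU KV pool
    host_indices: Optional[Tuple[int, int]] = None    # (start, length) in host pool
    children: Dict[int, "HiRadixNode"] = field(default_factory=dict)


class HostKVPool:
    """Pinned host memory that mirrors the layout of the device KV pool."""

    def __init__(self, num_slots: int, slot_numel: int, dtype=torch.float16):
        self.buffer = torch.empty(num_slots, slot_numel, dtype=dtype,
                                  pin_memory=True)


class CacheController:
    """Write-through policy: back up KV data to host on a side stream."""

    def __init__(self, host_pool: HostKVPool):
        self.host_pool = host_pool
        self.stream = torch.cuda.Stream()   # overlap copies with decoding

    def backup(self, node: HiRadixNode, device_kv: torch.Tensor,
               host_start: int) -> None:
        n = device_kv.shape[0]
        with torch.cuda.stream(self.stream):
            # Async D2H copy into a pinned slice (assumes device_kv stays valid).
            self.host_pool.buffer[host_start:host_start + n].copy_(
                device_kv, non_blocking=True)
        node.host_indices = (host_start, n)   # node now survives GPU eviction

    def load(self, node: HiRadixNode, device_kv: torch.Tensor) -> None:
        # Reload a previously evicted prefix from host back into GPU memory.
        start, n = node.host_indices
        with torch.cuda.stream(self.stream):
            device_kv.copy_(self.host_pool.buffer[start:start + n],
                            non_blocking=True)
        torch.cuda.current_stream().wait_stream(self.stream)
```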
Todo:
- Update benchmark results.
- Remove deprecated design and implementation.
Checklist
- [ ] Format your code according to the Contributor Guide.
- [ ] Add unit tests as outlined in the Contributor Guide.
- [ ] Update documentation as needed, including docstrings or example tutorials.
It's amazing! Happy new year!
While collecting performance numbers, I am breaking this PR into multiple small ones for easier reviewing (WIP): https://github.com/sgl-project/sglang/pull/2771 https://github.com/sgl-project/sglang/pull/2804 https://github.com/sgl-project/sglang/pull/2941 https://github.com/sgl-project/sglang/pull/2942 https://github.com/sgl-project/sglang/pull/3171 ...
Hi,
Thanks for the great work. I am leaving a comment because my team is working on something similar.
As I understand your PR, your hierarchical radix caching treats a single sequence as the unit of offloading, is that correct? That is, it offloads entire sequences, except the one currently in progress, at the sequence level.
In contrast, our work-in-progress PR works closely with our block-sparse attention mechanism: pruned KV pages are dynamically offloaded during decoding and the necessary KV pages are dynamically fetched from the CPU, using CUDA UVM. In other words, our offloading is done at the KV page-level.
We are almost done implementing dynamic KV cache offloading during decoding and chunked prefill for our hierarchically pruned attention (HiP attention). Our HiP attention mechanism can serve sequences longer than the pre-trained model's limit without significant throughput or performance degradation, while providing nearly linear attention complexity in prefill and decoding and training-free context extension. I have not yet pushed the updated library for the HiP attention kernels, but it is almost done. Hopefully, we will send a new PR within the next month.
Currently, we don't support NVMe-level offloading because we rely on CUDA UVM for CPU-GPU communication inside the attention kernel.
I think combining our offloading-aware attention mechanism with this PR could produce a strong synergy. I wonder if we can integrate the two in the near future.
Thanks. Heejun and @mujjingun
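To make the granularity distinction above concrete, here is a toy sketch; all names are hypothetical and neither function reflects the real SGLang or HiP code paths.

```python
# Toy illustration of the two offload granularities discussed above; the names
# are hypothetical and neither function mirrors an actual implementation.
from typing import Dict, List, Set


def offload_sequence_level(seq_pages: Dict[str, List[int]],
                           active: Set[str]) -> List[int]:
    """Offload every KV page of every sequence that is not currently running."""
    return [page for seq_id, pages in seq_pages.items()
            if seq_id not in active
            for page in pages]


def offload_page_level(seq_pages: Dict[str, List[int]],
                       hot_pages: Set[int]) -> List[int]:
    """Offload only pages the sparse attention pattern did not select,
    even within the sequence currently being decoded."""
    return [page for pages in seq_pages.values()
            for page in pages
            if page not in hot_pages]
```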
Thank you @gmlwns2000 for the great work. HiP attention looks promising for long context, and I am looking forward to combining our efforts in optimizing data transfer between host and device. Please let me know if you need any help contributing it to SGLang : )
Great work. However, UVM will only provide an extension to CPU memory, and going to disk will cause significant performance regressions. The right design for KV cache offloading is explicit management.
Pruning is a different concept and should not be attached to this PR; it should be treated as a separate technique.
Please understand that pruning concerns how the KV cache is organized and stored, not how it is managed. This PR targets the management side of the KV cache!
@zhyncs @ByronHsu Hey fellas - now you know what I was up to :P
@msharmavikram Thanks for advice!
I was planning to open a new PR (adding HiP attention, supporting training-free context extension, and supporting UVM KV cache offloading for decoding), but before trying to integrate Hierarchical Caching with my method, I wanted to ask the author of this PR whether that idea is good to go.
I think hierarchical caching and UVM caching should be integrated because GPU memory is severely limited in many long-context use cases (consumer GPUs such as the 4090 only have 24GB, which is enough for roughly 64-128K tokens, but I want to handle around 1M tokens to match Gemini). We can extend the length of a single sequence with UVM, but then we run out of CPU and GPU memory and cannot make proper use of radix caching. That is why I am looking into this PR.
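As a rough sanity check on those numbers (assuming a hypothetical 8B-class GQA model in fp16; exact figures vary by model):

```python
# Back-of-the-envelope KV cache sizing for an assumed 8B-class GQA model:
# 32 layers, 8 KV heads, head_dim 128, fp16. Real configurations differ.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
print(kv_bytes_per_token // 1024)              # 128 KiB per token

print(64_000 * kv_bytes_per_token / 2**30)     # ~7.8 GiB of KV for 64K tokens
print(1_000_000 * kv_bytes_per_token / 2**30)  # ~122 GiB of KV for 1M tokens
# With ~16 GB of fp16 weights, a 24 GB card fits roughly 64K tokens of KV cache,
# while 1M tokens far exceeds GPU memory and has to spill to host (or disk).
```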
However, I am currently working on other things (paper writing), so my new PR is getting delayed. I am sorry about that.
In addition to this, @xiezhq-hermann, I have concerns about the license of my HiP attention in my future PR. Can you check the new https://github.com/sgl-project/sglang/discussions/3042 discussion I just made?
Hierarchical caching and UVM caching are not the same. Hierarchical caching can use UVM caching as a mechanism or can do without UVM. What I am trying to say is - hierarchical caching is a superset and achieving it can be done by many mechanisms like UVM. This is why I strongly recommended to do a separate PR that extends this work such that both mechanisms are supported (with UVM and without).
Now I understand that hierarchical caching aims to be a more general framework than I thought. I will keep watching this PR, and I will implement hierarchical caching for my attention mechanism in the future by following the implementation proposed here.
Thanks!
I think UVM relies on page faults to fetch data, which has much higher overhead than writing a cache controller and can cause thrashing. You can indeed use cudaMemPrefetchAsync and cudaMemAdvise, but PyTorch does not support them (https://github.com/pytorch/pytorch/pull/106200), probably for the above reasons.
@Edenzzzz Yes, I used cudaMemAdvise to make the pages stay mostly on the CPU, so if my understanding is correct, the latency should be stable on the CPU side. However, I am not sure whether that speed is sufficient, because I have never measured the CPU read latency after a few iterations of decoding requests. I think your concern is quite important, and I will check this issue when I start integrating the UVM cache into Hierarchical Caching as a cache layer.
Thanks for the comment.
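For reference, the two calls mentioned above can be driven from Python via the CUDA runtime directly; a minimal sketch, assuming libcudart is on the loader path and a GPU is present (error checking omitted):

```python
# Sketch of the UVM calls discussed above, invoked through ctypes because
# PyTorch does not expose them. Assumes libcudart.so is loadable as-is.
import ctypes

cudart = ctypes.CDLL("libcudart.so")

cudaMemAttachGlobal = 1                    # flag for cudaMallocManaged
cudaMemAdviseSetPreferredLocation = 3      # CUDA runtime enum value
cudaCpuDeviceId = -1                       # "the CPU" as a device id

nbytes = 1 << 30                           # 1 GiB of managed (UVM) memory
ptr = ctypes.c_void_p()
cudart.cudaMallocManaged(ctypes.byref(ptr), ctypes.c_size_t(nbytes),
                         cudaMemAttachGlobal)

# Keep the pages resident on the CPU by default, as described above.
cudart.cudaMemAdvise(ptr, ctypes.c_size_t(nbytes),
                     cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId)

# Explicitly prefetch a hot region to GPU 0 before a kernel touches it,
# instead of paying the on-demand page-fault cost.
hot_bytes = 64 << 20
cudart.cudaMemPrefetchAsync(ptr, ctypes.c_size_t(hot_bytes), 0, None)

cudart.cudaDeviceSynchronize()
cudart.cudaFree(ptr)
```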
After code cleaning and a basic performance benchmark, this PR is ready to merge. You can add the --enable-hierarchical-cache option when starting an SGLang server to turn on this feature. It will remain under active development in the coming months, and your feedback is greatly welcomed : )
The following is a throughput vs. median TTFT curve that demonstrates the benefit of hierarchical caching on a synthetic multi-turn benchmark; you can reproduce it with Qwen/Qwen2.5-14B-Instruct on an A100-80G GPU as explained here:
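For anyone trying this out, a minimal way to enable the flag; the CLI option is the one introduced in this PR, while passing it as an Engine keyword argument is my assumption based on how other server arguments are exposed, so double-check against the docs.

```python
# Hedged sketch: enabling hierarchical caching from the offline Engine API.
# The keyword form of the flag is assumed; the CLI flag itself is from this PR.
import sglang as sgl

llm = sgl.Engine(
    model_path="Qwen/Qwen2.5-14B-Instruct",
    enable_hierarchical_cache=True,   # assumed kwarg form of --enable-hierarchical-cache
)
print(llm.generate("The capital of France is", {"max_new_tokens": 8}))

# Equivalent server launch from the command line:
#   python -m sglang.launch_server --model-path Qwen/Qwen2.5-14B-Instruct \
#       --enable-hierarchical-cache
```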
DeepSeek MLA is not supported yet, and an error will be reported when starting the model:
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1849, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 305, in __init__
HiRadixCache(
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/hiradix_cache.py", line 26, in __init__
self.token_to_kv_pool_host = MLATokenToKVPoolHost(token_to_kv_pool)
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 461, in __init__
self.head_num = device_pool.head_num
AttributeError: 'MLATokenToKVPool' object has no attribute 'head_num'
Thank you @lambert0312 for pointing this out. Yes, this feature is still at an early stage and currently only supports MHA- and GQA-style memory pools. I will keep you posted once MLA is supported, which should be soon. For further questions about this feature, feel free to reach out to me on the SGLang Slack for a quicker reply.
Thanks @xiezhq-hermann
@lambert0312 just FYI, there is a PR from the community supporting MLA with hierarchical caching, which will be merged soon but feel free to check it out: https://github.com/sgl-project/sglang/pull/4009
@xiezhq-hermann Thanks, but I've encountered a problem. I just experimented with https://github.com/sgl-project/sglang/pull/4009 and found that there is indeed a concurrency problem when TP>1: the program enters a locked state. Please follow up. Thank you!
Besides --enable-hierarchical-cache, do we also need to set cpu_offload_gb?
Right now it allocates a host memory pool that is 4 times the size of the device memory pool by default, so there is no need to set anything else; more options will be added.
Hi, I'm wondering - when are you planning to support the L3 cache? I think it's reasonable to support pluggable L3 caches, which would encourage storage providers to implement their own L3 caches according to their product features. What you need to do is define a set of KV cache APIs for getting/putting/evicting KV cache chunks/items and provide a demo implementation using something like a local SSD.
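Something along these lines is presumably what you have in mind; a hypothetical sketch only, none of these names exist in SGLang today.

```python
# Hypothetical pluggable L3 backend interface, as suggested above; the names
# are illustrative and do not exist in SGLang.
import os
from abc import ABC, abstractmethod
from typing import Optional

import torch


class L3KVStorage(ABC):
    """Backend-agnostic store for KV cache chunks keyed by a prefix hash."""

    @abstractmethod
    def put(self, key: str, kv_chunk: torch.Tensor) -> None:
        """Persist one KV chunk (e.g. to local SSD or a remote object store)."""

    @abstractmethod
    def get(self, key: str) -> Optional[torch.Tensor]:
        """Return the chunk if present, else None."""

    @abstractmethod
    def evict(self, key: str) -> None:
        """Drop a chunk when the backend's capacity policy demands it."""


class LocalSSDStorage(L3KVStorage):
    """Demo backend: one file per chunk under a local directory."""

    def __init__(self, root: str = "/tmp/sglang_l3"):
        os.makedirs(root, exist_ok=True)
        self.root = root

    def _path(self, key: str) -> str:
        return os.path.join(self.root, f"{key}.pt")

    def put(self, key: str, kv_chunk: torch.Tensor) -> None:
        torch.save(kv_chunk.cpu(), self._path(key))

    def get(self, key: str) -> Optional[torch.Tensor]:
        path = self._path(key)
        return torch.load(path) if os.path.exists(path) else None

    def evict(self, key: str) -> None:
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            pass
```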
This is in the works, @wangyibin-gh!
When do you expect this feature to be merged? And by the way, is there any documentation about it, especially w.r.t. the APIs?
