
[Feature]: Implement Cache-Aware Routing for `dstack` Services

Open silentlustre opened this issue 9 months ago • 3 comments

Problem

Hi dstack team and community,

First off, thanks for creating dstack! It's a fantastic tool that has really simplified our inference infrastructure.

We use dstack for our LLM inference workloads. One significant optimization for LLMs is KV cache reuse (prefix caching): computation for a request's initial tokens (the prefill) can be skipped when the prefix matches a previous request, significantly improving performance. See the links below and the toy sketch after them:

  • sglang: https://lmsys.org/blog/2024-01-17-sglang/
  • vllm: https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html
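To make the idea concrete, here is a toy Python sketch (not vLLM or SGLang code; the token values are made up) of what a prefix-cache hit buys you: only the tokens beyond the shared prefix need prefill.

```python
# Toy illustration: with prefix caching, a replica only needs to prefill
# the tokens that extend a prefix it has already computed.

def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

cached = [101, 7, 7, 42, 13, 99]       # tokens a replica has already prefilled
request = [101, 7, 7, 42, 55, 60, 61]  # new request sharing a 4-token prefix

hit = shared_prefix_len(cached, request)
print(f"prefill only {len(request) - hit} of {len(request)} tokens")  # -> 3 of 7
```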

Currently, dstack's service load balancing likely uses standard strategies (e.g., round-robin). While great for general load distribution, these policies are cache-agnostic. A request is often sent to an instance that doesn't have the relevant prefix cached, even if another instance does. This leads to redundant prefill computation, resulting in higher Time To First Token (TTFT) and lower overall throughput.
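A toy sketch of that failure mode (hypothetical replica names, not dstack's actual balancer): requests sharing one long prefix get scattered, so every replica pays the prefill cost once.

```python
# Toy sketch: round-robin scatters requests that share a prefix across
# replicas, so each replica prefills the same prefix from scratch.
from itertools import cycle

replicas = cycle(["replica-0", "replica-1", "replica-2"])
shared = "<long system prompt> "

for user in ["user A", "user B", "user C"]:
    # Same long prefix, three different caches -> three redundant prefills.
    print(shared + user, "->", next(replicas))
```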

Proposal: Add Cache-Aware Routing to dstack Services

Inspired by recent work in systems like SGLang, Nvidia Dynamo, and KubeAI, I propose adding support for a cache-aware routing policy to help better meet SLOs for LLM inference. The goal is to route incoming inference requests to the specific inference instance that is most likely to have the request's prefix already in its KV cache.
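To seed the discussion, here is a minimal sketch of one possible policy, loosely following the SGLang router idea: keep an approximate view of which prefixes each replica has served, route to the longest match, and fall back to least-loaded when no useful prefix exists. All names are hypothetical; this does not reflect dstack's actual gateway internals.

```python
# Minimal sketch (hypothetical API, not dstack code) of cache-aware routing:
# approximate per-replica prefix tracking + longest-match routing with a
# least-loaded fallback.
from collections import defaultdict


class CacheAwareRouter:
    def __init__(self, replicas: list[str], min_match: int = 32):
        self.replicas = replicas
        self.min_match = min_match  # chars below which a match isn't worth chasing
        self.prefixes: dict[str, list[str]] = defaultdict(list)  # replica -> recent prompts
        self.load: dict[str, int] = {r: 0 for r in replicas}     # in-flight requests

    def _match_len(self, replica: str, prompt: str) -> int:
        """Longest shared prefix between `prompt` and anything this replica has served."""
        best = 0
        for seen in self.prefixes[replica]:
            n = 0
            for a, b in zip(seen, prompt):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def route(self, prompt: str) -> str:
        # Prefer the replica likely to have cached the longest prefix...
        best = max(self.replicas, key=lambda r: self._match_len(r, prompt))
        if self._match_len(best, prompt) < self.min_match:
            # ...but fall back to plain least-loaded when no useful prefix exists.
            best = min(self.replicas, key=lambda r: self.load[r])
        self.prefixes[best].append(prompt)  # a real router would use a radix tree + eviction
        self.load[best] += 1
        return best
```

A production version would track token-level prefixes in a radix tree with eviction to mirror each replica's cache state, and would bound per-replica load (as in KubeAI's CHWBL) so a hot prefix doesn't overload a single replica.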

Benefits

  • Improved Inference Performance: Significantly lower latency (especially TTFT) and higher throughput, particularly for workloads with shared prefixes (e.g., conversational AI, RAG systems with common instructions/prompts).
  • Better Resource Utilization: Reduces redundant computation across instances, potentially allowing for serving higher loads with the same hardware or reducing costs.

Call for Discussion: I'm starting work on a private fork to add this capability to dstack and would love to contribute it upstream when ready!

Would this be a useful contribution to the community? If so, it'd be great to discuss the high-level implementation approach here to align efforts and ensure a smooth contribution process later on.

Thanks for considering my proposal :)

References

  • SGLang Router Design: https://docs.google.com/document/d/1cCqK3dh7ZR_rUPkcZT2cr0kLnAxv6_Sd-P1q37-3RNQ/edit?tab=t.0
  • KubeAI Load Balancing: https://www.kubeai.org/blog/2025/02/26/llm-load-balancing-at-scale-chwbl/
  • Nvidia Dynamo KV Cache Routing: https://github.com/ai-dynamo/dynamo/blob/main/docs/kv_cache_routing.md

Solution

No response

Workaround

No response

Would you like to help us implement this feature by sending a PR?

Yes

silentlustre avatar Mar 28 '25 14:03 silentlustre

Thank you @silentlustre for a great request. We're actually aware of this and very much interested in bringing it to dstack too. I wonder if you have any capacity to contribute; in that case, we'll be able to ship this even faster. We plan to discuss this more with the team and get back to you with our plans!

peterschmidt85 avatar Mar 28 '25 15:03 peterschmidt85

Really glad to hear this! I'll send over some proposals for implementation directions I have in mind. I'm sure some of them will be off the mark, but that's what collaboration is for :)

silentlustre avatar Mar 28 '25 17:03 silentlustre

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Jun 08 '25 02:06 github-actions[bot]