[Meta]: Native Inference
This issue tracks the roadmap for implementing native inference capabilities inside dstack. Today's LLM inference systems (SGLang, Dynamo, Grove, LLM-d, Ai-brix, SGLang OME) revolve around inference-native concepts: TTFT/ITL autoscaling, PD disaggregation, distributed KV cache, KV-cache transfer engines, router HA, and multi-node execution. For more details on how the different components constitute native inference, see Exploration of Inference Architecture.
The goal is to build these capabilities into dstack and provide a unified abstraction that works seamlessly across clouds, Kubernetes, and bare metal.
- [ ] Prefill–Decode Disaggregation: Prefill–decode disaggregation requires launching prefill and decode workers with different commands, ports, and configurations. Currently, a dstack service supports only one set of commands per replica, making it impossible to run heterogeneous replicas (prefill and decode) within a single service. PD can be implemented either with multiple service configs, each service representing prefill or decode, or with replica_groups. The recommended implementation is replica_groups (see the sketch after this list).
- [ ] Gateway HA: Currently, dstack does not provide HA for the Gateway service. This can be implemented by introducing multiple Gateways. The SGLang router already has a PR for router HA, so once it is merged the Gateway will be the only remaining single point of failure.
- [ ] Autoscaling: dstack currently supports only RPS-based autoscaling, but inference systems require autoscaling based on inference-native performance metrics. We need to extend the autoscaler to incorporate TTFT (Time to First Token) and ITL (Inter-Token Latency); see the sketch after this list. For design details see: link
- [ ] Multi-node Replicas: Currently, dstack does not support multi-node service replicas; each service replica is limited to a single node (one job per replica). To support multi-node service replicas, dstack services would need to support multiple jobs per replica (see the sketch after this list).
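
For PD disaggregation, a minimal sketch of how a heterogeneous service could look with replica_groups is shown below. Note that replica_groups is a proposed field and does not exist in dstack today, and the SGLang commands/flags are illustrative only:

```yaml
type: service
name: llama-pd
port: 8000

# Proposed (not yet implemented): heterogeneous replica groups within one service.
replica_groups:
  - name: prefill
    replicas: 2
    commands:
      # Illustrative SGLang prefill worker; flags may differ in a real setup.
      - python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill --port 8000
    resources:
      gpu: H100:1
  - name: decode
    replicas: 4
    commands:
      # Illustrative SGLang decode worker.
      - python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 8000
    resources:
      gpu: H100:1
```

The alternative is two separate service configs, one for prefill and one for decode, fronted by a PD-aware router, at the cost of managing and scaling two services by hand.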
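For TTFT/ITL autoscaling, one possible shape is to extend the existing scaling block. Today dstack only supports metric: rps; the ttft metric below is a proposed name, not an existing option:

```yaml
type: service
name: llama-svc
commands:
  - python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 8000
port: 8000
resources:
  gpu: H100:1
replicas: 1..8
scaling:
  metric: ttft   # proposed: scale on Time to First Token; only `rps` exists today
  target: 0.5    # proposed: target TTFT in seconds
```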
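For multi-node replicas, one possible direction is to reuse the nodes setting that dstack tasks already have. Applying nodes to a service (i.e., multiple jobs per replica) is a proposal, not current behavior, and the sketch assumes the node-related environment variables dstack exposes for multi-node tasks would carry over:

```yaml
type: service
name: llama-405b
nodes: 2   # proposed: each replica spans 2 nodes (2 jobs per replica)
commands:
  # Illustrative multi-node SGLang launch wired up via dstack's node env vars.
  - python -m sglang.launch_server --model-path meta-llama/Llama-3.1-405B-Instruct --tp 16 --nnodes 2 --node-rank $DSTACK_NODE_RANK --dist-init-addr $DSTACK_MASTER_NODE_IP:5000 --port 8000
port: 8000
resources:
  gpu: H100:8
replicas: 2
```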