[RFC]: Load-aware pattern-based routing policy with profile support
Summary
With access to the GPU profiles used by the GPU optimizer, we propose a new routing policy that uses per-pattern (input/output token) performance profiles to work together with the GPU optimizer and reduce SLO violations.
Motivation
Our experiment results show that both heterogeneous and homogeneous GPU clusters need a pattern-based routing policy to minimize SLO violations.
Proposed Change
With the GPU profile, we have:
- The GPU's maximum arrival rate (RPS) that still meets the defined SLO, for different models, numbers of input tokens, and numbers of output tokens. We define a pattern as P = (GPU, Model, Input tokens, Output tokens).
- (Optional) Associated performance metrics: E2E latency (E2E), TTFT, TPOT, and the SLO used by the profile. Using Little's law, we can derive the average number of pending requests that still abides by the SLO:
Pending requests(P) = RPS(P) * E2E(P)
For example, if an LLM instance can serve (input 256, output 16) requests at 4 RPS with an E2E latency of 0.5s, we expect 4 * 0.5 = 2 pending requests. More examples:
- 256/16: 4 RPS, 0.5s = 2 pending requests
- 4096/32: 1 RPS, 5s = 5 pending requests

Note that the profile indicates that adding more requests will break the stable state, cause pending requests to accumulate over time, and violate the SLO.
We define the capacity of a pattern (P) using the number of pending requests:
Capacity(P) = Pending requests(P)
We can then normalize the per-request load for each (GPU, model, # input tokens, # output tokens) combination as:
Load(P) = 1 / Capacity(P) = 1 / (RPS(P) * E2E(P))
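The capacity and load definitions above can be sketched as follows. This is an illustrative example only; the `Pattern` fields and profile numbers are taken from the 256/16 and 4096/32 examples above, and the names are ours, not an existing API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    gpu: str
    model: str
    input_tokens: int
    output_tokens: int
    rps: float   # max arrival rate under SLO, from the GPU profile
    e2e: float   # average E2E latency in seconds, from the GPU profile

def capacity(p: Pattern) -> float:
    # Little's law: pending requests = arrival rate * time in system
    return p.rps * p.e2e

def load(p: Pattern) -> float:
    # Normalized load contributed by a single in-flight request of this pattern
    return 1.0 / capacity(p)

short = Pattern("A100", "demo-model", 256, 16, rps=4.0, e2e=0.5)
long_ = Pattern("A100", "demo-model", 4096, 32, rps=1.0, e2e=5.0)

print(capacity(short), load(short))  # 2.0 0.5
print(capacity(long_), load(long_))  # 5.0 0.2
```

With this normalization, an instance is saturated when the sum of Load(P) over its in-flight requests reaches 1.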
The proposed routing policy has two components:
- Load-aware routing policy
- Pattern-based queue policy
For the routing policy, we propose changing from the current push-based strategy to a pull-based one. The push-based strategy simply pushes requests to a vLLM serving instance, queueing them at the instance side regardless of instance capacity. The pull-based strategy maintains a global queue and pulls from it only when an instance has spare capacity. With the GPU-profile-based capacity definition, we can implement pull-based routing in the gateway without changing the vLLM instances. We find a least-load routing strategy is sufficient.
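The least-load pull decision described above might look like the following sketch. The `Instance` class and `route` function are hypothetical illustrations under the normalized-load model, not actual gateway code: an instance only accepts a request when its accumulated load stays within 1, otherwise the request remains in the global queue.

```python
from typing import Optional

class Instance:
    """A serving instance tracked by its accumulated normalized load."""
    def __init__(self, name: str):
        self.name = name
        self.load = 0.0  # sum of Load(P) over in-flight requests

    def has_capacity(self, request_load: float) -> bool:
        return self.load + request_load <= 1.0

def route(instances: list[Instance], request_load: float) -> Optional[Instance]:
    # Least-load pull: among instances with spare capacity, pick the least
    # loaded; if none qualifies, the request stays in the global queue.
    candidates = [i for i in instances if i.has_capacity(request_load)]
    if not candidates:
        return None
    target = min(candidates, key=lambda i: i.load)
    target.load += request_load
    return target
```

Returning `None` is what distinguishes pull from push: a push-based router would assign the request anyway and let it queue at the instance.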
For the queue policy, we propose maintaining one request queue per profile pattern. Requests are enqueued based on their input tokens and a history-based prediction of their output tokens. Whenever a vLLM instance has spare capacity, we compare the SLO violation probabilities of requests across the queues and pull a request from the queue with the highest violation probability.
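One possible way to realize this selection, assuming "SLO violation probability" is approximated by how much of the SLO budget the head request would have consumed if served now (the scoring function here is our assumption, not specified by the proposal):

```python
import collections
import time

class PatternQueue:
    """A FIFO queue for one profile pattern, annotated with its SLO and E2E."""
    def __init__(self, slo: float, e2e: float):
        self.slo = slo    # SLO target for this pattern (seconds)
        self.e2e = e2e    # expected E2E latency from the profile (seconds)
        self.items = collections.deque()  # entries: (enqueue_time, request)

    def violation_score(self, now: float) -> float:
        if not self.items:
            return -1.0  # empty queues never win the comparison
        enqueue_time, _ = self.items[0]
        # Fraction of the SLO budget consumed if the head request starts now:
        # waiting so far plus the expected service time.
        return (now - enqueue_time + self.e2e) / self.slo

def pull_most_urgent(queues, now=None):
    now = time.monotonic() if now is None else now
    q = max(queues, key=lambda q: q.violation_score(now))
    if not q.items:
        return None
    return q.items.popleft()[1]
```

A short pattern with a tight SLO can thus jump ahead of a long pattern that has waited the same wall-clock time, since it has burned a larger fraction of its budget.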
Alternatives Considered
We developed a queue simulator for comparing alternative policies. The queue simulator supports multiple patterns (as producers) and multiple queues (as consumers) with different capacities. Concurrent requests within queue capacity can be processed in parallel, while extra concurrent requests will have to wait. We compare the average waiting time and total SLO violations across various alternative policies.
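The core of such a simulator can be condensed to a few lines; this is a simplified illustration of the behavior described above (parallel processing up to capacity, extra requests wait), not the actual simulator code.

```python
import heapq

def simulate(arrivals, service_time, capacity):
    """Return per-request waiting times for a single consumer.

    arrivals: sorted request arrival times; the consumer processes up to
    `capacity` requests in parallel, each taking `service_time` seconds.
    """
    free_at = [0.0] * capacity          # when each parallel slot frees up
    heapq.heapify(free_at)
    waits = []
    for t in arrivals:
        slot = heapq.heappop(free_at)   # earliest-available slot
        start = max(t, slot)            # wait only if all slots are busy
        waits.append(start - t)
        heapq.heappush(free_at, start + service_time)
    return waits

# Three simultaneous arrivals on a capacity-2 consumer: the third waits.
print(simulate([0.0, 0.0, 0.0], 1.0, 2))  # [0.0, 0.0, 1.0]
```

Average waiting time and SLO violations then follow by comparing each request's wait plus service time against its pattern's SLO.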
Alternative routing policies:
- Random push
- Least load (push): Load(Consumer) can be > 1
- Least load with the global queue (pull): Load(Consumer) <= 1
- Binpack (pull)
Alternative queue policies:
- Simple queue
- SLO-aware pattern-based queue
Pattern setups:
- Ideal Bi-patterns:
- (arrival rate: 4.0, latency: 0.5s, consumer throughput: 4.1, SLO: 2s)
- (arrival rate: 4.0, latency: 5s, consumer throughput: 1.0, SLO: 20s)
- Multi-patterns from a real workload log
- (arrival rate: 0.642, latency: 6.7s, consumer throughput: 3.774, SLO: 20s)
- (arrival rate: 4.2, latency: 1.4s, consumer throughput: 8.089, SLO: 5s)
- (arrival rate: 0.535, latency: 7.3s, consumer throughput: 1.969, SLO: 20s)
- (arrival rate: 2.622, latency: 8.3s, consumer throughput: 0.999, SLO: 20s)
Results:
- Ideal Bi-patterns (single GPU)
- Waiting times: Binpack < Least load (Pull) < Least load (Push) < Random
- SLO violations: per-pattern queue < simple queue
- Ideal Bi-patterns (bi-GPU)
- Waiting times: Binpack = Least load (Pull) < Least load (Push) < Random
- SLO violations: per-pattern queue < simple queue
- Multi-patterns from a demo workload:
- Waiting times: Binpack = Least load (Pull) = Least load (Push) < Random
- SLO violations: per-pattern queue < simple queue