SLO-Driven Resource Management for vLLM
🚀 Feature Description and Motivation
Background
Requests vary widely in input/output length and therefore in resource requirements. When a batch of such requests is scheduled together, it is difficult to guarantee each individual user's SLA. To address this, we want to design a solution that provides strong SLO guarantees and manages resource tiers and cost models in an SLO-driven manner. This will require engine co-design effort.
Challenges
There are a few challenges around concepts that neither vLLM nor external systems support today:
- Should we use goodput as the primary metric, or rely on simpler single-dimension metrics such as TTFT or TPOT? (See the sketch below.)
- Should we define resource classes based on request profiles?
- How can we ensure fair allocation without underutilizing GPU resources?
- How do we map SLOs onto a token pricing model?
I will skip the proposal part for now and leave it open for public discussion.
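To make the first question more concrete, here is a minimal sketch, not an existing aibrix or vLLM API (the names `RequestMetrics`, `SLO`, and `goodput_tokens_per_s` are hypothetical), contrasting a goodput-style metric, where a request only counts if it meets both its TTFT and TPOT targets, with a single-dimension TTFT attainment metric:

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    ttft_ms: float       # time to first token for this request
    tpot_ms: float       # mean time per output token
    output_tokens: int   # number of tokens generated

@dataclass
class SLO:
    ttft_ms: float
    tpot_ms: float

def meets_slo(m: RequestMetrics, slo: SLO) -> bool:
    # Goodput-style: a request only counts if *all* latency targets are met.
    return m.ttft_ms <= slo.ttft_ms and m.tpot_ms <= slo.tpot_ms

def goodput_tokens_per_s(batch: list[RequestMetrics], slo: SLO, window_s: float) -> float:
    # Only tokens from SLO-compliant requests contribute to goodput.
    good_tokens = sum(m.output_tokens for m in batch if meets_slo(m, slo))
    return good_tokens / window_s

def ttft_attainment(batch: list[RequestMetrics], slo: SLO) -> float:
    # Single-dimension alternative: fraction of requests meeting TTFT only.
    return sum(m.ttft_ms <= slo.ttft_ms for m in batch) / len(batch)

if __name__ == "__main__":
    batch = [RequestMetrics(180, 45, 220), RequestMetrics(900, 30, 64)]
    slo = SLO(ttft_ms=500, tpot_ms=50)
    print(goodput_tokens_per_s(batch, slo, window_s=10.0))  # only the first request counts
    print(ttft_attainment(batch, slo))                      # 0.5
```

The trade-off is that goodput captures whether users actually receive SLO-compliant service, while single-dimension metrics are simpler to measure and reason about but can hide violations on the other dimension.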
Use Case
Support multi-tenant serving, where requests from different users with different SLAs share the same vLLM deployment.
Proposed Solution
No response