SLO-Driven Resource Management for vLLM
🚀 Feature Description and Motivation
Background
Requests vary widely in input/output length and therefore in resource requirements. When a batch of such requests is scheduled together, it is difficult to guarantee each individual user's SLA. To address this, we want to design a solution that provides strong SLO guarantees and manages resource tiers and cost models in an SLO-driven manner. This will require engine co-design effort.
Challenges
There are a few challenges around concepts that neither vLLM nor external systems support today:
- Should we use goodput as the primary metric, or rely on simpler single-dimension metrics such as TTFT or TPOT? (See the sketch below.)
- Should we define resource classes based on request profiles?
- How can we ensure fair allocation without underutilizing GPU resources?
- How do we map SLOs onto a token pricing model?
I will skip the proposal part for now and leave it open for public discussion.
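To make the first question more concrete, here is a minimal sketch, not an existing aibrix or vLLM API (the names `RequestMetrics`, `SLO`, and `goodput_tokens_per_s` are hypothetical), contrasting a goodput-style metric, where a request only counts if it meets both its TTFT and TPOT targets, with a single-dimension TTFT attainment metric:

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    ttft_ms: float       # time to first token for this request
    tpot_ms: float       # mean time per output token
    output_tokens: int   # number of tokens generated

@dataclass
class SLO:
    ttft_ms: float
    tpot_ms: float

def meets_slo(m: RequestMetrics, slo: SLO) -> bool:
    # Goodput-style: a request only counts if *all* latency targets are met.
    return m.ttft_ms <= slo.ttft_ms and m.tpot_ms <= slo.tpot_ms

def goodput_tokens_per_s(batch: list[RequestMetrics], slo: SLO, window_s: float) -> float:
    # Only tokens from SLO-compliant requests contribute to goodput.
    good_tokens = sum(m.output_tokens for m in batch if meets_slo(m, slo))
    return good_tokens / window_s

def ttft_attainment(batch: list[RequestMetrics], slo: SLO) -> float:
    # Single-dimension alternative: fraction of requests meeting TTFT only.
    return sum(m.ttft_ms <= slo.ttft_ms for m in batch) / len(batch)

if __name__ == "__main__":
    batch = [RequestMetrics(180, 45, 220), RequestMetrics(900, 30, 64)]
    slo = SLO(ttft_ms=500, tpot_ms=50)
    print(goodput_tokens_per_s(batch, slo, window_s=10.0))  # only the first request counts
    print(ttft_attainment(batch, slo))                      # 0.5
```

The trade-off is that goodput captures whether users actually receive SLO-compliant service, while single-dimension metrics are simpler to measure and reason about but can hide violations on the other dimension.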
Use Case
Support multi-tenant serving, where requests from different users with different SLAs share the same vLLM deployment.
Proposed Solution
No response