Add Fine-Grained Server-Level Routing in ModelRouter for Multi-Server Pods
🚀 Feature Description and Motivation
Currently, Aibrix's ModelRouter routes requests at the Pod level: requests are sent to a Pod as a whole rather than to an individual server or inference service inside it. In scenarios where multiple inference services (such as DP or TP domains) run within a single Pod, this approach does not offer the granularity needed to distribute load effectively and use resources efficiently.
Problem:
Pod-Level Routing: The ModelRouter routes requests only to the entire Pod, not to individual inference servers within it. This causes inefficiency when multiple services run within the same Pod (e.g., multiple GPUs or DP domains).
Multi-Server Pods: When a Pod contains multiple inference services (like DP domains or GPUs), routing requests to the entire Pod can lead to resource bottlenecks, inefficient load balancing, and suboptimal performance.
Lack of Granularity: There's no mechanism to route requests to specific servers within the Pod, resulting in potential overload of certain servers while others remain underutilized.
Use Case
In scenarios where a Pod contains two different DP domains or multiple GPUs, we want the ModelRouter to route requests to specific servers within the Pod:
Example Pod with two servers (DP domains):

```
Pod 1:
├── dp0: http://pod1:8000
├── dp1: http://pod1:8001
```

Currently, ModelRouter would route requests to Pod 1 as a whole, potentially leading to inefficiencies. With server-level routing, ModelRouter could route requests directly to either dp0 or dp1 based on factors like load and health.
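To make the example concrete, here is a minimal sketch of how such servers might be registered so the router can see them individually. It is written in Python for illustration only; the `InferenceServer`/`PodEntry` names and fields are hypothetical and not part of Aibrix's current API.

```python
from dataclasses import dataclass, field

@dataclass
class InferenceServer:
    """One api-server (e.g. a DP domain) running inside a Pod."""
    name: str                  # e.g. "dp0"
    url: str                   # e.g. "http://pod1:8000"
    healthy: bool = True       # result of a server-level health check
    inflight_requests: int = 0 # simple load signal for this sketch

@dataclass
class PodEntry:
    """A Pod exposing one or more inference servers on different ports."""
    pod_name: str
    servers: list = field(default_factory=list)

# Registry for the example above: one Pod, two DP domains.
registry = [
    PodEntry(
        pod_name="pod1",
        servers=[
            InferenceServer(name="dp0", url="http://pod1:8000"),
            InferenceServer(name="dp1", url="http://pod1:8001"),
        ],
    )
]
```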
I would like to request the implementation of server-level routing in Aibrix's ModelRouter to address the limitations described above.
Proposed Solution
I propose enhancing Aibrix's ModelRouter to support server-level routing; a minimal sketch follows the list below. The key ideas are to:
Track and register each inference server (e.g., dp0, dp1, etc.) within a Pod separately, enabling granular routing decisions.
Route requests to specific servers (rather than Pods) based on load, health, and other relevant factors.
Introduce health checks and failure management mechanisms at the server level to enhance fault tolerance and reliability.
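Under those assumptions, a routing decision might look like the sketch below. It builds on the hypothetical registry from the earlier sketch and uses a simple least-in-flight-requests policy over healthy servers as a stand-in for whatever load and health signals Aibrix actually exposes.

```python
def select_server(registry):
    """Pick a specific server across all Pods rather than a Pod as a whole.

    Sketch only: selects the healthy server with the fewest in-flight
    requests. Reuses the hypothetical InferenceServer/PodEntry registry
    from the earlier sketch.
    """
    candidates = [srv for pod in registry for srv in pod.servers if srv.healthy]
    if not candidates:
        raise RuntimeError("no healthy inference server available")
    return min(candidates, key=lambda srv: srv.inflight_requests)

# Usage: forward the next request to the selected server's URL,
# e.g. http://pod1:8000 (dp0) or http://pod1:8001 (dp1), not just "pod1".
target = select_server(registry)
print(f"routing request to {target.name} at {target.url}")
```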
one vllm for multi-model or different vllm processes?
If we want true server-level routing, we should run multiple vLLM processes inside one Pod. A single vLLM instance managing multiple models cannot expose routing granularity within the Pod, so routing will remain at the Pod level.
I think that, given vLLM's current sleep mode, process-based scheduling has practical, deployable use cases, and I am currently investigating this. The idea is to start two processes (corresponding to two models) in one Pod, one sleeping and one serving, and have the routing policy wake/sleep one of the models to switch between them, increasing GPU utilization (running more models on the same number of GPUs).
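For reference, a rough sketch of that wake/sleep switch follows. It assumes both vLLM processes were started with sleep mode enabled and expose the /sleep and /wake_up dev endpoints; the exact paths, the level parameter, and the URLs are assumptions here, and this switching is not something Aibrix's router does today.

```python
import requests

# Two co-located vLLM processes in one Pod; names/URLs are illustrative.
SERVING = "http://pod1:8000"   # model currently serving traffic
SLEEPING = "http://pod1:8001"  # model currently put to sleep

def switch_models():
    # Put the active model to sleep, then wake the other one, so both
    # models can share the same GPUs. The /sleep and /wake_up paths and
    # the "level" parameter follow vLLM's sleep-mode dev endpoints and
    # are assumptions, not existing Aibrix routing behavior.
    requests.post(f"{SERVING}/sleep", params={"level": 1}, timeout=30)
    requests.post(f"{SLEEPING}/wake_up", timeout=30)

switch_models()
```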
Sorry, my earlier description may have been misleading. My scenario is distributed DP: each DP domain starts its own api-server (for example, with 8 GPUs and dp=2 there are 2 DP domains, i.e., one group of four GPUs each). AIbrix's current PD-disaggregation routing algorithm mainly routes to Pods, whereas in my scenario the routing algorithm should select an api-server rather than a Pod. I am currently experimenting with this using aibrix.