[RFC]: Support for Multi-Tenant Model Deployments and Tenant-Aware Routing in AIBrix
🚀 Feature Description and Motivation
I'm proposing multi-tenant model deployment and routing support in AIBrix because the current model-centric design assumes shared deployments, which limits its applicability in SaaS platforms. Today there is no runtime isolation or routing separation between tenants that serve the same base model; this feature would give AIBrix users enterprise-grade multi-tenancy with strict resource and access boundaries while maintaining the efficiency of shared infrastructure.
Use Case
In my enterprise deployment, I often need to serve the same LLM (e.g., LLaMA 3.1 70B) to multiple customers, each with their own scaling, caching, and SLA requirements. These tenants must never share runtime pods, KV cache memory, or autoscaling logic.
This feature would allow me to deploy the same base model across tenants in isolated pods, while still using shared gateways and control planes, ensuring predictable performance, observability, and cost efficiency.
Proposed Solution
One possible approach could be to extend the routing and pod orchestration logic to include tenant context as part of the model key (tenant_id:model_id). Two complementary routing strategies are proposed:
Option 1: Hierarchical Routing Layer
- Introduce a two-level cache keyed by tenant → model
- Keep current model-level routing logic intact
- Maintain clarity and separation of routing concerns

Option 2: Dynamic Label-Based Routing
- Use composite keys formed from tenant and model labels
- Match requests via dynamically structured HTTPRoutes
- More flexible and future-proof
This would integrate with AIBrix's existing Envoy ext_proc-based gateway (gateway.go) by modifying HandleRequestBody and selectTargetPod to compute routing decisions based on the extended label set, and updating the internal routing cache accordingly.
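To make Option 1 concrete, here is a minimal sketch of a two-level routing cache keyed by tenant and then model. The value type and method names are placeholders rather than AIBrix's actual cache structs, and an empty tenant ID maps to today's model-only behavior:

```go
package cache

import "sync"

// TenantModelCache sketches Option 1: a two-level map keyed by tenant, then
// model. Untenanted requests use the reserved "" tenant bucket, so the
// existing model-level routing behavior is preserved.
type TenantModelCache struct {
	mu   sync.RWMutex
	pods map[string]map[string][]string // tenant -> model -> pod names
}

func NewTenantModelCache() *TenantModelCache {
	return &TenantModelCache{pods: make(map[string]map[string][]string)}
}

// AddPod registers a pod under a tenant/model pair.
func (c *TenantModelCache) AddPod(tenant, model, pod string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.pods[tenant] == nil {
		c.pods[tenant] = make(map[string][]string)
	}
	c.pods[tenant][model] = append(c.pods[tenant][model], pod)
}

// GetPods returns the candidate pods for a tenant/model pair. A request
// without a tenant ID (tenant == "") resolves exactly like today's
// model-only lookup.
func (c *TenantModelCache) GetPods(tenant, model string) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.pods[tenant][model]
}
```

The model-level logic stays untouched; the tenant level is just an extra lookup placed in front of it.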
@ModiIntel I will ping you on Slack and schedule some time to discuss the different alternatives.
Another option is to split the deployment identifier from the model name. cc https://github.com/vllm-project/aibrix/issues/1086
- Each tenant can deploy the same model with a different deployment identifier, passed as a request header.
- If the deployment identifier is not present, the current workflow will be followed (see the sketch below).
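A minimal sketch of that fallback, assuming the identifier arrives in a hypothetical `deployment-id` request header (the real header name is not finalized):

```go
package gateway

import "net/http"

// resolveRoutingKey picks the key used for pod selection. With a deployment
// identifier present, tenants get distinct keys for the same model; without
// one, routing behaves exactly as it does today.
func resolveRoutingKey(headers http.Header, model string) string {
	if id := headers.Get("deployment-id"); id != "" { // hypothetical header name
		return id + ":" + model
	}
	return model // current workflow: model name only
}
```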
Another feature to add is generating a unique token for each user + model deployment, to be passed with each request for authentication purposes.
- We can discuss whether this feature is required for multi-tenancy right now.
Had a discussion with @ModiIntel; the deployment/tenant identifier raised in https://github.com/vllm-project/aibrix/issues/1086 can help address the multi-tenancy use case as well. @ModiIntel will help drive the design and implementation.
From the end-user perspective, a request will look something like this (not finalized):
```bash
curl -v http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key-1234567890" \
  -H "tenant-id: llama2-7b-team1" \
  -d '{
    "model": "llama2-7b",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.7
  }'
```
@ModiIntel will share the implementation design doc for broader reach.
@varungup90 Multi-Tenant Routing in AIBrix
I've been thinking about how we can best implement multi-tenancy in AIBrix, and I'd like to get your thoughts on two possible approaches, with a particular focus on performance isolation.
Option 1: Complete Namespace Isolation
Think of this as giving each tenant their own private space in the cluster. Every tenant gets a dedicated gateway server running in their namespace that only watches and manages their own models and deployments. It's like having separate apartments in a building, where each tenant has their own entrance, utilities, and complete privacy from others - ensuring that one tenant's heavy usage can't impact the performance of others' workloads.
The implementation would require us to modify the current informer setup to be namespace-aware, deploy separate gateway instances per namespace, and ensure all the routing components (HTTPRoutes, policies) stay within their designated namespace boundaries. This gives us true isolation but at the cost of running multiple gateway instances.
This approach particularly shines when tenants need strict isolation and independent scaling. However, it does mean we'll be using more cluster resources since each namespace needs its own gateway infrastructure.
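For reference, client-go already supports scoping an informer factory to a single namespace, which is the main change this option needs on the informer side. A rough sketch, with the actual AIBrix informer wiring omitted:

```go
package gateway

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newTenantInformerFactory builds an informer factory that only watches the
// given tenant's namespace, so a per-namespace gateway never sees other
// tenants' Deployments or Services.
func newTenantInformerFactory(cfg *rest.Config, tenantNamespace string) (informers.SharedInformerFactory, error) {
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset,
		10*time.Minute, // resync period; value is illustrative
		informers.WithNamespace(tenantNamespace),
	)
	// Example: a namespace-scoped Deployment informer for this tenant only.
	_ = factory.Apps().V1().Deployments().Informer()
	return factory, nil
}
```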
Option 2: Enhanced Shared Gateway with Tenant Context
This is more like a well-organized shared office space. We keep our current single gateway architecture but enhance it to understand and respect tenant boundaries. The gateway becomes tenant-aware, maintaining separate logical spaces within a shared physical infrastructure. However, since all tenants share the same gateway server and cache layer, a tenant with high request volume or frequent pod updates could potentially impact the routing performance for other tenants' requests.
We'd need to modify our existing cache layer to track tenant context alongside the current model and pod information. The routing logic would be enhanced to consider tenant ownership when making decisions, but all through a single, efficient gateway instance.
This approach is more resource-efficient and easier to maintain since there's only one gateway to manage. However, we need to be more careful with the implementation to ensure proper isolation within the shared infrastructure.
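A rough sketch of what tracking tenant context alongside the current model and pod information could look like, assuming a flat composite key in the shared cache (the types are illustrative, not the gateway's real cache structs):

```go
package cache

import "sync"

// tenantModelKey is the composite key the shared gateway would use so that
// two tenants can register the same model name without colliding.
type tenantModelKey struct {
	Tenant string // empty string means "no tenant", i.e. today's behavior
	Model  string
}

// SharedCache tracks pods per tenant/model pair behind a single gateway.
type SharedCache struct {
	mu   sync.RWMutex
	pods map[tenantModelKey][]string
}

func NewSharedCache() *SharedCache {
	return &SharedCache{pods: make(map[tenantModelKey][]string)}
}

func (c *SharedCache) AddPod(tenant, model, pod string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := tenantModelKey{Tenant: tenant, Model: model}
	c.pods[k] = append(c.pods[k], pod)
}

// GetPods never returns pods owned by a different tenant, even if the model
// name matches, which is the isolation property this option relies on.
func (c *SharedCache) GetPods(tenant, model string) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.pods[tenantModelKey{Tenant: tenant, Model: model}]
}
```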
My Take
I believe implementing true tenant isolation through namespace separation (Option 1) would be the more robust approach in the long term, especially for enterprise deployments where isolation guarantees are critical.
However, I understand that starting with the simpler shared gateway design might be more practical for initial implementation and testing. Would you prefer we start with the simpler shared gateway approach and evolve towards namespace isolation as needed?
I've added a more detailed design document that outlines the shared gateway approach, including implementation considerations.
cc @varungup90 @Jeffwan - Would really appreciate your thoughts on this, particularly regarding the gateway architecture implications.
Thanks!
In proposed approaches 1 and 2, isolation exists only at the gateway component, while the actual GPU resources are shared. The gateway is horizontally scalable, so I do not see value in installing separate gateway instances under different namespaces in the same k8s cluster. Right now, if users desire, they can always create multiple Kubernetes clusters (each running one instance of the gateway control plane).
The use case for multi-tenancy isolation is that we can support the same model deployment for multiple tenants using one gateway control plane; right now the gateway requires model deployments to be unique. In the future, we can extend multi-tenancy to include user authentication to ensure users can only run inference requests against their own models (since all models share the same gateway).
For implementation, I am thinking of adding a tenant label identifier in the Service and Deployment specs, which can be used to create a unique HTTPRoute to identify it. Secondly, this label identifier is required when the user sends a request, so the gateway can identify the model routing.
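As a rough illustration of that idea using the Gateway API Go types, the sketch below builds one HTTPRoute per tenant/model pair and matches on a `tenant-id` header; the label keys, header name, backend Service name, and port are assumptions rather than the final spec:

```go
package gateway

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

func ptrTo[T any](v T) *T { return &v }

// buildTenantRoute creates one HTTPRoute per tenant/model pair. The route only
// matches requests carrying the tenant's identifier header, so two tenants can
// serve the same model name behind the same gateway without colliding.
func buildTenantRoute(tenant, model, namespace string) *gatewayv1.HTTPRoute {
	return &gatewayv1.HTTPRoute{
		ObjectMeta: metav1.ObjectMeta{
			Name:      tenant + "-" + model, // unique per tenant/model pair
			Namespace: namespace,
			Labels: map[string]string{
				"model.aibrix.ai/name": model,  // existing model label (assumed)
				"tenant.aibrix.ai/id":  tenant, // proposed tenant label (hypothetical)
			},
		},
		Spec: gatewayv1.HTTPRouteSpec{
			Rules: []gatewayv1.HTTPRouteRule{{
				Matches: []gatewayv1.HTTPRouteMatch{{
					Headers: []gatewayv1.HTTPHeaderMatch{{
						Type:  ptrTo(gatewayv1.HeaderMatchExact),
						Name:  gatewayv1.HTTPHeaderName("tenant-id"),
						Value: tenant,
					}},
				}},
				BackendRefs: []gatewayv1.HTTPBackendRef{{
					BackendRef: gatewayv1.BackendRef{
						BackendObjectReference: gatewayv1.BackendObjectReference{
							Name: gatewayv1.ObjectName(tenant + "-" + model), // illustrative Service name
							Port: ptrTo(gatewayv1.PortNumber(8000)),          // illustrative port
						},
					},
				}},
			}},
		},
	}
}
```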
Thank you for your insights on multi-tenant model deployments. I wanted to confirm that the design document I shared earlier already outlines an approach that aligns with your suggestion.
The design uses composite keys that combine tenant identifiers with model names, allowing the same model to be deployed for multiple tenants while maintaining proper isolation. This approach supports:
- Unique model deployments per tenant through tenant identifiers in request headers
- Backward compatibility by falling back to the current behavior when no tenant ID is specified
- Proper routing and isolation at the model deployment level
The composite key approach (tenant_id/model_id) provides the foundation we need without requiring separate gateway instances per tenant, while still enabling proper resource isolation.
Rather than debating the scalability aspects, I think we're aligned on the core approach of using tenant identifiers to enable multiple deployments of the same model for different tenants.
Does this address your thoughts on the implementation approach? I'd appreciate any additional feedback on the design document as we move forward with the implementation.
cc: @Jeffwan @varungup90
@ModiCodeCraftsman I’ve reviewed the doc and overall it looks good. Just a few suggestions to ensure full compatibility:
- Tenant ID should be optional. The key builder and related logic should continue to function as they currently do, even if tenant-id is not provided.
- Consider extracting reusable designs. Some components, such as the header-to-label mapping, could be valuable as standalone PRs for better modularity and reuse.
- Cover non-HTTPRoute workflows. It would be helpful to include guidance for scenarios where users adopt custom routing mechanisms, as these aren't currently addressed in the document.
/cc @varungup90