v0.4.0 roadmap
🚀 Feature Description and Motivation
We’re actively evolving AIBrix to support more advanced and production-ready LLM serving capabilities. For v0.4.0 and beyond, our roadmap includes:
- Prefill & Decode Disaggregation: Enable architectural support for separating prefill and decode stages across devices or nodes to maximize throughput and resource utilization (an engine-level sketch follows this list).
- KVCache Offloading Framework Evolution: Extend support to the vLLM v1 architecture, which introduces layer-by-layer pipelined KV transmission—enabling lower latency and better parallelism compared to the v0 design.
- Multi-Tenancy & Isolation: Introduce tenancy as a first-class concept in AIBrix—supporting per-tenant model isolation, request segregation, and fine-grained SLO control for production use cases.
- (Optional) Batch Inference & Request Collocation: Optimize request routing across heterogeneous GPU types to improve cost-efficiency, particularly under mixed workload conditions.
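As a concrete reference for the first item, the sketch below shows the kind of engine-level wiring that prefill/decode disaggregation builds on, expressed as Kubernetes container args. It follows vLLM's existing (v0) disaggregated-prefill example: one instance acts as the KV producer (prefill) and one as the KV consumer (decode). The connector choice and flag values are illustrative only and may differ in the v1 design this roadmap item targets.

```yaml
# Sketch only: two vLLM instances split into prefill (KV producer) and decode
# (KV consumer), following vLLM's v0 disaggregated-prefill example. The
# PyNcclConnector choice and flag values are illustrative and may change in v1.
containers:
  - name: vllm-prefill
    args:
      - >-
        python3 -m vllm.entrypoints.openai.api_server --port 8000
        --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
        --kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'
  - name: vllm-decode
    args:
      - >-
        python3 -m vllm.entrypoints.openai.api_server --port 8001
        --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
        --kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'
```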
Stay tuned for our upcoming v0.4.0 roadmap update! If you're interested in contributing new features or helping shape the direction of AIBrix, we’d love to hear from you.
- HTTPRoute should be updated when the Deployment or RayClusterFleet `modelIdentifier` label changes (see https://github.com/vllm-project/aibrix/blob/main/pkg/controller/modelrouter/modelrouter_controller.go#L87). If the routing identifier were split out from the model deployment label, the `modelIdentifier` label might not need to change at all. A minimal HTTPRoute sketch follows the example spec below.
- When a RayClusterFleet's replicas are scaled to 0, the underlying RayCluster is sometimes not reduced to 0; this also needs to be fixed.
- For heterogeneous GPU inference, we should add a mechanism for implementing heterogeneity through Ray worker groups, for example:
```yaml
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: RayClusterFleet
metadata:
annotations:
prometheus.io/custom: "true"
prometheus.io/path: /metrics
prometheus.io/port: "8000"
prometheus.io/scrape: "true"
creationTimestamp: "2025-05-19T06:04:46Z"
generation: 1
labels:
app: mix-ray-cards-review-v3
global/index: 682aa920d5cdaaab68ccae82
k8s.io/priority: P3
k8s.io/product.type: perception
k8s.io/trace.env: test
model.aibrix.ai/name: mix-ray-cards-review-v3-uuzj6l0b
name: review-v3-hg-sc-682aa920d5cdaaab6
spec:
replicas: 1
selector:
matchLabels:
model.aibrix.ai/name: mix-ray-cards-review-v3-uuzj6l0b
strategy:
rollingUpdate:
maxSurge: 2
maxUnavailable: 1
type: RollingUpdate
template:
metadata:
annotations:
prometheus.io/custom: "true"
prometheus.io/path: /metrics
prometheus.io/port: "8000"
prometheus.io/scrape: "true"
ray.io/overwrite-container-cmd: "true"
labels:
app: mix-ray-cards-review-v3
global/index: 682aa920d5cdaaab68ccae82
k8s.io/priority: P3
model.aibrix.ai/name: mix-ray-cards-review-v3-uuzj6l0b
name: review-v3-hg-sc-682aa920d5cdaaab6
spec:
autoscalerOptions:
idleTimeoutSeconds: 60
imagePullPolicy: IfNotPresent
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 500m
memory: 512Mi
upscalingMode: Conservative
enableInTreeAutoscaling: true
headGroupSpec:
rayStartParams:
block: "false"
dashboard-host: 0.0.0.0
template:
metadata:
annotations:
....
labels:
....
model.aibrix.ai/name: mix-ray-cards-review-v3-uuzj6l0b
spec:
containers:
- args:
- ulimit -n 65536;echo head;$KUBERAY_GEN_RAY_START_CMD;python3 -m vllm.entrypoints.openai.api_server
--port 8000 --model /models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B --tensor-parallel-size
1 --pipeline-parallel-size 4 --gpu-memory-utilization 0.95 --max_model_len
4096 --served-model-name mix-ray-cards-review-v3-uuzj6l0b --uvicorn-log-level
warning --trust-remote-code;
command:
- /bin/bash
- -c
- --
env:
- name: HBOX_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: RANK
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
image: vllm-openai:v0.8.4-ds
imagePullPolicy: IfNotPresent
name: ray-head
ports:
- containerPort: 6379
name: gcs-server
protocol: TCP
- containerPort: 8265
name: dashboard
protocol: TCP
- containerPort: 10001
name: client
protocol: TCP
- containerPort: 8000
name: service
protocol: TCP
resources:
limits:
cpu: "11"
memory: 120Gi
nvidia.com/l20: "1"
requests:
cpu: "11"
memory: 120Gi
nvidia.com/l20: "1"
volumeMounts:
....
- args:
- |-
until curl --max-time 5 --fail http://127.0.0.1:8000 > /dev/null 2>&1; do
echo "[WAITING] model is not ready yet...";
sleep 5;
done &&
aibrix_runtime --port 8080
command:
- /bin/bash
- -lc
- --
env:
- name: INFERENCE_ENGINE
value: vllm
- name: INFERENCE_ENGINE_ENDPOINT
value: http://localhost:8000
- name: PYTORCH_CUDA_ALLOC_CONF
value: expandable_segments:True
image: aibrix-runtime:v0.2.1
name: aibrix-runtime
ports:
- containerPort: 8080
protocol: TCP
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
resources:
limits:
cpu: "1"
memory: 1Gi
requests:
cpu: "1"
memory: 1Gi
enableServiceLinks: false
imagePullSecrets:
....
schedulerName: volcano
tolerations:
...
volumes:
...
rayVersion: 2.40.0
workerGroupSpecs:
- groupName: small-group
maxReplicas: 3
minReplicas: 0
numOfHosts: 1
rayStartParams: {}
replicas: 0
scaleStrategy: {}
template:
metadata:
annotations:
prometheus.io/custom: "true"
prometheus.io/path: /metrics
prometheus.io/port: "8000"
prometheus.io/scrape: "true"
ray.io/overwrite-container-cmd: "true"
labels:
app: mix-ray-cards-review-v3
global/index: 682aa920d5cdaaab68ccae82
model.aibrix.ai/name: mix-ray-cards-review-v3-uuzj6l0b
namespace: prdsafe
spec:
containers:
- args:
- ulimit -n 65536; echo worker; $KUBERAY_GEN_RAY_START_CMD;
command:
- /bin/bash
- -c
- --
env:
- name: HBOX_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: RANK
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
- name: HF_ENDPOINT
value: https://hf-mirror.com
image: vllm-openai:v0.8.4-ds
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- ray stop
name: ray-worker
resources:
limits:
cpu: "11"
memory: 120Gi
nvidia.com/l20: "1"
requests:
cpu: "11"
memory: 120Gi
nvidia.com/l20: "1"
volumeMounts:
....
enableServiceLinks: false
imagePullSecrets:
schedulerName: volcano
tolerations:
...
volumes:
...
- groupName: small-group-2
maxReplicas: 3
minReplicas: 1
numOfHosts: 1
rayStartParams: {}
replicas: 1
scaleStrategy: {}
template:
metadata:
annotations:
...
prometheus.io/path: /metrics
prometheus.io/port: "8000"
prometheus.io/scrape: "true"
labels:
app: mix-ray-cards-review-v3
global/index: 682aa920d5cdaaab68ccae82
model.aibrix.ai/name: mix-ray-cards-review-v3-uuzj6l0b
namespace: prdsafe
spec:
containers:
- args:
- ulimit -n 65536; echo worker; $KUBERAY_GEN_RAY_START_CMD;
command:
- /bin/bash
- -c
- --
env:
- name: NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: RANK
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
- name: HF_ENDPOINT
value: https://hf-mirror.com
image: vllm-openai:v0.8.4-ds
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- ray stop
name: ray-worker
resources:
limits:
cpu: "11"
memory: 116Gi
nvidia.com/4090: "1"
requests:
cpu: "11"
memory: 116Gi
nvidia.com/4090: "1"
volumeMounts:
...
enableServiceLinks: false
imagePullSecrets:
...
schedulerName: volcano
tolerations:
...
volumes:
...
```
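In the spec above, heterogeneity comes from the two workerGroupSpecs requesting different GPU types (`nvidia.com/l20` in `small-group`, `nvidia.com/4090` in `small-group-2`), so a heterogeneity mechanism would need to schedule and scale these groups as one logical pool for the model.

For the first bullet (HTTPRoute drift when the model label changes), the sketch below shows the shape of route the model router derives from the `model.aibrix.ai/name` label. The gateway reference (`aibrix-eg` in `aibrix-system`), the `model` header match, and the backend Service name are assumptions for illustration, not the controller's exact output; the point is that the match value and backend must be regenerated whenever the label changes.

```yaml
# Illustrative HTTPRoute only; the gateway name/namespace, header match, and
# backend Service name are assumptions, not the exact objects emitted by
# modelrouter_controller.go.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mix-ray-cards-review-v3-uuzj6l0b-router
  namespace: prdsafe
spec:
  parentRefs:
    - name: aibrix-eg              # assumed AIBrix gateway
      namespace: aibrix-system
  rules:
    - matches:
        - headers:
            - type: Exact
              name: model          # must track model.aibrix.ai/name on the fleet
              value: mix-ray-cards-review-v3-uuzj6l0b
      backendRefs:
        - name: mix-ray-cards-review-v3-uuzj6l0b   # assumed Service name
          port: 8000
```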
@ying2025 I really appreciate the feedback. We will add those issues to the release story. It would be great to link the existing issues if you have already created them. We can put everything under the v0.4.0 umbrella.
OK. Next, I will create and link them.
Should we integrate the AIBrix connector into the vLLM repository, like the LMCache connector?
@sydnash Apologies for missing this comment earlier. Yes, we do plan to integrate the AIBrix connector into the vLLM repository. We're aiming to make the integration as efficient as possible. While it's currently a mid-priority task, we intend to upstream the v1 version once it's complete. In the meantime, users can still achieve strong performance using the latest AIBrix builds, which we will continue to maintain and provide.
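For reference, selecting the upstreamed connector would presumably mirror how the LMCache connector is selected today via vLLM's `--kv-transfer-config` flag. The connector name below is hypothetical; only the flag shape follows the existing `LMCacheConnectorV1` usage, and it may differ once the v1 integration lands.

```yaml
# Sketch only: container args selecting a KV-transfer connector in vLLM.
# "AIBrixOffloadingConnector" is a hypothetical name for the upstreamed AIBrix
# connector; the flag shape mirrors how LMCacheConnectorV1 is selected today.
args:
  - >-
    python3 -m vllm.entrypoints.openai.api_server --port 8000
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --kv-transfer-config '{"kv_connector":"AIBrixOffloadingConnector","kv_role":"kv_both"}'
```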
Multi-Tenancy & Batch Inference will be postponed to the v0.5.0 release. The rest of the work has been delivered in v0.4.0, along with KV events subscription and multi-engine support. We will close this roadmap issue.