v0.4.0 roadmap

Jeffwan opened this issue 7 months ago

🚀 Feature Description and Motivation

We’re actively evolving AIBrix to support more advanced and production-ready LLM serving capabilities. For v0.4.0 and beyond, our roadmap includes:

  • Prefill & Decode Disaggregation: Enable architectural support for separating the prefill and decode stages across devices or nodes to maximize throughput and resource utilization (see the sketch after this list).
  • KVCache Offloading Framework Evolution: Extend support to the vLLM v1 architecture, which introduces layer-by-layer pipelined KV transmission—enabling lower latency and better parallelism compared to the v0 design.
  • Multi-Tenancy & Isolation: Introduce tenancy as a first-class concept in AIBrix—supporting per-tenant model isolation, request segregation, and fine-grained SLO control for production use cases.
  • (Optional) Batch Inference & Request Collocation: Optimize request routing across heterogeneous GPU types to improve cost-efficiency, particularly under mixed workload conditions.
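
To make the first item concrete, here is a minimal, purely hypothetical sketch of what disaggregation can look like at the Kubernetes level: the same model served by two independently scaled roles. The resource names, the role label, and the image are illustrative assumptions, not the actual AIBrix v0.4.0 API.

# Hypothetical sketch only -- not the AIBrix API. Prefill is compute-bound and
# decode is memory-bandwidth-bound, so each role gets its own replica count.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-8b-prefill              # hypothetical name
  labels:
    model.aibrix.ai/name: llama-8b
spec:
  replicas: 2                         # scaled on time-to-first-token
  selector:
    matchLabels:
      model.aibrix.ai/name: llama-8b
      role: prefill                   # hypothetical role label for the router
  template:
    metadata:
      labels:
        model.aibrix.ai/name: llama-8b
        role: prefill
    spec:
      containers:
      - name: vllm
        image: vllm-openai:v0.8.4-ds  # image borrowed from the manifest later in this thread
        args: ["--model", "/models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "--port", "8000"]
# A matching "llama-8b-decode" Deployment would mirror this with role: decode
# and its own replica count, scaled on inter-token latency instead.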

Stay tuned for our upcoming v0.4.0 roadmap update! If you're interested in contributing new features or helping shape the direction of AIBrix, we’d love to hear from you.

Use Case

N/A

Proposed Solution

No response

Jeffwan avatar May 18 '25 06:05 Jeffwan

  1. The HTTPRoute should be updated when the modelIdentifier label on a Deployment or RayClusterFleet changes (see https://github.com/vllm-project/aibrix/blob/main/pkg/controller/modelrouter/modelrouter_controller.go#L87). And if the deployment identifier were split from the model deployment, the modelIdentifier label might not need to change at all. (A sketch of such a route appears after the manifest below.)
  2. When a RayClusterFleet's replicas are scaled to 0, the underlying RayCluster is sometimes not reduced to 0; this also needs to be fixed.
  3. For heterogeneous GPU inference, we should add a mechanism for expressing heterogeneity through Ray worker groups, for example:
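# Example manifest (values taken from a real cluster): one RayClusterFleet mixing
# GPU types -- the head and "small-group" request nvidia.com/l20, while
# "small-group-2" requests nvidia.com/4090.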
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: RayClusterFleet
metadata:
  annotations:
    prometheus.io/custom: "true"
    prometheus.io/path: /metrics
    prometheus.io/port: "8000"
    prometheus.io/scrape: "true"
  creationTimestamp: "2025-05-19T06:04:46Z"
  generation: 1
  labels:
    app: mix-ray-cards-review-v3
    global/index: 682aa920d5cdaaab68ccae82
    k8s.io/priority: P3
    k8s.io/product.type: perception
    k8s.io/trace.env: test
    model.aibrix.ai/name: mix-ray-cards-review-v3-uuzj6l0b
  name: review-v3-hg-sc-682aa920d5cdaaab6
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: mix-ray-cards-review-v3-uuzj6l0b
  strategy:
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/custom: "true"
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
        ray.io/overwrite-container-cmd: "true"
      labels:
        app: mix-ray-cards-review-v3
        global/index: 682aa920d5cdaaab68ccae82
        k8s.io/priority: P3
        model.aibrix.ai/name: mix-ray-cards-review-v3-uuzj6l0b
      name: review-v3-hg-sc-682aa920d5cdaaab6
    spec:
      autoscalerOptions:
        idleTimeoutSeconds: 60
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 512Mi
        upscalingMode: Conservative
      enableInTreeAutoscaling: true
      headGroupSpec:
        rayStartParams:
          block: "false"
          dashboard-host: 0.0.0.0
        template:
          metadata:
            annotations:
              ....
            labels:
              ....
              model.aibrix.ai/name: mix-ray-cards-review-v3-uuzj6l0b
          spec:
            containers:
            - args:
              - ulimit -n 65536;echo head;$KUBERAY_GEN_RAY_START_CMD;python3 -m vllm.entrypoints.openai.api_server
                --port 8000 --model /models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B  --tensor-parallel-size
                1 --pipeline-parallel-size 4 --gpu-memory-utilization 0.95 --max_model_len
                4096 --served-model-name mix-ray-cards-review-v3-uuzj6l0b --uvicorn-log-level
                warning --trust-remote-code;
              command:
              - /bin/bash
              - -c
              - --
              env:
              - name: HBOX_NODE_NAME
                valueFrom:
                  fieldRef:
                    apiVersion: v1
                    fieldPath: spec.nodeName
              - name: RANK
                valueFrom:
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
              image: vllm-openai:v0.8.4-ds
              imagePullPolicy: IfNotPresent
              name: ray-head
              ports:
              - containerPort: 6379
                name: gcs-server
                protocol: TCP
              - containerPort: 8265
                name: dashboard
                protocol: TCP
              - containerPort: 10001
                name: client
                protocol: TCP
              - containerPort: 8000
                name: service
                protocol: TCP
              resources:
                limits:
                  cpu: "11"
                  memory: 120Gi
                  nvidia.com/l20: "1"
                requests:
                  cpu: "11"
                  memory: 120Gi
                  nvidia.com/l20: "1"
              volumeMounts:
              ....
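            # Sidecar: wait until the vLLM server answers on :8000, then start
            # the AIBrix runtime's management API on :8080.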
            - args:
              - |-
                until curl --max-time 5 --fail http://127.0.0.1:8000 > /dev/null 2>&1; do
                  echo "[WAITING] model is not ready yet...";
                  sleep 5;
                done &&
                aibrix_runtime --port 8080
              command:
              - /bin/bash
              - -lc
              - --
              env:
              - name: INFERENCE_ENGINE
                value: vllm
              - name: INFERENCE_ENGINE_ENDPOINT
                value: http://localhost:8000
              - name: PYTORCH_CUDA_ALLOC_CONF
                value: expandable_segments:True
              image: aibrix-runtime:v0.2.1
              name: aibrix-runtime
              ports:
              - containerPort: 8080
                protocol: TCP
              readinessProbe:
                httpGet:
                  path: /ready
                  port: 8080
                initialDelaySeconds: 5
                periodSeconds: 10
              resources:
                limits:
                  cpu: "1"
                  memory: 1Gi
                requests:
                  cpu: "1"
                  memory: 1Gi
            enableServiceLinks: false
            imagePullSecrets:
            ....
            schedulerName: volcano
            tolerations:
            ...
            volumes:
            ...
      rayVersion: 2.40.0
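      # The worker groups below differ mainly in the GPU type they request; with
      # enableInTreeAutoscaling, Ray scales each group within [minReplicas, maxReplicas].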
      workerGroupSpecs:
      - groupName: small-group
        maxReplicas: 3
        minReplicas: 0
        numOfHosts: 1
        rayStartParams: {}
        replicas: 0
        scaleStrategy: {}
        template:
          metadata:
            annotations:
              prometheus.io/custom: "true"
              prometheus.io/path: /metrics
              prometheus.io/port: "8000"
              prometheus.io/scrape: "true"
              ray.io/overwrite-container-cmd: "true"
            labels:
              app: mix-ray-cards-review-v3
              global/index: 682aa920d5cdaaab68ccae82
              model.aibrix.ai/name: mix-ray-cards-review-v3-uuzj6l0b
            namespace: prdsafe
          spec:
            containers:
            - args:
              - ulimit -n 65536; echo worker; $KUBERAY_GEN_RAY_START_CMD;
              command:
              - /bin/bash
              - -c
              - --
              env:
              - name: HBOX_NODE_NAME
                valueFrom:
                  fieldRef:
                    apiVersion: v1
                    fieldPath: spec.nodeName
              - name: RANK
                valueFrom:
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
              - name: HF_ENDPOINT
                value: https://hf-mirror.com
              image: vllm-openai:v0.8.4-ds
              imagePullPolicy: IfNotPresent
              lifecycle:
                preStop:
                  exec:
                    command:
                    - /bin/sh
                    - -c
                    - ray stop
              name: ray-worker
              resources:
                limits:
                  cpu: "11"
                  memory: 120Gi
                  nvidia.com/l20: "1"
                requests:
                  cpu: "11"
                  memory: 120Gi
                  nvidia.com/l20: "1"
              volumeMounts:
              ....
            enableServiceLinks: false
            imagePullSecrets:
            schedulerName: volcano
            tolerations:
            ...
            volumes:
            ...
      - groupName: small-group-2
        maxReplicas: 3
        minReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        replicas: 1
        scaleStrategy: {}
        template:
          metadata:
            annotations:
              ...
              prometheus.io/path: /metrics
              prometheus.io/port: "8000"
              prometheus.io/scrape: "true"
            labels:
              app: mix-ray-cards-review-v3
              global/index: 682aa920d5cdaaab68ccae82
              model.aibrix.ai/name: mix-ray-cards-review-v3-uuzj6l0b
            namespace: prdsafe
          spec:
            containers:
            - args:
              - ulimit -n 65536; echo worker; $KUBERAY_GEN_RAY_START_CMD;
              command:
              - /bin/bash
              - -c
              - --
              env:
              - name: NODE_NAME
                valueFrom:
                  fieldRef:
                    apiVersion: v1
                    fieldPath: spec.nodeName
              - name: RANK
                valueFrom:
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
              - name: HF_ENDPOINT
                value: https://hf-mirror.com
              image: vllm-openai:v0.8.4-ds
              imagePullPolicy: IfNotPresent
              lifecycle:
                preStop:
                  exec:
                    command:
                    - /bin/sh
                    - -c
                    - ray stop
              name: ray-worker
              resources:
                limits:
                  cpu: "11"
                  memory: 116Gi
                  nvidia.com/4090: "1"
                requests:
                  cpu: "11"
                  memory: 116Gi
                  nvidia.com/4090: "1"
              volumeMounts:
              ...
            enableServiceLinks: false
            imagePullSecrets:
            ...
            schedulerName: volcano
            tolerations:
            ...
            volumes:
            ...
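
On item 1 above: the modelrouter controller derives an HTTPRoute from the model.aibrix.ai/name label, so a label change must be reconciled into the route. Below is a minimal sketch of such a route. The gateway name, the header-based match, and the backend Service name/port are illustrative assumptions, not values read from the controller.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mix-ray-cards-review-v3-uuzj6l0b-router   # illustrative name
spec:
  parentRefs:
  - name: aibrix-eg                               # assumed gateway name
  rules:
  - matches:
    - headers:
      - type: Exact
        name: model                               # assumed routing header
        value: mix-ray-cards-review-v3-uuzj6l0b   # the model.aibrix.ai/name value
    backendRefs:
    - name: mix-ray-cards-review-v3-uuzj6l0b      # assumed Service name
      port: 8000

This makes the coupling concrete: if the label value changes, both the match value and typically the backend Service must be updated, which is exactly what item 1 asks the controller to handle.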

ying2025 avatar May 21 '25 10:05 ying2025

@ying2025 I really appreciate the feedback. We will add those issues to the release story. It would be great to link the existing issues if you have already created them. We can put everything under the v0.4.0 umbrella.

Jeffwan avatar May 21 '25 17:05 Jeffwan

Ok. I will create the links next.

ying2025 avatar May 22 '25 01:05 ying2025

Should we integrate the AIBrix connector into the vLLM repository, like the LMCache connector?

sydnash avatar Jun 04 '25 08:06 sydnash

@sydnash Apologies for missing this comment earlier. Yes, we do plan to integrate the AIBrix connector into the vLLM repository. We're aiming to make the integration as efficient as possible. While it's currently a mid-priority task, we intend to upstream the v1 version once it's complete. In the meantime, users can still achieve strong performance using the latest AIBrix builds, which we will continue to maintain and provide.
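
For readers unfamiliar with the LMCache precedent: vLLM selects an external KV connector through its --kv-transfer-config flag, so upstreaming the AIBrix connector would presumably mean shipping a connector selectable the same way. The invocation below shows the LMCache case as I understand vLLM's current flag; the eventual AIBrix connector name is not decided in this thread, so none is shown.

# Illustrative only: selecting a KV connector in vLLM today (LMCache shown);
# an upstreamed AIBrix connector would plug into the same hook.
python3 -m vllm.entrypoints.openai.api_server \
  --model /models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'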

Jeffwan avatar Jun 18 '25 09:06 Jeffwan

Multi-Tenancy & Batch Inference will be postponed to the v0.5.0 release. The rest of the work has been delivered in v0.4.0, along with KV events subscription and multi-engine support. We will close this roadmap issue.

Jeffwan avatar Aug 05 '25 03:08 Jeffwan