[router] Document supported APIs
🚀 Feature Description and Motivation
We should document the supported APIs. Besides this, I would like to ask whether embedding APIs are supported.
Use Case
N/A
Proposed Solution
No response
Do you mean the generation/embedding/tokenization APIs supported in vLLM (https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/openai)? The current gateway design is more like a proxy than an additional API layer. Technically, it supports any protocol the engine supports. The gateway plugin only validates model existence based on the registration information.
Currently, the gateway configuration doesn't set any restrictions. In the future, for stability considerations, this might change: https://github.com/vllm-project/aibrix/blob/6feec99d77c84e371da9c535054c2b8aa8912704/config/gateway/gateway.yaml
I agree that embedding or other API compatibility should be documented.
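To make the "proxy" behavior concrete, here is a minimal sketch of calling an OpenAI-style endpoint through the gateway. The local address, port-forward command, and model name are assumptions for illustration, not fixed defaults; the gateway only checks that the `model` field matches a registered deployment and forwards everything else to the engine.

```python
import requests

# Assumption: the AIBrix gateway has been exposed locally, e.g. via
#   kubectl -n envoy-gateway-system port-forward svc/<gateway-service> 8888:80
GATEWAY = "http://localhost:8888"

# The gateway proxies this request to whichever engine serves the model;
# the only gateway-side validation is that the model name is registered.
resp = requests.post(
    f"{GATEWAY}/v1/chat/completions",
    json={
        "model": "qwen25-15b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
)
print(resp.status_code, resp.json())
```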
> Do you mean the generation/embedding/tokenization APIs supported in vLLM (https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/openai)?

Yes. Since vLLM doesn't support the Batch API, it makes sense that AIBrix shouldn't mark it as supported either. As a user, I'm just curious to see in the docs which APIs are actually supported; that clarity would be super helpful! 😊
Got your point, that totally makes sense. I think it should support something similar to Kubernetes extension API services. The Batch API is a good example: currently, there doesn't seem to be a standardized engine implementation for it. If users implement it in a third-party manner, we should aggregate it at the gateway layer while allowing different services/components to provide it.
I’m currently working on implementing the batch API with support for object storage and local files in our production stack’s router. I’m not entirely sure yet, but I’m wondering if it’s possible to integrate this production stack router as a component of a vLLM deployment. If so, the gateway could potentially aggregate and collaborate with the production stack router to make this functionality work.
Gateway -> Router deployment -> vLLM deployment
Adding the router might introduce a bit of latency—somewhere around 1 to 10 milliseconds. But honestly, I think it’s kind of unavoidable if we’re planning to implement batching outside of vLLM. It’s just one of those trade-offs we’ll have to consider.
@gaocegege I see. Technically I think it's possible. The P&D (prefill/decode disaggregation) case requires such a router as well.
At the same time, AIBrix has a batch RFC as well (https://github.com/vllm-project/aibrix/issues/182), but due to limited resources we have not made enough progress. Compared to implementing the routing and Batch API layers together in the router, I am thinking that in AIBrix:
- we could have an extended server that only provides the Batch API service, request orchestration (congestion control, backpressure, etc.), and object management; it acts as the client and sends requests to the backend vLLM service.
- the gateway can add the necessary routing strategy support for batch requests (this also depends on how batching is implemented).
In this case, the flow would be
Gateway (Batch Async API) -> Batch API Service -> Gateway (mostly Sync API) -> vLLM deployment.
I think this is an alternative approach; a rough sketch of what such a Batch API service could look like is below.
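The following is a hypothetical sketch only, not an existing AIBrix component: it reads an OpenAI-style batch input file (one JSON request per line with `custom_id`, `url`, and `body`), applies a simple concurrency cap as the congestion-control/backpressure piece, and replays each line as a synchronous request back through the gateway. The gateway address and module names are assumptions.

```python
import asyncio
import json
import httpx

GATEWAY = "http://aibrix-gateway:8888"  # assumption: internal gateway address
MAX_CONCURRENCY = 8                     # naive backpressure knob

async def run_batch(batch_file: str, output_file: str) -> None:
    """Replay an OpenAI-style batch input file as sync requests via the gateway."""
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def one(line: str, client: httpx.AsyncClient) -> dict:
        req = json.loads(line)
        async with sem:  # congestion control: cap in-flight requests
            resp = await client.post(GATEWAY + req["url"], json=req["body"])
        return {"custom_id": req.get("custom_id"), "response": resp.json()}

    async with httpx.AsyncClient(timeout=300) as client:
        with open(batch_file) as f:
            results = await asyncio.gather(
                *(one(line, client) for line in f if line.strip())
            )
    with open(output_file, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")

# asyncio.run(run_batch("batch_input.jsonl", "batch_output.jsonl"))
```

Object storage upload/download and batch status tracking (the metadata mentioned below) would sit around this loop.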
The Batch API needs user management to support the List Batches operation, which means the gateway needs to access a metadata database.
I’m a bit unsure if it’s ideal for the gateway to handle business logic, but overall, LGTM
This task should be part of https://github.com/vllm-project/aibrix/issues/846. As the v0.3.0 release approaches, we should finish this task ASAP.
@OrdinaryCrazy any updates on the API compatibility and results comparison?
@Jeffwan @gaocegege @varungup90
I tested most of the OpenAI-compatible APIs and compared vLLM's output with AIBrix's output. For a summary, refer to this doc:
Some YAML files used in my tests:
- Base model Qwen2.5-1.5B-Instruct
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: qwen25-15b-instruct # Note: The label value `model.aibrix.ai/name` here must match with the service name.
    model.aibrix.ai/port: "8000"
  name: qwen25-15b-instruct
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: qwen25-15b-instruct
  template:
    metadata:
      labels:
        model.aibrix.ai/name: qwen25-15b-instruct
    spec:
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - Qwen/Qwen2.5-1.5B-Instruct
            - --dtype
            - half
            - --served-model-name
            # Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
            - qwen25-15b-instruct
          image: vllm/vllm-openai:v0.7.1
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: qwen25-15b-instruct
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: qwen25-15b-instruct # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: qwen25-15b-instruct
  type: ClusterIP
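As an illustration of how the output comparison can be done with this deployment, the same request can be sent once to the vLLM pod directly and once through the AIBrix gateway. This is a sketch; the two local ports are assumed `kubectl port-forward` targets, not fixed defaults.

```python
import requests

# Assumptions: the vLLM pod is port-forwarded to localhost:8000 and the
# AIBrix gateway to localhost:8888.
DIRECT = "http://localhost:8000"
GATEWAY = "http://localhost:8888"

payload = {
    "model": "qwen25-15b-instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0,   # deterministic-ish output makes the diff meaningful
    "max_tokens": 64,
}

for name, base in [("vLLM direct", DIRECT), ("AIBrix gateway", GATEWAY)]:
    resp = requests.post(f"{base}/v1/chat/completions", json=payload)
    print(name, resp.status_code, resp.json()["choices"][0]["message"]["content"])
```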
- Pooling model jinaai/jina-embeddings-v3
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: jina-embeddings-v3 # Note: The label value `model.aibrix.ai/name` here must match with the service name.
    model.aibrix.ai/port: "8000"
  name: jina-embeddings-v3
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: jina-embeddings-v3
  template:
    metadata:
      labels:
        model.aibrix.ai/name: jina-embeddings-v3
    spec:
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - jinaai/jina-embeddings-v3
            - --dtype
            - half
            - --task
            - embed
            - --trust-remote-code
            - --served-model-name
            - jina-embeddings-v3
            # Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
          image: vllm/vllm-openai:v0.7.1
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: jina-embeddings-v3
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: jina-embeddings-v3 # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: jina-embeddings-v3
  type: ClusterIP
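With this pooling deployment, the embeddings API (the original question in this issue) can be exercised through the gateway. A sketch, again assuming the gateway is port-forwarded to `localhost:8888`:

```python
import requests

GATEWAY = "http://localhost:8888"  # assumption: port-forwarded gateway address

resp = requests.post(
    f"{GATEWAY}/v1/embeddings",
    json={
        "model": "jina-embeddings-v3",
        "input": ["AIBrix is a building block for GenAI inference infrastructure."],
    },
)
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))  # embedding dimension
```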
- Rerank model BAAI/bge-reranker-base
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: bge-reranker-base # Note: The label value `model.aibrix.ai/name` here must match with the service name.
    model.aibrix.ai/port: "8000"
  name: bge-reranker-base
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: bge-reranker-base
  template:
    metadata:
      labels:
        model.aibrix.ai/name: bge-reranker-base
    spec:
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - BAAI/bge-reranker-base
            - --dtype
            - half
            - --served-model-name
            - bge-reranker-base
            # Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
          image: vllm/vllm-openai:v0.7.1
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: bge-reranker-base
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: bge-reranker-base # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: bge-reranker-base
  type: ClusterIP
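For the reranker deployment, vLLM exposes a Jina/Cohere-style rerank endpoint (`/v1/rerank`). A sketch of calling it through the gateway; the gateway address is an assumption as above, and the exact response schema depends on the vLLM version:

```python
import requests

GATEWAY = "http://localhost:8888"  # assumption: port-forwarded gateway address

resp = requests.post(
    f"{GATEWAY}/v1/rerank",
    json={
        "model": "bge-reranker-base",
        "query": "What is the capital of France?",
        "documents": [
            "Paris is the capital of France.",
            "The Eiffel Tower is in Paris.",
            "Berlin is the capital of Germany.",
        ],
    },
)
print(resp.json())  # documents ranked by relevance score
```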
- Score model BAAI/bge-reranker-v2-m3
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: bge-reranker-v2-m3 # Note: The label value `model.aibrix.ai/name` here must match with the service name.
    model.aibrix.ai/port: "8000"
  name: bge-reranker-v2-m3
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: bge-reranker-v2-m3
  template:
    metadata:
      labels:
        model.aibrix.ai/name: bge-reranker-v2-m3
    spec:
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - BAAI/bge-reranker-v2-m3
            - --dtype
            - half
            - --task
            - score
            - --served-model-name
            - bge-reranker-v2-m3
            # Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
          image: vllm/vllm-openai:v0.7.1
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: bge-reranker-v2-m3
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: bge-reranker-v2-m3 # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: bge-reranker-v2-m3
  type: ClusterIP
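The score task can be exercised the same way through vLLM's `/v1/score` endpoint, which takes `text_1`/`text_2` pairs. A sketch, with the gateway address assumed as before and the response printed raw since the schema may vary by version:

```python
import requests

GATEWAY = "http://localhost:8888"  # assumption: port-forwarded gateway address

resp = requests.post(
    f"{GATEWAY}/v1/score",
    json={
        "model": "bge-reranker-v2-m3",
        "text_1": "What is the capital of France?",
        "text_2": [
            "Paris is the capital of France.",
            "Berlin is the capital of Germany.",
        ],
    },
)
print(resp.json())  # similarity / relevance score for each pair
```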
- Transcription model openai/whisper-large-v3
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: whisper-large-v3 # Note: The label value `model.aibrix.ai/name` here must match with the service name.
    model.aibrix.ai/port: "8000"
  name: whisper-large-v3
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: whisper-large-v3
  template:
    metadata:
      labels:
        model.aibrix.ai/name: whisper-large-v3
    spec:
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - openai/whisper-large-v3
            - --dtype
            - half
            - --task
            - transcription
            - --served-model-name
            - whisper-large-v3
            # Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
          image: vllm/vllm-openai:v0.8.5
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: whisper-large-v3
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: whisper-large-v3 # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: whisper-large-v3
  type: ClusterIP
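For the transcription deployment, the OpenAI-style `/v1/audio/transcriptions` endpoint takes a multipart upload. A sketch; the audio file path and gateway address are assumptions:

```python
import requests

GATEWAY = "http://localhost:8888"  # assumption: port-forwarded gateway address

with open("sample.wav", "rb") as f:  # assumption: a local test audio file
    resp = requests.post(
        f"{GATEWAY}/v1/audio/transcriptions",
        files={"file": ("sample.wav", f, "audio/wav")},
        data={"model": "whisper-large-v3"},
    )
print(resp.json())  # transcribed text on success
```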
Great work! I think Varun may add some incompatible cases later, and we still need to cut a separate documentation PR before the v0.3.0 release. I will close this issue; let's use the umbrella issue https://github.com/vllm-project/aibrix/issues/846 to track the overall progress.