[router] Document supported APIs
🚀 Feature Description and Motivation
We should document the supported APIs. Besides this, I would like to ask whether embedding APIs are supported.
Use Case
N/A
Proposed Solution
No response
Do you mean the generation/embedding/tokenization APIs supported in vLLM (https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/openai)? The current gateway design is more like a proxy than an additional API layer. Technically, it supports any protocol the engine supports. The gateway plugin only validates model existence based on the registration information.
Currently, the gateway configuration doesn't set any restrictions. In the future, for stability considerations, this might change: https://github.com/vllm-project/aibrix/blob/6feec99d77c84e371da9c535054c2b8aa8912704/config/gateway/gateway.yaml
I agree that embedding or other API compatibility should be documented.
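To make the "proxy" behavior concrete, here is a minimal sketch of calling an OpenAI-style endpoint through the gateway. The local address, port-forward command, and model name are assumptions for illustration, not fixed defaults; the gateway only checks that the `model` field matches a registered deployment and forwards everything else to the engine.

```python
import requests

# Assumption: the AIBrix gateway has been exposed locally, e.g. via
#   kubectl -n envoy-gateway-system port-forward svc/<gateway-service> 8888:80
GATEWAY = "http://localhost:8888"

# The gateway proxies this request to whichever engine serves the model;
# the only gateway-side validation is that the model name is registered.
resp = requests.post(
    f"{GATEWAY}/v1/chat/completions",
    json={
        "model": "qwen25-15b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
)
print(resp.status_code, resp.json())
```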
> Do you mean the generation/embedding/tokenization APIs supported in vLLM (https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/openai)?

Yes. Since vLLM doesn't support the Batch API, it makes sense that AIBrix shouldn't mark it as supported either. As a user, I'm just curious to see in the docs which APIs are actually supported; that clarity would be super helpful! 😊
Got your point, that totally makes sense. I think it should support something similar to Kubernetes extension API services. The Batch API is a good example: currently, there doesn't seem to be a standardized engine implementation for it. If users implement it in a third-party manner, we should aggregate it at the gateway layer while allowing different services/components to provide it.
I’m currently working on implementing the batch API with support for object storage and local files in our production stack’s router. I’m not entirely sure yet, but I’m wondering if it’s possible to integrate this production stack router as a component of a vLLM deployment. If so, the gateway could potentially aggregate and collaborate with the production stack router to make this functionality work.
Gateway -> Router deployment -> vLLM deployment
Adding the router might introduce a bit of latency—somewhere around 1 to 10 milliseconds. But honestly, I think it’s kind of unavoidable if we’re planning to implement batching outside of vLLM. It’s just one of those trade-offs we’ll have to consider.
@gaocegege I see. Technically I think it's possible. The P&D (prefill/decode disaggregation) case requires such a router as well.
At the same time, AIBrix has a batch RFC as well (https://github.com/vllm-project/aibrix/issues/182), but due to limited resources we have not made enough progress. Compared to implementing the routing and Batch API layers together in the router, I am thinking that in AIBrix:
- we could have an extended server that only provides the Batch API service, request orchestration (congestion control, backpressure, etc.), and object management; it acts as the client and sends requests to the backend vLLM service.
- the gateway can add the necessary routing strategy support for batch requests (this also depends on how batching is implemented).
In this case, the flow would be
Gateway (Batch Async API) -> Batch API Service -> Gateway (mostly Sync API) -> vLLM deployment.
I think this is an alternative approach; a rough sketch of what such a Batch API service could look like is below.
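The following is a hypothetical sketch only, not an existing AIBrix component: it reads an OpenAI-style batch input file (one JSON request per line with `custom_id`, `url`, and `body`), applies a simple concurrency cap as the congestion-control/backpressure piece, and replays each line as a synchronous request back through the gateway. The gateway address and module names are assumptions.

```python
import asyncio
import json
import httpx

GATEWAY = "http://aibrix-gateway:8888"  # assumption: internal gateway address
MAX_CONCURRENCY = 8                     # naive backpressure knob

async def run_batch(batch_file: str, output_file: str) -> None:
    """Replay an OpenAI-style batch input file as sync requests via the gateway."""
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def one(line: str, client: httpx.AsyncClient) -> dict:
        req = json.loads(line)
        async with sem:  # congestion control: cap in-flight requests
            resp = await client.post(GATEWAY + req["url"], json=req["body"])
        return {"custom_id": req.get("custom_id"), "response": resp.json()}

    async with httpx.AsyncClient(timeout=300) as client:
        with open(batch_file) as f:
            results = await asyncio.gather(
                *(one(line, client) for line in f if line.strip())
            )
    with open(output_file, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")

# asyncio.run(run_batch("batch_input.jsonl", "batch_output.jsonl"))
```

Object storage upload/download and batch status tracking (the metadata mentioned below) would sit around this loop.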
The Batch API needs user management to support the List Batches operation, which means the gateway needs to access a metadata database.
I’m a bit unsure if it’s ideal for the gateway to handle business logic, but overall, LGTM
This task should be part of https://github.com/vllm-project/aibrix/issues/846. As the v0.3.0 release approaches, we should finish this task ASAP.
@OrdinaryCrazy any updates on the API compatibility and results comparison?
@Jeffwan @gaocegege @varungup90
I tested most of the OpenAI-compatible APIs and compared vLLM's output with AIBrix's output. For a summary, refer to this doc:
Some YAML files used in my tests:
- Base model Qwen2.5-1.5B-Instruct
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: qwen25-15b-instruct # Note: The label value `model.aibrix.ai/name` here must match with the service name.
    model.aibrix.ai/port: "8000"
  name: qwen25-15b-instruct
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: qwen25-15b-instruct
  template:
    metadata:
      labels:
        model.aibrix.ai/name: qwen25-15b-instruct
    spec:
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - Qwen/Qwen2.5-1.5B-Instruct
            - --dtype
            - half
            - --served-model-name
            # Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
            - qwen25-15b-instruct
          image: vllm/vllm-openai:v0.7.1
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: qwen25-15b-instruct
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: qwen25-15b-instruct # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: qwen25-15b-instruct
  type: ClusterIP
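As an illustration of how the output comparison can be done with this deployment, the same request can be sent once to the vLLM pod directly and once through the AIBrix gateway. This is a sketch; the two local ports are assumed `kubectl port-forward` targets, not fixed defaults.

```python
import requests

# Assumptions: the vLLM pod is port-forwarded to localhost:8000 and the
# AIBrix gateway to localhost:8888.
DIRECT = "http://localhost:8000"
GATEWAY = "http://localhost:8888"

payload = {
    "model": "qwen25-15b-instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0,   # deterministic-ish output makes the diff meaningful
    "max_tokens": 64,
}

for name, base in [("vLLM direct", DIRECT), ("AIBrix gateway", GATEWAY)]:
    resp = requests.post(f"{base}/v1/chat/completions", json=payload)
    print(name, resp.status_code, resp.json()["choices"][0]["message"]["content"])
```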
- Pooling model jinaai/jina-embeddings-v3
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: jina-embeddings-v3 # Note: The label value `model.aibrix.ai/name` here must match with the service name.
    model.aibrix.ai/port: "8000"
  name: jina-embeddings-v3
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: jina-embeddings-v3
  template:
    metadata:
      labels:
        model.aibrix.ai/name: jina-embeddings-v3
    spec:
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - jinaai/jina-embeddings-v3
            - --dtype
            - half
            - --task
            - embed
            - --trust-remote-code
            - --served-model-name
            - jina-embeddings-v3
            # Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
          image: vllm/vllm-openai:v0.7.1
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: jina-embeddings-v3
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: jina-embeddings-v3 # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: jina-embeddings-v3
  type: ClusterIP
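With this pooling deployment, the embeddings API (the original question in this issue) can be exercised through the gateway. A sketch, again assuming the gateway is port-forwarded to `localhost:8888`:

```python
import requests

GATEWAY = "http://localhost:8888"  # assumption: port-forwarded gateway address

resp = requests.post(
    f"{GATEWAY}/v1/embeddings",
    json={
        "model": "jina-embeddings-v3",
        "input": ["AIBrix is a building block for GenAI inference infrastructure."],
    },
)
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))  # embedding dimension
```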
- Rerank model BAAI/bge-reranker-base
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: bge-reranker-base # Note: The label value `model.aibrix.ai/name` here must match with the service name.
    model.aibrix.ai/port: "8000"
  name: bge-reranker-base
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: bge-reranker-base
  template:
    metadata:
      labels:
        model.aibrix.ai/name: bge-reranker-base
    spec:
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - BAAI/bge-reranker-base
            - --dtype
            - half
            - --served-model-name
            - bge-reranker-base
            # Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
          image: vllm/vllm-openai:v0.7.1
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: bge-reranker-base
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: bge-reranker-base # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: bge-reranker-base
  type: ClusterIP
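For the reranker deployment, vLLM exposes a Jina/Cohere-style rerank endpoint (`/v1/rerank`). A sketch of calling it through the gateway; the gateway address is an assumption as above, and the exact response schema depends on the vLLM version:

```python
import requests

GATEWAY = "http://localhost:8888"  # assumption: port-forwarded gateway address

resp = requests.post(
    f"{GATEWAY}/v1/rerank",
    json={
        "model": "bge-reranker-base",
        "query": "What is the capital of France?",
        "documents": [
            "Paris is the capital of France.",
            "The Eiffel Tower is in Paris.",
            "Berlin is the capital of Germany.",
        ],
    },
)
print(resp.json())  # documents ranked by relevance score
```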
- Score model BAAI/bge-reranker-v2-m3
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: bge-reranker-v2-m3 # Note: The label value `model.aibrix.ai/name` here must match with the service name.
    model.aibrix.ai/port: "8000"
  name: bge-reranker-v2-m3
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: bge-reranker-v2-m3
  template:
    metadata:
      labels:
        model.aibrix.ai/name: bge-reranker-v2-m3
    spec:
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - BAAI/bge-reranker-v2-m3
            - --dtype
            - half
            - --task
            - score
            - --served-model-name
            - bge-reranker-v2-m3
            # Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
          image: vllm/vllm-openai:v0.7.1
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: bge-reranker-v2-m3
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: bge-reranker-v2-m3 # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: bge-reranker-v2-m3
  type: ClusterIP
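The score task can be exercised the same way through vLLM's `/v1/score` endpoint, which takes `text_1`/`text_2` pairs. A sketch, with the gateway address assumed as before and the response printed raw since the schema may vary by version:

```python
import requests

GATEWAY = "http://localhost:8888"  # assumption: port-forwarded gateway address

resp = requests.post(
    f"{GATEWAY}/v1/score",
    json={
        "model": "bge-reranker-v2-m3",
        "text_1": "What is the capital of France?",
        "text_2": [
            "Paris is the capital of France.",
            "Berlin is the capital of Germany.",
        ],
    },
)
print(resp.json())  # similarity / relevance score for each pair
```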
- Transcription model openai/whisper-large-v3
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: whisper-large-v3 # Note: The label value `model.aibrix.ai/name` here must match with the service name.
    model.aibrix.ai/port: "8000"
  name: whisper-large-v3
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: whisper-large-v3
  template:
    metadata:
      labels:
        model.aibrix.ai/name: whisper-large-v3
    spec:
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - openai/whisper-large-v3
            - --dtype
            - half
            - --task
            - transcription
            - --served-model-name
            - whisper-large-v3
            # Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
          image: vllm/vllm-openai:v0.8.5
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: whisper-large-v3
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: whisper-large-v3 # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: whisper-large-v3
  type: ClusterIP
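For the transcription deployment, the OpenAI-style `/v1/audio/transcriptions` endpoint takes a multipart upload. A sketch; the audio file path and gateway address are assumptions:

```python
import requests

GATEWAY = "http://localhost:8888"  # assumption: port-forwarded gateway address

with open("sample.wav", "rb") as f:  # assumption: a local test audio file
    resp = requests.post(
        f"{GATEWAY}/v1/audio/transcriptions",
        files={"file": ("sample.wav", f, "audio/wav")},
        data={"model": "whisper-large-v3"},
    )
print(resp.json())  # transcribed text on success
```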
Great work! I think Varun may add some incompatible cases later, and we still need to cut a separate documentation PR before the v0.3.0 release. I will close this issue; let's use the umbrella issue https://github.com/vllm-project/aibrix/issues/846 to track the overall progress.