How to install and deploy AIBrix on a single server?
This is the original question. https://github.com/vllm-project/aibrix/issues/1396#issuecomment-3425452314 I kindly request your help. Thank you. @Jeffwan @googs1025
Thank you very much for your reply. However, for beginners who are new to LLMs and AIBrix, the first installation can be difficult. Is there a clearer, more straightforward installation and usage tutorial? I have a server with 8 L40S GPUs (I already have the Qwen3-32B model files locally, along with other models such as Llama 3.2 3B).
Step 1: https://aibrix.readthedocs.io/latest/getting_started/installation/lambda.html
I encountered some problems during installation. First, do I need to create a Docker container on the server and then install AIBrix inside that container? Following the tutorial you provided, I tried to set things up on Lambda Cloud, but on the Lambda docs site (https://docs.lambda.ai/) I could not figure out what to select and install.
Step 2: I went to https://docs.lambda.ai/ and opened the PUBLIC CLOUD section.
There I found https://docs.lambda.ai/education/, which lists:

Large language models (LLMs)
- Deploying a Llama 3 inference endpoint
- Deploying Llama 3.2 3B in a Kubernetes (K8s) cluster
- Using KubeAI to deploy Nous Research's Hermes 3 and other LLMs
- Serving Llama 3.1 405B on a Lambda 1-Click Cluster
- Serving the Llama 3.1 8B and 70B models using Lambda Cloud on-demand instances
- Running DeepSeek-R1 70B using Ollama
Which one should I choose? In any case, I went into https://docs.lambda.ai/education/large-language-models/k8s-ollama-llama-3-2/. Is that right?
Is there a contradiction between creating a Docker container on the server and "Deploying Llama 3.2 3B in a Kubernetes (K8s) cluster"? If not, which comes first: create the Docker container and then deploy Llama 3.2 3B in the K8s cluster, or deploy in the K8s cluster first and then create the Docker container? I am very confused.
@Jeffwan Please forgive my forwardness. I am eager to deploy AIBrix on a single server, but I have indeed encountered many problems. Thank you very much for your patience and attention.
@Jeffwan At the very beginning, I followed this tutorial: https://aibrix.readthedocs.io/latest/getting_started/installation/installation.html#install-aibrix-in-testing-environments
First, I built an image from the following Dockerfile, then created a Docker container from it:
```dockerfile
# Based on an existing NVIDIA PyTorch image (ships with GPU/CUDA support)
FROM nvcr.io/nvidia/pytorch:24.09-py3
# Disable interactive prompts to speed up system package installation
ENV DEBIAN_FRONTEND=noninteractive
# Install system dependencies and clean up the apt cache
RUN apt-get update && apt-get install -y \
    sudo \
    apt-transport-https \
    ca-certificates \
    software-properties-common \
    systemd \
    curl \
    && rm -rf /var/lib/apt/lists/*
# Install kubectl (official GitHub release, via a CN proxy for speed)
RUN set -eux; \
    # Download kubectl v1.28.0 through a GitHub mirror (to avoid a 404)
    curl -fL --retry 3 "https://ghproxy.com/https://github.com/kubernetes/kubernetes/releases/download/v1.28.0/kubectl-linux-amd64" -o kubectl \
    && chmod +x kubectl \
    && mv kubectl /usr/local/bin/ \
    && kubectl version --client # verify the installation
# Install a single-node K3s cluster (CN mirror, with a startup check)
RUN set -eux; \
    curl -sfL --http1.1 https://rancher-mirror.oss-cn-beijing.aliyuncs.com/k3s/k3s-install.sh | sh -s - \
    --disable traefik \
    --write-kubeconfig-mode 644 \
    --kubeconfig /etc/rancher/k3s/k3s.yaml \
    --data-dir /var/lib/k3s \
    --system-default-registry "registry.cn-hangzhou.aliyuncs.com"; \
    # Wait for the K3s service to start (up to 60 seconds)
    count=0; \
    until systemctl is-active --quiet k3s; do \
      echo "Waiting for the K3s service to start..."; \
      sleep 5; \
      if [ $((++count)) -ge 12 ]; then \
        echo "K3s startup timed out!"; \
        exit 1; \
      fi; \
    done
# Point kubectl at the local K3s cluster by default
RUN mkdir -p /root/.kube && ln -s /etc/rancher/k3s/k3s.yaml /root/.kube/config
# Copy the AIBrix install script into the image and make it executable
COPY install-aibrix.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/install-aibrix.sh
# Run the AIBrix install script when the container starts
CMD ["install-aibrix.sh"]
```
After that, I started the container with:

```bash
docker run -itd --name aibrix-alanchen --privileged --gpus all --network host --restart always aibrix-ready:v0.4.1
```

and let it install the AIBrix environment via the following install-aibrix.sh:
```bash
#!/bin/bash
set -e
# Wait for the K3s cluster to become ready (up to 30 seconds)
echo "=== Waiting for the K3s cluster to start ==="
for i in {1..30}; do
  if kubectl get nodes &> /dev/null; then
    echo "K3s cluster is ready!"
    break
  fi
  sleep 1
done
# Install the AIBrix dependency components (Envoy Gateway + KubeRay)
echo -e "\n=== Installing AIBrix dependencies ==="
kubectl apply -f https://github.com/vllm-project/aibrix/releases/download/v0.4.1/aibrix-dependency-v0.4.1.yaml --server-side
# Wait for the dependency components to come up (5-minute timeout)
kubectl wait --for=condition=available deployment --all -n aibrix-system --timeout=300s || true
kubectl wait --for=condition=available deployment --all -n envoy-gateway-system --timeout=300s || true
# Install the AIBrix core components
echo -e "\n=== Installing AIBrix core ==="
kubectl apply -f https://github.com/vllm-project/aibrix/releases/download/v0.4.1/aibrix-core-v0.4.1.yaml
# Verify the installation
echo -e "\n=== Installation finished! Current AIBrix component status ==="
kubectl get pods -n aibrix-system
kubectl get pods -n envoy-gateway-system
# Keep the container running (tail the K3s log)
tail -f /var/log/k3s/k3s.log
```
But it ultimately failed. Could you point me to a quick, reliable installation tutorial? If possible, I would be happy to write a single-server AIBrix installation tutorial based on my experience, as a contribution to the AIBrix community. @Jeffwan Thanks again.
@googs1025 🧑🏻💻 Thank you very much for the reminder. It is dedicated, professional contributors like you who keep pushing the community forward, and I will follow in your footsteps. Could you please help me solve this problem? https://github.com/vllm-project/aibrix/issues/1690 Thank you so much.
@Alan-D-Chen I think you just need to follow this guide; this page gives you everything you need. What you followed, like the Lambda Cloud Llama installation, is not helpful. It seems you tried some unrelated guides.
https://aibrix.readthedocs.io/latest/getting_started/installation/lambda.html
Let me know if you encounter other issues.
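For anyone following along, the essence of that guide on a single host is a minikube cluster running on the host itself, not inside a `docker build`. A minimal sketch of that flow; the minikube flags here are my assumption rather than copied verbatim from the guide:

```bash
# Run on the host: K3s/minikube need a live init system and cgroups,
# which are not available while an image is being built.
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

# Single-node cluster with GPU passthrough (assumes the NVIDIA container
# toolkit is already installed; --gpus needs a recent minikube release).
minikube start --driver=docker --container-runtime=docker --gpus=all

# Then apply the AIBrix manifests exactly as the guide does.
kubectl apply -f https://github.com/vllm-project/aibrix/releases/download/v0.4.1/aibrix-dependency-v0.4.1.yaml --server-side
kubectl apply -f https://github.com/vllm-project/aibrix/releases/download/v0.4.1/aibrix-core-v0.4.1.yaml
```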
Good morning @Jeffwan. I'm sorry, but I've run into a tricky problem. When I got to this step (https://aibrix.readthedocs.io/latest/getting_started/installation/lambda.html), I hit an issue:
kubectl get pods -n aibrix-system
The images for these six pods cannot be pulled.
I managed to pull five of them (or close versions) from other sources.
I had to retag kuberay/operator:nightly so that kubectl get pods -n aibrix-system would be satisfied.
However, aibrix/gpu-optimizer:v0.4.1 cannot be found at all, either from the server or from my local PC.
I also changed Docker's daemon.json:
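For context, a registry-mirrors entry is the usual thing to put in daemon.json; a sketch of what I mean (the mirror host is a placeholder, not a specific recommendation):

```bash
# Add a pull-through mirror to Docker and restart the daemon.
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://<your-mirror-host>"]
}
EOF
sudo systemctl restart docker
```

Note that this typically only affects `docker pull` on the host; pods inside minikube pull through minikube's own container runtime.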
In the end, none of this worked. Could you do me a favor? Thank you!
Also, I could not find a working solution in these issues: https://github.com/vllm-project/aibrix/pull/1682 https://github.com/vllm-project/aibrix/pull/1539
> However, aibrix/gpu-optimizer:v0.4.1 cannot be found at all, either from the server or from my local PC.
Where did you find this image? Did you follow the guidance exactly, or did you fetch and retag it yourself?
We use the runtime image for the gpu-optimizer in both the Helm and Kustomize setups:
https://github.com/vllm-project/aibrix/blob/dfb5b35c97c236d2ee9322df08d6d747f6aff3ad/dist/chart/stable.yaml#L90-L92
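To double-check what was actually deployed, you can query the image directly; this is plain kubectl, with the deployment name inferred from the pod names in this thread:

```bash
# Print the container image used by the gpu-optimizer deployment.
kubectl get deployment aibrix-gpu-optimizer -n aibrix-system \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
```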
@Jeffwan My bad. Here's the situation: I tried to download aibrix/gpu-optimizer by running docker pull aibrix/gpu-optimizer:v0.4.1 both on my local PC (planning to upload it to the server afterwards) and directly on the server, but it could not be found in either place. I also searched for aibrix/gpu-optimizer:v0.4.1 on Docker Hub and in Docker Desktop, and it does not exist at all. Likewise, kuberay/operator:v1.1.0 did not show up on Docker Hub or in Docker Desktop; the only similar tag I found was kuberay/operator:nightly. How can I obtain the correct kuberay/operator:v1.1.0 and aibrix/gpu-optimizer:v0.4.1 images? As it stands, the pods listed by kubectl get pods -n aibrix-system cannot start.
root@a7:/home/chendong/aibrix# kubectl get pods -n aibrix-system
NAME READY STATUS RESTARTS AGE
aibrix-controller-manager-57bd8857f4-v2l65 0/1 ImagePullBackOff 0 15h
aibrix-gateway-plugins-7dfc7569b-sdtlp 0/1 Init:ImagePullBackOff 0 15h
aibrix-gpu-optimizer-7d9bbf9c7c-c69zc 0/1 ImagePullBackOff 0 15h
aibrix-kuberay-operator-9b8548c98-ghtgq 0/1 ImagePullBackOff 0 15h
aibrix-metadata-service-7668b6f95d-dvgzh 0/1 Init:ImagePullBackOff 0 15h
aibrix-redis-master-56cbb99b6b-qkzkq 0/1 ImagePullBackOff 0 15h
root@a7:/home/chendong/aibrix# ps aux | grep minikube | grep tunnel
root 957570 0.1 0.0 2657904 92060 pts/18 Sl+ 01:52 0:00 minikube tunnel
I know the place you're referring to:
@Jeffwan May I ask: does kubectl get pods -n aibrix-system have to be used this way, or can I add a domestic (CN) mirror source so that everything downloads conveniently in one go? I am just one step away now. I haven't found a reasonable solution in the AIBrix issues either. If possible, could you help me?
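For what it's worth, minikube has a built-in flag for users in mainland China, but as far as I can tell it only redirects minikube's own system images, not application images like aibrix/* or kuberay/*, so it may solve only part of the problem:

```bash
# Pull minikube/Kubernetes system images from a CN mirror; application
# images still come from Docker Hub or must be loaded manually.
minikube start --driver=docker --gpus=all --image-mirror-country=cn
```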
This is the result on my local PC:
This is the result on the server:
@Jeffwan To put it in one sentence: when I run kubectl get pods -n aibrix-system (the final step of installing AIBrix), the server cannot find these six images.
I can only docker pull them one by one through different channels, and in the end aibrix/gpu-optimizer:v0.4.1 and kuberay/operator:v1.1.0 cannot be found at all, or at least no suitable version can be found. How can this be solved? I have really tried my best to explain the problem clearly, in both Chinese and English.
I searched Docker Hub, and there is no aibrix/gpu-optimizer image there at all:
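Putting Jeffwan's earlier pointer together: the release manifests run the gpu-optimizer on aibrix/runtime:v0.4.1, so there is no separate aibrix/gpu-optimizer image to pull. For the remaining pull failures, a workaround sketch (the mirror host is a placeholder):

```bash
# kuberay/operator:v1.1.0 does exist on Docker Hub even if search does
# not surface it; if Docker Hub is unreachable, pull via a mirror and retag.
docker pull <mirror-host>/kuberay/operator:v1.1.0
docker tag  <mirror-host>/kuberay/operator:v1.1.0 kuberay/operator:v1.1.0
minikube image load kuberay/operator:v1.1.0
```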
Maybe I get it. Wait a sec~~
@Jeffwan Hello, sorry to bother you again. This question comes from a thoroughly worn-out coder. With kubectl get pods -n aibrix-system I keep running into a series of problems (I got the following results within just a few seconds; in a very short time, some pods stop working):
root@a7:/home/chendong/aibrix# kubectl get pods -n aibrix-system
NAME READY STATUS RESTARTS AGE
aibrix-controller-manager-57bd8857f4-wxgt4 0/1 Running 0 11s
aibrix-gateway-plugins-7dfc7569b-g4q25 0/1 Init:0/1 0 11s
aibrix-gpu-optimizer-7d9bbf9c7c-rghlg 1/1 Running 0 11s
aibrix-kuberay-operator-9b8548c98-r4lsf 0/1 ContainerCreating 0 11s
aibrix-metadata-service-7668b6f95d-fjt79 0/1 Init:0/1 0 11s
aibrix-redis-master-56cbb99b6b-fbzl9 0/1 ContainerCreating 0 11s
root@a7:/home/chendong/aibrix# kubectl get pods -n aibrix-system
NAME READY STATUS RESTARTS AGE
aibrix-controller-manager-57bd8857f4-wxgt4 1/1 Running 0 16s
aibrix-gateway-plugins-7dfc7569b-g4q25 0/1 Init:0/1 0 16s
aibrix-gpu-optimizer-7d9bbf9c7c-rghlg 1/1 Running 0 16s
aibrix-kuberay-operator-9b8548c98-r4lsf 0/1 ContainerCreating 0 16s
aibrix-metadata-service-7668b6f95d-fjt79 0/1 Init:0/1 0 16s
aibrix-redis-master-56cbb99b6b-fbzl9 0/1 ContainerCreating 0 16s
root@a7:/home/chendong/aibrix# kubectl get pods -n aibrix-system
NAME READY STATUS RESTARTS AGE
aibrix-controller-manager-57bd8857f4-wxgt4 1/1 Running 0 19s
aibrix-gateway-plugins-7dfc7569b-g4q25 0/1 Init:0/1 0 19s
aibrix-gpu-optimizer-7d9bbf9c7c-rghlg 1/1 Running 0 19s
aibrix-kuberay-operator-9b8548c98-r4lsf 0/1 ContainerCreating 0 19s
aibrix-metadata-service-7668b6f95d-fjt79 0/1 Init:0/1 0 19s
aibrix-redis-master-56cbb99b6b-fbzl9 0/1 ErrImagePull 0 19s
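When a pod flips to ErrImagePull like this, the Events section of kubectl describe shows the exact image reference and registry error (pod name taken from the listing above):

```bash
# The Events section at the bottom shows the exact image pull failure.
kubectl describe pod aibrix-redis-master-56cbb99b6b-fbzl9 -n aibrix-system | tail -n 20
kubectl get events -n aibrix-system --sort-by=.lastTimestamp | tail -n 10
```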
I think I have solved the image problem:
root@a7:/home/chendong/aibrix# minikube cache list
root@a7:/home/chendong/aibrix# minikube image list
registry.k8s.io/pause:3.10.1
registry.k8s.io/kube-scheduler:v1.34.0
registry.k8s.io/kube-proxy:v1.34.0
registry.k8s.io/kube-controller-manager:v1.34.0
registry.k8s.io/kube-apiserver:v1.34.0
registry.k8s.io/etcd:3.6.4-0
registry.k8s.io/coredns/coredns:v1.12.1
nvcr.io/nvidia/k8s-device-plugin:<none>
gcr.io/k8s-minikube/storage-provisioner:v5
docker.io/library/redis:7.4
docker.io/library/busybox:stable
docker.io/kuberay/operator:v1.1.0
docker.io/kuberay/operator:nightly
docker.io/envoyproxy/gateway:v1.2.8
docker.io/envoyproxy/envoy:v1.33.2
docker.io/aibrix/runtime:v0.4.1
docker.io/aibrix/metadata-service:v0.4.1
docker.io/aibrix/gateway-plugins:v0.4.1
docker.io/aibrix/controller-manager:v0.4.1
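With all of the images above present in minikube, restarting the stuck deployments makes the kubelet retry with the local copies. This is a generic kubectl pattern; it assumes the deployments' imagePullPolicy is not Always:

```bash
# Recreate the pods so they pick up the locally loaded images, then wait
# for every deployment in the namespace to become Available.
kubectl rollout restart deployment -n aibrix-system
kubectl wait --for=condition=available deployment --all -n aibrix-system --timeout=300s
```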
@Jeffwan In my haste I could only ask in Chinese; if anything is unclear, I can ask again in English. Do all of the pods need to be Ready before AIBrix can be used?
@Jeffwan How can I solve this problem? I have tried several approaches and still cannot resolve it. If possible, please help me. Thank you.
I followed this tutorial strictly: https://aibrix.readthedocs.io/latest/getting_started/installation/lambda.html#
The image problem:
@googs1025 If possible, could you explain in more detail? Thank you~~ Beginners can get really lost here; I have been at this for almost 5 days~
There are several ways to deploy:
- You can deploy directly with the Quickstart: https://aibrix.readthedocs.io/latest/getting_started/quickstart.html
- You can also use the community-provided Helm chart: https://github.com/vllm-project/aibrix/tree/main/dist/chart

Please resolve the image-pull network problem on your own. Also, if you pull images manually, use minikube image load to load them into minikube.
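A sketch of that manual route, using the image list shown earlier in this thread (pull each image, optionally via a mirror, then load it into minikube):

```bash
# Required images, matching the working `minikube image list` above.
IMAGES=(
  kuberay/operator:v1.1.0
  envoyproxy/gateway:v1.2.8
  envoyproxy/envoy:v1.33.2
  redis:7.4
  busybox:stable
  aibrix/runtime:v0.4.1
  aibrix/metadata-service:v0.4.1
  aibrix/gateway-plugins:v0.4.1
  aibrix/controller-manager:v0.4.1
)
for img in "${IMAGES[@]}"; do
  docker pull "$img"          # or pull from a mirror and retag first
  minikube image load "$img"
done
```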
Please accept my heartfelt thanks for your strong support. I used the AIBrix framework to run both aggregated/centralized and prefill/decode (P/D) disaggregated inference tests on a single server. I hope this contribution helps developers and the broader community use AIBrix. @Jeffwan @googs1025
And results:
This is my understanding of AIBrix aggregated and P/D-disaggregated inference; I'm not sure it's correct and would welcome corrections. Also, wasn't there a plan to release a version that doesn't require K8s or minikube? Is it ready now? Normally AIBrix is meant to be deployed across dozens of servers managing hundreds or thousands of GPUs, right? @Jeffwan
@Alan-D-Chen awesome work!
> Wasn't there a plan to release a version that doesn't require K8s or minikube? Is it ready now? Normally AIBrix is meant to be deployed across dozens of servers managing hundreds or thousands of GPUs, right?
It's not fully finished yet; I will keep you posted once it's done. The process orchestration takes some time, especially for P/D.
> And results:
@Alan-D-Chen This is awesome! But from the results perspective, I didn't see a big difference between P/D and non-P/D. Technically, the decoding latencies for non-P/D are much higher.
Could I know some of your setup details?
- What chips?
- TP2 (4 replicas?) vs 2P2D (4 × TP2 = 8 GPUs in total)?
- What are the workload input and output sizes?
From the table, your decoding latency seems to start at ~20ms and gradually drop to 10ms; however, in the spreadsheet the TPOT or ITL numbers don't match. I'm not sure whether there's a typo or I'm misreading it.
Answer:
- Wed Nov 26 09:35:13 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08 Driver Version: 570.148.08 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S On | 00000000:01:00.0 Off | Off |
| N/A 39C P0 111W / 350W | 845MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
- Aggregation: tensor parallelism = 2, 2 GPUs for inference. P/D disaggregation: 2 GPUs for prefill, 2 GPUs for decode.
- The core of the benchmark script:
```bash
CONCURRENCIES=(10 20 30 40 50 60 70 80 90 100 110 120)
BASE_CMD="python3 /vllm-workspace/benchmarks/benchmark_serving.py \
  --model Qwen3-32B \
  --dataset-name sonnet \
  --port 8000 \
  --sonnet-input-len 512 \
  --sonnet-output-len 256 \
  --endpoint /v1/chat/completions \
  --tokenizer /mnt/LLM_models/Qwen3-32B/ \
  --dataset-path /vllm-workspace/benchmarks/sonnet.txt \
  --backend openai-chat \
  --trust-remote-code"
```
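For completeness, a sketch of how the sweep might drive BASE_CMD. The --max-concurrency, --num-prompts, --save-result, and --result-filename flags exist in recent versions of vLLM's benchmark_serving.py, but the result-file naming below is my own:

```bash
# One benchmark run per concurrency level; scale the prompt count with
# concurrency so each level runs long enough to be meaningful.
for c in "${CONCURRENCIES[@]}"; do
  $BASE_CMD \
    --max-concurrency "$c" \
    --num-prompts "$((c * 10))" \
    --save-result \
    --result-filename "qwen3-32b-c${c}.json"
done
```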