Setting Replicas > 1 on a LoRA ModelAdapter does not take effect
I started a base model with 2 pods and then with 5 pods, and adjusted the Replicas count in modeladapter.yaml each time. I want the LoRA adapter to be mounted on multiple pods, but in every attempt only one pod ends up mounting it. What could be the reason?

kubectl get pods
qwen3-8b-d445dbdb7-fqssw 2/2 Running 0 50m
qwen3-8b-d445dbdb7-mcc4p 2/2 Running 0 50m
kubectl describe modeladapter qwen3-8b-cflora
Name: qwen3-8b-cflora
Namespace: default
Labels: model.aibrix.ai/name=qwen3-8b-cflora
model.aibrix.ai/port=8000
Annotations: <none>
API Version: model.aibrix.ai/v1alpha1
Kind: ModelAdapter
Metadata:
Creation Timestamp: 2025-10-17T07:42:18Z
Finalizers:
adapter.model.aibrix.ai/finalizer
Generation: 1
Resource Version: 441059
UID: 71775340-593c-442d-83f8-f540e2dcb377
Spec:
Artifact URL: /lora_models/qwen3-8b-cflora
Base Model: qwen3-8b
Pod Selector:
Match Labels:
adapter.model.aibrix.ai/enabled: true
model.aibrix.ai/name: qwen3-8b
Replicas: 2
Scheduler Name: least-adapters
Status:
Conditions:
Last Transition Time: 2025-10-17T07:42:18Z
Message: Starting reconciliation
Reason: ModelAdapterPending
Status: Unknown
Type: Initialized
Last Transition Time: 2025-10-17T07:42:18Z
Message: ModelAdapter default/qwen3-8b-cflora has been allocated to pod default/qwen3-8b-d445dbdb7-mcc4p
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-10-17T07:42:18Z
Message: ModelAdapter default/qwen3-8b-cflora is ready
Reason: ModelAdapterAvailable
Status: True
Type: Ready
Instances:
qwen3-8b-d445dbdb7-mcc4p
Phase: Running
Events: <none>
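For reference, the applied modeladapter.yaml is roughly equivalent to the following (reconstructed from the describe output above; field names are inferred from the model.aibrix.ai/v1alpha1 spec shown there):

```yaml
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen3-8b-cflora
  namespace: default
spec:
  baseModel: qwen3-8b
  artifactURL: /lora_models/qwen3-8b-cflora
  podSelector:
    matchLabels:
      adapter.model.aibrix.ai/enabled: "true"
      model.aibrix.ai/name: qwen3-8b
  replicas: 2                     # expectation: mount the adapter on 2 pods
  schedulerName: least-adapters
```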
The aibrix-controller-manager logs contain the following entry; I am not sure whether this error is the cause:
E1017 07:42:18.600173 1 controller.go:316] "msg"="Reconciler error" "error"="update modelAdapter status error: Operation cannot be fulfilled on modeladapters.model.aibrix.ai \"qwen3-8b-cflora\": the object has been modified; please apply your changes to the latest version and try again" "ModelAdapter"={"name":"qwen3-8b-cflora","namespace":"default"} "controller"="model-adapter-controller" "controllerGroup"="model.aibrix.ai" "controllerKind"="ModelAdapter" "name"="qwen3-8b-cflora" "namespace"="default" "reconcileID"="303a4507-3868-4d8c-b6cd-cf8c43cda29e"
Reading the code, it seems 0.4.1 does not support multiple replicas; the scheduler only returns a single pod. Is that what causes this? Will the next version support it?
@yang753 yes, replicas are not supported in v0.4.1. BTW, I happen to be refactoring this part, and I want to double-check the expected behaviors.
Could you check this PR: https://github.com/vllm-project/aibrix/pull/1670?
The new proposal will support two cases (sketched below):
- replicas is nil: automatically choose all available instances
- replicas is 1: choose one instance

Do you think this works in your case?
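A minimal sketch of the two modes, assuming the spec layout stays as in v0.4.x (illustrative only, including the resource names; the authoritative semantics are in the PR):

```yaml
# Case 1: replicas omitted (nil): the controller automatically schedules
# the adapter onto every available pod matching the selector.
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen3-8b-cflora
spec:
  baseModel: qwen3-8b
  artifactURL: /lora_models/qwen3-8b-cflora
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen3-8b
---
# Case 2: replicas: 1: the controller picks exactly one instance.
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen3-8b-cflora-single
spec:
  baseModel: qwen3-8b
  artifactURL: /lora_models/qwen3-8b-cflora
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen3-8b
  replicas: 1
```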
If you prefer the Kubernetes approach (e.g., running 5 LoRA adapters across 10 model instances), could you share why you wouldn't instead merge the LoRAs and run standalone model deployments?
I have LLM services for multiple applications, all LoRA fine-tuned from the same base model, and currently each is merged and deployed independently. For example, I have two businesses, a and b, and each is estimated to need 5 instances at peak to handle all requests. With merged deployments that means 10 instances in total, but their peak hours differ. So my idea is to deploy 5 base model instances and mount the LoRAs for businesses a and b on top of them, which would cut GPU cost. It would be even better if LoRA mounting supported dynamic load balancing: say I start 10 base model instances and configure a LoRA replica range of 3~8, and adapters get mounted and unmounted automatically based on per-business request volume. That way multiple LoRAs could dynamically share the base model pool, instead of every base instance loading all LoRAs at startup.
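To make the setup concrete, a sketch of what I have in mind (adapter names and artifact paths are illustrative): one shared pool of 5 qwen3-8b pods, with one ModelAdapter per business mounted on the same pods.

```yaml
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen3-8b-lora-a               # adapter for business a (illustrative name)
spec:
  baseModel: qwen3-8b
  artifactURL: /lora_models/qwen3-8b-lora-a
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen3-8b  # same selector: both adapters share the pool
---
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen3-8b-lora-b               # adapter for business b (illustrative name)
spec:
  baseModel: qwen3-8b
  artifactURL: /lora_models/qwen3-8b-lora-b
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen3-8b
```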
@yang753
- It seems you can tolerate the LoRA overhead, i.e. serving unmerged adapters is acceptable to you.
- Do you expect to specify a LoRA replica range like 2-8, or a fully serverless approach? Say we exposed an option called "autoscale: enabled or disabled" (see the sketch below): if enabled, you would not need to manage the replicas min/max; the controller would decide how far to scale. LoRA instance pick-up and intelligent routing would then guarantee the best performance.
This is a very classic multiplexing use case. I think we are struggling a bit with the interface right now (we have different use cases internally). Feel free to give more feedback.
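Purely as an interface sketch of the options discussed above (none of these fields exist today; all names are hypothetical):

```yaml
spec:
  baseModel: qwen3-8b
  artifactURL: /lora_models/qwen3-8b-cflora
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen3-8b
  autoscale: enabled    # hypothetical: controller decides how many pods mount the adapter
  # alternatively, an explicit range instead of full autoscaling:
  # minReplicas: 3      # hypothetical
  # maxReplicas: 8      # hypothetical
```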