Setting Replicas > 1 on a LoRA ModelAdapter does not take effect
I started a base model with 2 pods and then with 5 pods, and adjusted the Replicas count in modeladapter.yaml each time. I want the LoRA adapter to be mounted on multiple pods, but in every attempt only one pod ends up mounting it. What could be the reason?

kubectl get pods
qwen3-8b-d445dbdb7-fqssw 2/2 Running 0 50m
qwen3-8b-d445dbdb7-mcc4p 2/2 Running 0 50m
kubectl describe modeladapter qwen3-8b-cflora
Name: qwen3-8b-cflora
Namespace: default
Labels: model.aibrix.ai/name=qwen3-8b-cflora
model.aibrix.ai/port=8000
Annotations: <none>
API Version: model.aibrix.ai/v1alpha1
Kind: ModelAdapter
Metadata:
Creation Timestamp: 2025-10-17T07:42:18Z
Finalizers:
adapter.model.aibrix.ai/finalizer
Generation: 1
Resource Version: 441059
UID: 71775340-593c-442d-83f8-f540e2dcb377
Spec:
Artifact URL: /lora_models/qwen3-8b-cflora
Base Model: qwen3-8b
Pod Selector:
Match Labels:
adapter.model.aibrix.ai/enabled: true
model.aibrix.ai/name: qwen3-8b
Replicas: 2
Scheduler Name: least-adapters
Status:
Conditions:
Last Transition Time: 2025-10-17T07:42:18Z
Message: Starting reconciliation
Reason: ModelAdapterPending
Status: Unknown
Type: Initialized
Last Transition Time: 2025-10-17T07:42:18Z
Message: ModelAdapter default/qwen3-8b-cflora has been allocated to pod default/qwen3-8b-d445dbdb7-mcc4p
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-10-17T07:42:18Z
Message: ModelAdapter default/qwen3-8b-cflora is ready
Reason: ModelAdapterAvailable
Status: True
Type: Ready
Instances:
qwen3-8b-d445dbdb7-mcc4p
Phase: Running
Events: <none>
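For reference, the applied modeladapter.yaml is roughly equivalent to the following (reconstructed from the describe output above; field names are inferred from the model.aibrix.ai/v1alpha1 spec shown there):

```yaml
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen3-8b-cflora
  namespace: default
spec:
  baseModel: qwen3-8b
  artifactURL: /lora_models/qwen3-8b-cflora
  podSelector:
    matchLabels:
      adapter.model.aibrix.ai/enabled: "true"
      model.aibrix.ai/name: qwen3-8b
  replicas: 2                     # expectation: mount the adapter on 2 pods
  schedulerName: least-adapters
```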
The aibrix-controller-manager logs contain the following entry; I am not sure whether this error is the cause:
E1017 07:42:18.600173 1 controller.go:316] "msg"="Reconciler error" "error"="update modelAdapter status error: Operation cannot be fulfilled on modeladapters.model.aibrix.ai \"qwen3-8b-cflora\": the object has been modified; please apply your changes to the latest version and try again" "ModelAdapter"={"name":"qwen3-8b-cflora","namespace":"default"} "controller"="model-adapter-controller" "controllerGroup"="model.aibrix.ai" "controllerKind"="ModelAdapter" "name"="qwen3-8b-cflora" "namespace"="default" "reconcileID"="303a4507-3868-4d8c-b6cd-cf8c43cda29e"
Reading the code, it seems 0.4.1 does not support multiple replicas; the scheduler only returns a single pod. Is that what causes this? Will the next version support it?
@yang753 yes, replicas are not supported in v0.4.1. BTW, I happen to be refactoring this part, and I want to double-check the expected behaviors.
Could you check this PR: https://github.com/vllm-project/aibrix/pull/1670?
The new proposal will support two cases (sketched below):
- replicas is nil: automatically choose all available instances
- replicas is 1: choose one instance

Do you think this works in your case?
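A minimal sketch of the two modes, assuming the spec layout stays as in v0.4.x (illustrative only, including the resource names; the authoritative semantics are in the PR):

```yaml
# Case 1: replicas omitted (nil): the controller automatically schedules
# the adapter onto every available pod matching the selector.
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen3-8b-cflora
spec:
  baseModel: qwen3-8b
  artifactURL: /lora_models/qwen3-8b-cflora
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen3-8b
---
# Case 2: replicas: 1: the controller picks exactly one instance.
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen3-8b-cflora-single
spec:
  baseModel: qwen3-8b
  artifactURL: /lora_models/qwen3-8b-cflora
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen3-8b
  replicas: 1
```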
If you prefer the Kubernetes approach (e.g., running 5 LoRA adapters across 10 model instances), could you share why you wouldn't instead merge the LoRAs and run standalone model deployments?
I have LLM services for multiple applications, all LoRA fine-tuned from the same base model, and currently each is merged and deployed independently. For example, I have two businesses, a and b, and each is estimated to need 5 instances at peak to handle all requests. With merged deployments that means 10 instances in total, but their peak hours differ. So my idea is to deploy 5 base model instances and mount the LoRAs for businesses a and b on top of them, which would cut GPU cost. It would be even better if LoRA mounting supported dynamic load balancing: say I start 10 base model instances and configure a LoRA replica range of 3~8, and adapters get mounted and unmounted automatically based on per-business request volume. That way multiple LoRAs could dynamically share the base model pool, instead of every base instance loading all LoRAs at startup.
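To make the setup concrete, a sketch of what I have in mind (adapter names and artifact paths are illustrative): one shared pool of 5 qwen3-8b pods, with one ModelAdapter per business mounted on the same pods.

```yaml
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen3-8b-lora-a               # adapter for business a (illustrative name)
spec:
  baseModel: qwen3-8b
  artifactURL: /lora_models/qwen3-8b-lora-a
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen3-8b  # same selector: both adapters share the pool
---
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen3-8b-lora-b               # adapter for business b (illustrative name)
spec:
  baseModel: qwen3-8b
  artifactURL: /lora_models/qwen3-8b-lora-b
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen3-8b
```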
@yang753
- It seems you can tolerate the LoRA overhead, i.e. serving unmerged adapters is acceptable to you.
- Do you expect to specify a LoRA replica range like 2-8, or a fully serverless approach? Say we exposed an option called "autoscale: enabled or disabled" (see the sketch below): if enabled, you would not need to manage the replicas min/max; the controller would decide how far to scale. LoRA instance pick-up and intelligent routing would then guarantee the best performance.
This is a very classic multiplexing use case. I think we are struggling a bit with the interface right now (we have different use cases internally). Feel free to give more feedback.
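Purely as an interface sketch of the options discussed above (none of these fields exist today; all names are hypothetical):

```yaml
spec:
  baseModel: qwen3-8b
  artifactURL: /lora_models/qwen3-8b-cflora
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen3-8b
  autoscale: enabled    # hypothetical: controller decides how many pods mount the adapter
  # alternatively, an explicit range instead of full autoscaling:
  # minReplicas: 3      # hypothetical
  # maxReplicas: 8      # hypothetical
```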