
[Misc] Support adapter scaling to all replicas

Open · dittops opened this pull request 7 months ago

Pull Request Description

Support adapter scaling to all replicas.

  • allow controller to sync adapter instances with all active pods
  • load adapter on each pod
  • update EndpointSlice with all pod IPs
  • adjust resources and tests for multi-pod support

Related Issues

Resolves: #1095

dittops avatar May 23 '25 15:05 dittops

@dittops Great! I will spend some time this week to review this change

Jeffwan avatar May 28 '25 22:05 Jeffwan

@dittops I think the only part I'm not sure about is the scheduling part. Can you give more details?

Jeffwan avatar Jun 16 '25 14:06 Jeffwan

I have used the following logic for adapter loading/unloading:

  1. Pods are selected via label-based matching on the label adapter.model.aibrix.ai/enabled: "true".
  2. The selected pods are added to the Status.Instances list.
  3. The reconcileLoading function iterates over the Instances list and loads the adapter on each pod.

dittops avatar Jun 18 '25 09:06 dittops

@dittops The workflow sounds good. From the change, I notice the LoRA scheduling logic has been deleted. In this case, how are pods selected?

Jeffwan avatar Jun 19 '25 00:06 Jeffwan

schedulePod was used to pick one pod, which was then assigned to instance.Status.Instances. Instead of choosing one pod, the new approach uses ALL pods that match the selector and adds them all to instance.Status.Instances.

If we need to keep schedulePod, we can move getActivePodsForModelAdapter inside schedulePod so that it selects all matching pods and returns a list.
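That refactor could look roughly like this. The types here are minimal stand-ins, and this getActivePodsForModelAdapter is a simplified mock of the real helper (which also checks pod readiness and other conditions):

```go
package main

import (
	"errors"
	"fmt"
)

// Pod and ModelAdapter are minimal stand-ins for the real corev1/aibrix types.
type Pod struct {
	Name   string
	Labels map[string]string
}

type ModelAdapter struct {
	Selector map[string]string
}

// getActivePodsForModelAdapter filters pods by the adapter's selector
// (simplified: the real helper also checks readiness).
func getActivePodsForModelAdapter(a ModelAdapter, pods []Pod) []Pod {
	var out []Pod
	for _, p := range pods {
		ok := true
		for k, v := range a.Selector {
			if p.Labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, p)
		}
	}
	return out
}

// schedulePod keeps its old name but now returns every matching pod
// instead of picking a single one.
func schedulePod(a ModelAdapter, pods []Pod) ([]Pod, error) {
	matched := getActivePodsForModelAdapter(a, pods)
	if len(matched) == 0 {
		return nil, errors.New("no active pods match the adapter selector")
	}
	return matched, nil
}

func main() {
	adapter := ModelAdapter{Selector: map[string]string{"model.aibrix.ai/name": "base-model"}}
	pods := []Pod{
		{Name: "base-0", Labels: map[string]string{"model.aibrix.ai/name": "base-model"}},
		{Name: "base-1", Labels: map[string]string{"model.aibrix.ai/name": "base-model"}},
		{Name: "other-0", Labels: map[string]string{"model.aibrix.ai/name": "other"}},
	}
	matched, err := schedulePod(adapter, pods)
	if err != nil {
		panic(err)
	}
	for _, p := range matched {
		fmt.Println(p.Name)
	}
}
```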

dittops avatar Jun 19 '25 01:06 dittops

@dittops Yeah, I think the behavior has changed a bit recently.

Option 1: Schedule the LoRA model onto specific pods based on the specified replica count.

Option 2: Load the LoRA into all base-model replicas so that every replica is identical; this is the approach you're switching to.

While Option 2 is a valid pattern that we can support, I strongly recommend sticking with Option 1 (with multi-replica support) as the primary solution for now. In our case, some high-rank LoRA models are quite large, and it's not practical to scale using Option 2. We could consider adding Option 2 as a separate feature later.

What do you think?

Jeffwan avatar Jun 19 '25 02:06 Jeffwan

@Jeffwan Are you referring to adding a replica count to the ModelAdapter spec and using that for scheduling? e.g.:

  spec:
    replicas: 3  # Only load on 3 pods
    podSelector:
      matchLabels:
        model.aibrix.ai/name: base-model

dittops avatar Jun 19 '25 04:06 dittops

@dittops exactly. https://github.com/vllm-project/aibrix/blame/main/api/model/v1alpha1/modeladapter_types.go#L53

Jeffwan avatar Jun 19 '25 07:06 Jeffwan

@dittops Apologies for the late response. I have recently been refactoring the LoRA work to provide better production-level support, and I want to merge this one first before I refactor the code. However, it seems I can't rebase the main branch changes onto this branch. Could you help rebase it?


Jeffwan avatar Aug 17 '25 16:08 Jeffwan

@Jeffwan, I have rebased. Could you take a look?

dittops avatar Aug 18 '25 02:08 dittops


@dittops I think the problem is that the vLLM community updated the CI after this PR went out, resulting in some failures in the CI checks. I tried closing and reopening this PR, but that doesn't seem to help much. Could you cut a new PR? We can work directly on the new one and discard this one.

Jeffwan avatar Aug 18 '25 07:08 Jeffwan

@dittops Now it's working, thanks! No need to cut a new PR.

Jeffwan avatar Aug 18 '25 09:08 Jeffwan

/gemini review

Jeffwan avatar Aug 18 '25 09:08 Jeffwan

The change overall looks good to me. I will address Gemini's feedback in a later refactor.

Jeffwan avatar Aug 18 '25 09:08 Jeffwan