Support multiple LoRA adapter replicas
🚀 Feature Description and Motivation
In the initial version, to simplify model adapter autoscaling, we decided to support only 1 replica in the CRD. Technically, we should support multiple replicas to allow higher throughput.
Use Case
My production deployment needs higher throughput, and I want multiple LoRA replicas to be deployed in the environment.
Proposed Solution
- Enable replicas in the LoRA CRD
- Make sure the scheduling algorithm can correctly schedule the LoRA replicas. We need to handle some special cases, such as keeping the number of LoRA replicas <= the number of pods. It's meaningless to place more than 1 replica of the same LoRA on a single pod.
- (Optional) Support LoRA autoscaling
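The placement constraint in the second bullet can be sketched roughly as follows. This is only an illustration, not the AIBrix scheduler: the function name `place_replicas`, the `loaded_adapters` bookkeeping map, and the least-loaded selection policy are all assumptions made for the example. It caps the effective replica count at the number of candidate pods and never puts two replicas of the same adapter on one pod.

```python
from typing import Dict, List


def place_replicas(
    pods: List[str],
    requested_replicas: int,
    loaded_adapters: Dict[str, int],
) -> List[str]:
    """Pick pods for one LoRA adapter, at most one replica per pod.

    pods: names of candidate pods (hypothetical input).
    requested_replicas: replica count asked for in the spec.
    loaded_adapters: number of adapters already loaded on each pod
        (hypothetical bookkeeping, used only to spread load).
    """
    if requested_replicas < 1:
        raise ValueError("requested_replicas must be >= 1")

    # De-duplicate pod names so the same pod can never be chosen twice.
    candidates = set(pods)

    # A second replica on the same pod adds no throughput, so the
    # effective replica count can never exceed the number of pods.
    effective = min(requested_replicas, len(candidates))

    # Assumed policy: prefer pods with the fewest adapters loaded,
    # breaking ties by pod name so the result is deterministic.
    ranked = sorted(candidates, key=lambda p: (loaded_adapters.get(p, 0), p))
    return ranked[:effective]
```

With three candidate pods and five requested replicas, this sketch returns all three pods and leaves the remaining two replicas unscheduled, which is the "replicas <= pods" special case mentioned above.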
It's meaningless to support > 1 loras on single pod.
Quick q: did you mean support "< 1 loras on single pod"?
@xieus This is a constraint on the scheduling: a single LoRA model adapter can have no more than 1 replica scheduled to a given pod. 2 replicas on a single pod won't help from the throughput perspective.
#205 has become a large change, and I noticed there are some edge cases that need to be covered. I will postpone this feature to rc3.
It takes some time to refactor the current code base to improve its extensibility for such changes. I have already moved some of the refactoring changes from #205 to #260. This will be moved to v0.2.0.
Moved to a later release due to limited time.
This has been supported in https://github.com/vllm-project/aibrix/pull/1132
I feel we need to change the design a little bit.
- LoRA replicas introduce a hierarchy level, and it's hard to manage everything in the model adapter layer. Fields such as status.phase can only indicate the status of a single replica, not of all replicas.
- LoRA "1 or all" will be much cleaner. This aligns with the original idea in #1132 for supporting multiple replicas.
Once we have enough use cases, we can extend to the hierarchical design.
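One possible reading of the "1 or all" idea, sketched under stated assumptions: the adapter is loaded either on a single pod or on every ready pod, with no intermediate replica counts to track. The function name `resolve_targets` and the `load_on_all` flag are hypothetical and do not come from the AIBrix API.

```python
from typing import List


def resolve_targets(ready_pods: List[str], load_on_all: bool) -> List[str]:
    """Return the pods that should load the adapter.

    load_on_all is a hypothetical flag standing in for the
    "1 or all" choice; it is not an actual CRD field.
    """
    # Sort and de-duplicate so the choice of the single pod is stable.
    candidates = sorted(set(ready_pods))
    if not candidates:
        return []
    # "All": every ready pod loads the adapter. "1": only the first one.
    return candidates if load_on_all else candidates[:1]
```

Because there are only two outcomes, a single status field can describe the adapter without a per-replica hierarchy.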