aibrix
Add Gateway handling for the scale-down-to-0 case
🚀 Feature Description and Motivation
The autoscaler should support scaling down to 0 replicas. When a new request arrives for a model that has been scaled to 0, an activator component should intercept the request and start a new pod. Right now, a model inference request simply fails with the following error if the number of replicas is 0:
error on getting pods for model llama2-7b
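To make the activator idea concrete, here is a minimal sketch in Python. It is not aibrix code: the `Activator` class and the `ready_pods`, `scale_up`, and `forward` callbacks are hypothetical names standing in for whatever the gateway and autoscaler actually expose. It only illustrates the flow described above: hold the request, ask for a scale-up once, wait for a ready pod, then forward.

```python
import threading
import time
from typing import Callable, Set


class Activator:
    """Hypothetical activator: holds a request for a model that has 0 ready
    pods, asks for a scale-up, and forwards once a pod is ready."""

    def __init__(
        self,
        ready_pods: Callable[[str], int],   # assumed: returns ready pod count
        scale_up: Callable[[str], None],    # assumed: asks autoscaler for >= 1 replica
        poll_interval: float = 0.01,
        timeout: float = 1.0,
    ) -> None:
        self._ready_pods = ready_pods
        self._scale_up = scale_up
        self._poll_interval = poll_interval
        self._timeout = timeout
        self._lock = threading.Lock()
        self._pending: Set[str] = set()     # models with a scale-up in flight

    def handle(self, model: str, forward: Callable[[str], str]) -> str:
        # Only intercept when nothing can serve the request.
        if self._ready_pods(model) == 0:
            self._request_scale_up(model)
            self._wait_until_ready(model)
        return forward(model)

    def _request_scale_up(self, model: str) -> None:
        # Skip the call if a scale-up for this model is already in flight.
        with self._lock:
            if model in self._pending:
                return
            self._pending.add(model)
        self._scale_up(model)

    def _wait_until_ready(self, model: str) -> None:
        deadline = time.monotonic() + self._timeout
        try:
            while self._ready_pods(model) == 0:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"no ready pod for model {model}")
                time.sleep(self._poll_interval)
        finally:
            # Clear the marker so a later request can trigger a new scale-up.
            with self._lock:
                self._pending.discard(model)
```

With something like this in the request path, the gateway would return a timeout only if no pod becomes ready in time, instead of failing immediately. Polling is used here only to keep the sketch short; a real implementation would more likely watch pod readiness events.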
Use Case
No response
Proposed Solution
No response