Support predictive autoscaling for LLM inference
Feature Description and Motivation
We have done a lot of work around reactive autoscaling; however, model bootstrap still takes a long time. In addition to traditional autoscaling, we want to provide another option - a time-series prediction method - to address the latency issue.
Use Case
LLM autoscaling
Proposed Solution
No response
Hi @Jeffwan! I'm very interested in this predictive autoscaling feature.
I have experience with time series forecasting (using Prophet/LSTM) and Kubernetes autoscaling. I understand the pain point of model bootstrap latency in LLM inference.
I'd like to work on this issue. My approach would be:
- Design a time series prediction module for traffic forecasting
- Integrate with existing autoscaling mechanisms
- Add configuration options for prediction parameters
Could I be assigned to this issue? I estimate it would take 3-4 weeks to deliver a solution.
Thanks for driving this effort. @Belyenochi I just assigned this issue to you.
Hi @Jeffwan, Thanks for assigning this to me! I'm excited to work on #1418.
I've been researching how Netflix Scryer and Google Cloud handle predictive autoscaling, and I'm curious about your thoughts on applying these approaches to LLM workloads.
From what I found, there seem to be two main philosophies:
Netflix Scryer:
- Scale-up: Aggressive predictive expansion, reactive can supplement
- Scale-down: Predictive sets safety floor, prevents over-contraction
- Philosophy: "Better to over-provision than under-serve"
Google Cloud:
- Scale-up: Confidence-driven blending of predictive + reactive signals
- Scale-down: Predictive-driven with reactive override for emergencies
- Philosophy: "Optimize for cost efficiency while maintaining SLO"
Initial thoughts for LLM scenarios:
Given the LLM cold start challenges (5-8min model loading), I'm thinking a Netflix-style asymmetric approach might work well:
Scale-Up Strategy:
- What do you think about proactive expansion based on predicted traffic patterns?
- Decision logic could be:
max(predicted_need, reactive_suggestion) - The rationale being that cold starts are so expensive for LLMs that reactive scaling might be too late (a small sketch of both decisions follows below)
Scale-Down Strategy:
- I'm wondering if we should use predictive floor protection to prevent over-aggressive scale-down
- Maybe something like:
max(predicted_minimum, reactive_suggestion) - This could help avoid those painful cold start delays when traffic picks back up
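To make the asymmetry concrete, here is a minimal sketch of the two decision rules. The helper names are hypothetical and only illustrate the max() logic above; this is not an existing AIBrix API.

package sketch

// Minimal sketch of the asymmetric policy above (uses Go 1.21+ built-in max).

// Scale-up: cold starts are expensive, so never provision below the forecast;
// the reactive signal can only push the target higher.
func desiredOnScaleUp(predictedNeed, reactiveSuggestion int32) int32 {
    return max(predictedNeed, reactiveSuggestion)
}

// Scale-down: the forecast acts as a floor so a temporary lull does not shed
// capacity that the prediction says we will need again shortly.
func desiredOnScaleDown(predictedMinimum, reactiveSuggestion int32) int32 {
    return max(predictedMinimum, reactiveSuggestion)
}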
Technical Implementation Questions:
Currently AIBRIX pulls metrics directly from vLLM pods. For predictive capabilities, I'm thinking we could:
- Add Prometheus integration: Query historical metrics for pattern analysis - does this align with your API refactor plans?
- Implement Netflix Scryer-style prediction (rough sketch after this list):
- Weekly periodicity analysis (they use same-weekday patterns since Tuesday vs Tuesday is more similar than Tuesday vs Wednesday)
- 8-12 weeks lookback window with time-decay weighting
- 10-minute aggregation windows for pattern detection
- Extend PodAutoscaler: Add predictive metricSource alongside existing reactive ones
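To give a feel for the Scryer-style idea, here is a rough, simplified sketch of the two pieces: querying a few weeks of traffic history from Prometheus at 10-minute resolution, and a same-weekday, time-decay-weighted predictor over that history. The Prometheus address, the query string, and every helper name are assumptions for illustration only, not existing AIBrix or katalyst code.

package prediction

import (
    "context"
    "fmt"
    "math"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
    "github.com/prometheus/common/model"
)

// fetchHistory pulls several weeks of a request-rate series from Prometheus at
// 10-minute resolution (the aggregation window mentioned above). The query is
// supplied by the caller; any per-model traffic series would work.
func fetchHistory(ctx context.Context, address, query string, weeks int) (model.Matrix, error) {
    client, err := api.NewClient(api.Config{Address: address})
    if err != nil {
        return nil, err
    }
    end := time.Now()
    start := end.Add(-time.Duration(weeks) * 7 * 24 * time.Hour)
    result, _, err := promv1.NewAPI(client).QueryRange(ctx, query, promv1.Range{
        Start: start,
        End:   end,
        Step:  10 * time.Minute,
    })
    if err != nil {
        return nil, err
    }
    matrix, ok := result.(model.Matrix)
    if !ok {
        return nil, fmt.Errorf("unexpected result type %T", result)
    }
    return matrix, nil
}

// predictRate estimates the traffic rate at a future timestamp from the samples
// taken at the same weekday/time-of-day slot in previous weeks, weighting recent
// weeks more heavily via a simple exponential time decay.
func predictRate(series []model.SamplePair, target time.Time, weeks int, decay float64) float64 {
    var weightedSum, weightSum float64
    for w := 1; w <= weeks; w++ {
        slot := target.Add(-time.Duration(w) * 7 * 24 * time.Hour)
        if v, ok := sampleNear(series, slot, 10*time.Minute); ok {
            weight := math.Pow(decay, float64(w-1)) // last week gets weight 1, older weeks decay
            weightedSum += weight * v
            weightSum += weight
        }
    }
    if weightSum == 0 {
        return 0 // no usable history for this slot
    }
    return weightedSum / weightSum
}

// sampleNear returns the sample value closest to t within the given tolerance.
func sampleNear(series []model.SamplePair, t time.Time, tolerance time.Duration) (float64, bool) {
    for _, p := range series {
        if d := p.Timestamp.Time().Sub(t); d > -tolerance && d < tolerance {
            return float64(p.Value), true
        }
    }
    return 0, false
}

A real version would then divide the predicted rate by an assumed per-pod capacity to get a replica count and combine it with the reactive signal; it would also need outlier filtering and a richer periodicity model, which this sketch omits.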
Do you think this approach makes sense for our use cases? I'm curious about:
- Whether our production workloads show clear weekly/daily patterns that would benefit from this kind of prediction
- Any constraints from the API refactor that might influence the design
- Your thoughts on the Netflix vs Google philosophy for our scenarios
- Whether Prometheus as the single data source would be sufficient for this
References:
- Netflix Scryer: http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html
- Netflix Scryer Part 2: https://netflixtechblog.com/scryer-netflixs-predictive-auto-scaling-engine-part-2-bb9c4f9b9385
- Google Cloud predictive autoscaling: https://cloud.google.com/compute/docs/autoscaler/predictive-autoscaling
Perfect timing on the API refactor - having a clean, well-designed autoscaling API will make integrating the predictive component much smoother! While we wait for the new baseline, I can start working on the predictive algorithm design and Prometheus integration.
Looking forward to discussing this further!
Thanks!
@Belyenochi Great start! I love this rough design. Here's how I'd evaluate and shape predictive autoscaling for LLM workloads in AIBrix.
- I think both the Netflix and Google philosophies are valuable, and the choice between them is mostly a strategy question, if I understand correctly. I do not have a strong preference at this moment; I will read up and leave some comments later.
- Unified entry point: expose predictive and reactive as sources within the same API, so users just declare whether they want prediction enabled rather than learning a different CRD or controller. Let's figure out what changes would be needed in https://github.com/vllm-project/aibrix/blob/main/api/autoscaling/v1alpha1/podautoscaler_types.go. If the API turns out to be totally different, we could consider introducing a new one.
- ByteDance used to work on intelligent HPA in the katalyst project; I linked the references below. My original idea was to bring that code back to AIBrix to save some effort, since we are from the same team. :D Please help evaluate the iHPA and see whether we can save some effort by building the new solution on top of it.
- Model-specific metrics: most time-series prediction solutions require several days of traffic before they can produce a prediction. Let me know if there is anything I can do to support your development work.
references:
- https://dl.acm.org/doi/pdf/10.1145/3342195.3387524 Google's autopilot paper
- https://github.com/kubewharf/katalyst-api/blob/main/pkg/apis/autoscaling/v1alpha2/ihpa.go
- https://github.com/kubewharf/katalyst-core/tree/main/pkg/controller/ihpa
Hi @Jeffwan,
Thanks for the detailed feedback! After analyzing the katalyst iHPA implementation, I have some findings and would like to discuss the best path forward:
Current State Analysis
I examined the katalyst iHPA code (ihpa.go and the controller) and found that the current iHPA doesn't have time-series prediction capabilities yet. It's essentially a wrapper around the standard Kubernetes HPA with enhanced integration for SPD (Service Profile Descriptor) and VirtualWorkload management.
Proposed Approach
Given your suggestion to leverage existing katalyst work, I'm thinking the most beneficial approach would be:
Phase 1: Enhance katalyst iHPA with Predictive Capabilities
Since katalyst is an open source project, I'd like to contribute the predictive autoscaling implementation directly to iHPA. This would benefit the entire cloud-native community and align perfectly with the goals of Issue #1418.
Proposed API Extension for katalyst iHPA:
// In katalyst-api/pkg/apis/autoscaling/v1alpha2/ihpa.go
type IntelligentHorizontalPodAutoscalerSpec struct {
    // Existing fields...
    Autoscaler AutoscalerSpec `json:"autoscaler"`

    // New: Predictive scaling configuration
    // +optional
    PredictiveConfig *PredictiveConfig `json:"predictiveConfig,omitempty"`
}

type PredictiveConfig struct {
    Enabled          bool   `json:"enabled"`
    LookAheadMinutes *int32 `json:"lookAheadMinutes,omitempty"`
    HistoryDays      *int32 `json:"historyDays,omitempty"`
}
Phase 2: AIBrix Integration with Enhanced iHPA
Once katalyst iHPA has predictive capabilities, AIBrix could integrate it seamlessly:
AIBrix PodAutoscaler API Changes:
// In aibrix/api/autoscaling/v1alpha1/podautoscaler_types.go
type PodAutoscalerSpec struct {
    // Existing AIBrix fields
    ScaleTargetRef  corev1.ObjectReference `json:"scaleTargetRef"`
    MinReplicas     *int32                 `json:"minReplicas,omitempty"`
    MaxReplicas     int32                  `json:"maxReplicas"`
    MetricsSources  []MetricSource         `json:"metricsSources,omitempty"`
    ScalingStrategy ScalingStrategyType    `json:"scalingStrategy"`

    // New: Predictive scaling delegation to iHPA
    // +optional
    PredictiveScaling *PredictiveScaling `json:"predictiveScaling,omitempty"`
}

type PredictiveScaling struct {
    Enabled          bool   `json:"enabled"`
    LookAheadMinutes *int32 `json:"lookAheadMinutes,omitempty"`
    HistoryDays      *int32 `json:"historyDays,omitempty"`
}
Predictive Autoscaling Logic (Netflix-Style)
Core Philosophy: "Better to over-provision than under-serve"
// Netflix-style predictive autoscaling implementation
type PredictiveScaler interface {
    // Predict required replica count based on historical patterns
    PredictReplicas(ctx context.Context,
        currentState WorkloadState,
        historicalData []MetricPoint,
        lookAheadMinutes int32) (int32, error)
}

// Main scaling decision logic - Netflix approach
func (r *PodAutoscalerController) calculateDesiredReplicas(
    ctx context.Context, pa *PodAutoscaler) (int32, string, error) {

    if pa.Spec.PredictiveScaling != nil && pa.Spec.PredictiveScaling.Enabled {
        // Netflix style: Trust the prediction algorithm
        predictedReplicas, err := r.predictiveScaler.PredictReplicas(
            ctx,
            r.getCurrentState(pa),
            r.getHistoricalMetrics(pa),
            *pa.Spec.PredictiveScaling.LookAheadMinutes,
        )
        if err != nil {
            // Only fall back when prediction completely fails
            return r.calculateReactiveReplicas(pa), "reactive-fallback", nil
        }

        // Netflix philosophy: Aggressive predictive expansion;
        // reactive supplements upward when current load exceeds prediction
        reactiveReplicas := r.calculateReactiveReplicas(pa)

        // Take the maximum - better to over-provision than under-serve
        return max(predictedReplicas, reactiveReplicas), "predictive", nil
    }

    // Pure reactive scaling (traditional HPA behavior)
    return r.calculateReactiveReplicas(pa), "reactive", nil
}

// Traditional HPA-style calculation for reactive scaling
func (r *PodAutoscalerController) calculateReactiveReplicas(pa *PodAutoscaler) int32 {
    currentMetrics := r.getCurrentMetrics(pa)
    targetValue := r.getTargetValue(pa)
    currentReplicas := pa.Status.CurrentReplicas

    // Standard HPA formula: desired = ceil(current * metricValue / target)
    return int32(math.Ceil(float64(currentReplicas) * currentMetrics / targetValue))
}
Architecture & Usage Example
Netflix-Style Decision Flow:
┌──────────────────────┐
│  Historical Data     │
│  Collection          │
└──────────────────────┘
           │
           ▼
┌──────────────────────┐
│  Predictive          │ ───▶ Predicted Replica Count
│  Algorithm           │      (aggressive expansion)
└──────────────────────┘
           │
           ▼
┌──────────────────────┐
│  Algorithm Success?  │
└──────────────────────┘
           │
      ┌────┴────┐
      ▼         ▼
   Success   Failure
      │         │
      ▼         ▼
┌──────────┐ ┌────────────────┐
│ Use      │ │ Fallback to    │
│ Predict  │ │ Reactive Only  │
└──────────┘ └────────────────┘
      │
      ▼
┌──────────────────────┐
│  Reactive Check      │ ───▶ Current load calculation
│  (Supplementation)   │
└──────────────────────┘
           │
           ▼
┌──────────────────────┐
│  max(predicted,      │ ───▶ Netflix: better to over-provision
│  reactive)           │
└──────────────────────┘
           │
           ▼
┌──────────────────────┐
│ Apply Safety Bounds  │
│ (min/max replicas)   │
└──────────────────────┘
For LLM workloads in AIBrix:
apiVersion: autoscaling.aibrix.io/v1alpha1
kind: PodAutoscaler
metadata:
  name: llm-service-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-service
  minReplicas: 2    # Safety floor
  maxReplicas: 20   # Safety ceiling
  scalingStrategy: APA  # AIBrix Predictive Algorithm
  metricsSources:
    - metricSourceType: pod
      protocolType: http
      path: /metrics
      targetMetric: "kv_cache_utilization"
      targetValue: "70"
  predictiveScaling:
    enabled: true
    lookAheadMinutes: 8  # cover the 5-8 minute LLM cold start
    historyDays: 14      # capture weekly business patterns
Example Scenario - Morning Preparation:
Time: 09:30
• current_replicas: 5
• current_kv_cache_utilization: 45%
• predicted_replicas: 12 (algorithm forecasts 10am meeting rush)
• reactive_replicas: ceil(5 * 45% / 70%) = 4 (current need)
Netflix Decision:
• Use prediction: 12 replicas (trust the algorithm)
• Reactive check: max(12, 4) = 12
• Action: Aggressive pre-scaling to 12 replicas
Result: LLM service ready before traffic spike hits
Example Scenario - Unexpected Load:
Time: 14:30
• current_replicas: 8 (from earlier prediction)
• current_kv_cache_utilization: 95% (viral content causes a spike)
• predicted_replicas: 6 (normal afternoon pattern)
• reactive_replicas: ceil(8 * 95% / 70%) = 11 (current overload)
Netflix Decision:
• Prediction suggests: 6 replicas
• Current load needs: 11 replicas
• max(6, 11) = 11 replicas
• Action: Reactive scaling supplements the prediction
Result: System handles unexpected load gracefully
Benefits of This Netflix-Style Approach
- Simple & Effective: No complex confidence thresholds - trust the prediction
- Over-provision Philosophy: Better to waste some resources than lose service quality
- Predictive-Led: Proactive scaling prevents performance degradation
- Reactive Safety Net: Handles scenarios the prediction didn't anticipate
- Production Proven: Based on Netflix's battle-tested approach at scale
- LLM-Optimized: Accounts for long cold start times and capacity planning needs
Discussion Points
- Contribution Strategy: Does contributing predictive capabilities to katalyst iHPA align with the project's roadmap?
- Implementation Timeline: Should I prioritize the katalyst enhancement first?
- Algorithm Choice: For LLM workloads, should the predictive algorithm consider GPU utilization, model loading patterns, and request queuing in addition to basic metrics?
I believe this collaborative approach following Netflix's proven methodology could create significant value for both projects and the broader community. Looking forward to your thoughts!
Thanks!
What is the update on this issue? Is it still expected to stay a priority?
@igor-susic1 Yes. @Belyenochi has had some errands recently; we'd like to ship this feature as fast as we can.
@Jeffwan Thanks for the reply. If any help is needed, I would love to contribute - I'd like to get hands-on with this feature as soon as possible.