
Support predictive autoscaling for LLM inference

Open Jeffwan opened this issue 4 months ago • 8 comments

πŸš€ Feature Description and Motivation

We have already done a lot of work around reactive autoscaling; however, model bootstrap still takes quite a long time. Instead of traditional autoscaling, we want to provide another option - a time series prediction method - to solve the latency issue.

Use Case

LLM autoscaling

Proposed Solution

No response

Jeffwan avatar Aug 07 '25 21:08 Jeffwan

Hi @Jeffwan! I'm very interested in this predictive autoscaling feature.

I have experience with time series forecasting (using Prophet/LSTM) and Kubernetes autoscaling. I understand the pain point of model bootstrap latency in LLM inference.

I'd like to work on this issue. My approach would be:

  1. Design a time series prediction module for traffic forecasting
  2. Integrate with existing autoscaling mechanisms
  3. Add configuration options for prediction parameters

Could I be assigned to this issue? I estimate it would take 3-4 weeks to deliver a solution.

Belyenochi avatar Aug 14 '25 03:08 Belyenochi

Thanks for driving this effort. @Belyenochi I just assigned this issue to you.

Jeffwan avatar Aug 18 '25 07:08 Jeffwan

Hi @Jeffwan, Thanks for assigning this to me! I'm excited to work on #1418.

I've been researching how Netflix Scryer and Google Cloud handle predictive autoscaling, and I'm curious about your thoughts on applying these approaches to LLM workloads.

From what I found, there seem to be two main philosophies:

Netflix Scryer:

  • Scale-up: Aggressive predictive expansion, reactive can supplement
  • Scale-down: Predictive sets safety floor, prevents over-contraction
  • Philosophy: "Better to over-provision than under-serve"

Google Cloud:

  • Scale-up: Confidence-driven blending of predictive + reactive signals
  • Scale-down: Predictive-driven with reactive override for emergencies
  • Philosophy: "Optimize for cost efficiency while maintaining SLO"

Initial thoughts for LLM scenarios:

Given the LLM cold start challenges (5-8min model loading), I'm thinking a Netflix-style asymmetric approach might work well:

Scale-Up Strategy:

  • What do you think about proactive expansion based on predicted traffic patterns?
  • Decision logic could be: max(predicted_need, reactive_suggestion)
  • The rationale being that cold starts are so expensive for LLMs that reactive scaling might be too late

Scale-Down Strategy:

  • I'm wondering if we should use predictive floor protection to prevent over-aggressive scale-down
  • Maybe something like: max(predicted_minimum, reactive_suggestion)
  • This could help avoid those painful cold start delays when traffic picks back up (a rough sketch of both directions follows below)
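
To make the asymmetric logic above concrete, here is a minimal sketch; every name in it is hypothetical and nothing here is existing AIBrix code:

// desiredReplicas applies the asymmetric policy sketched above. `predicted`
// is the forecast for the look-ahead window, `reactive` is what a plain
// HPA-style calculation asks for right now.
func desiredReplicas(predicted, reactive, minReplicas, maxReplicas int32) int32 {
    // Scale-up: take the larger of the two, so prediction pre-warms capacity
    // and the reactive signal still catches load the forecast missed.
    // Scale-down: the same max() acts as a predictive floor, stopping the
    // reactive signal from shrinking below the forecast minimum.
    desired := predicted
    if reactive > desired {
        desired = reactive
    }
    // Clamp to the user-declared safety bounds in either direction.
    if desired < minReplicas {
        desired = minReplicas
    }
    if desired > maxReplicas {
        desired = maxReplicas
    }
    return desired
}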

Technical Implementation Questions:

Currently AIBRIX pulls metrics directly from vLLM pods. For predictive capabilities, I'm thinking we could:

  • Add Prometheus integration: Query historical metrics for pattern analysis - does this align with your API refactor plans?
  • Implement Netflix Scryer-style prediction (a rough sketch follows after this list):
    • Weekly periodicity analysis (they use same-weekday patterns, since Tuesday vs Tuesday is more similar than Tuesday vs Wednesday)
    • 8-12 weeks lookback window with time-decay weighting
    • 10-minute aggregation windows for pattern detection
  • Extend PodAutoscaler: Add a predictive metricSource alongside the existing reactive ones
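
To sanity-check the data side, here is a rough sketch of what the Prometheus-backed, Scryer-style lookup could look like. The package name, function names, and the query are placeholders I made up; only the prometheus/client_golang calls are the real client API:

package predict

import (
    "context"
    "math"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
    "github.com/prometheus/common/model"
)

// fetchHistory pulls 10-minute-resolution samples for the last `weeks` weeks.
// The query is a placeholder for whatever request-rate or utilization metric
// the gateway / vLLM exporters actually expose.
func fetchHistory(ctx context.Context, promAddr, query string, weeks int) ([]model.SamplePair, error) {
    client, err := api.NewClient(api.Config{Address: promAddr})
    if err != nil {
        return nil, err
    }
    v1api := promv1.NewAPI(client)
    end := time.Now()
    start := end.Add(-time.Duration(weeks) * 7 * 24 * time.Hour)
    val, _, err := v1api.QueryRange(ctx, query, promv1.Range{
        Start: start,
        End:   end,
        Step:  10 * time.Minute, // Scryer-style 10-minute aggregation windows
    })
    if err != nil {
        return nil, err
    }
    matrix, ok := val.(model.Matrix)
    if !ok || len(matrix) == 0 {
        return nil, nil
    }
    return matrix[0].Values, nil
}

// predictSameWeekday forecasts the value at `target` by averaging the samples
// observed at the same weekday/time-of-day over previous weeks, with a
// time-decay weight so recent weeks count more.
func predictSameWeekday(samples []model.SamplePair, target time.Time, weeks int, decay float64) float64 {
    var weightedSum, weightSum float64
    for w := 1; w <= weeks; w++ {
        ref := target.Add(-time.Duration(w) * 7 * 24 * time.Hour)
        if v, ok := sampleNear(samples, ref, 10*time.Minute); ok {
            weight := math.Pow(decay, float64(w-1)) // last week weighs 1, older weeks decay
            weightedSum += weight * v
            weightSum += weight
        }
    }
    if weightSum == 0 {
        return 0 // no history for this slot; the caller should fall back to reactive scaling
    }
    return weightedSum / weightSum
}

// sampleNear returns a sample within tol of t (first match is good enough for this sketch).
func sampleNear(samples []model.SamplePair, t time.Time, tol time.Duration) (float64, bool) {
    for _, s := range samples {
        if d := s.Timestamp.Time().Sub(t); d > -tol && d < tol {
            return float64(s.Value), true
        }
    }
    return 0, false
}

With 8-12 weeks of history, the decay factor controls how quickly old weeks stop mattering; Scryer's exact weighting isn't public, so this is only the general shape.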

Do you think this approach makes sense for our use cases? I'm curious about:

  • Whether our production workloads show clear weekly/daily patterns that would benefit from this kind of prediction
  • Any constraints from the API refactor that might influence the design
  • Your thoughts on the Netflix vs Google philosophy for our scenarios
  • Whether Prometheus as the single data source would be sufficient for this

References:

  • Netflix Scryer: http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html
  • Netflix Scryer Part 2: https://netflixtechblog.com/scryer-netflixs-predictive-auto-scaling-engine-part-2-bb9c4f9b9385
  • Google Cloud predictive autoscaling: https://cloud.google.com/compute/docs/autoscaler/predictive-autoscaling

Perfect timing on the API refactor - having a clean, well-designed autoscaling API will make integrating the predictive component much smoother! While we wait for the new baseline, I can start working on the predictive algorithm design and Prometheus integration.

Looking forward to discussing this further!

Thanks!

Belyenochi avatar Aug 18 '25 10:08 Belyenochi

@Belyenochi Great start! I love this rough design. Here’s how I’d evaluate and shape predictive autoscaling for LLM workloads in AIBrix.

  • I think both the Netflix and Google philosophies are valuable, and the choice between them is mostly a strategy question, if I understand correctly. I do not have a strong preference at this moment; I will study them and give some comments later.
  • Unified entry point: expose predictive and reactive as sources within the same API, so users just declare whether they want prediction enabled, rather than learning a different CRD or controller. Let's see what changes need to be done in https://github.com/vllm-project/aibrix/blob/main/api/autoscaling/v1alpha1/podautoscaler_types.go (a rough sketch of one possible shape is below this list). If the API turns out to be totally different, we could consider using a new one.
  • Bytedance used to work on intelligent HPA in the katalyst project; I linked the references below. My original idea is to bring that code back to AIBrix to save some effort, since we are from the same team. :D Please help evaluate the iHPA and see whether we can save some effort by building a new solution on top of it.
  • Model-specific metrics: most time-series prediction solutions require x days of traffic before they can provide predictions. Let me know if there is anything I can do to support your development work.
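
To make the "sources within the same API" idea concrete, here is a very rough sketch of one possible shape; the field and constant names below are illustrative only, not a committed design:

// Hypothetical sketch against aibrix/api/autoscaling/v1alpha1/podautoscaler_types.go.
// MetricSourceType already exists in the current API; it is repeated here only
// so the sketch is self-contained.
type MetricSourceType string

const (
    // Existing reactive source types (pod metrics, etc.) stay as they are;
    // a new value opts a source into prediction.
    MetricSourceTypePredictive MetricSourceType = "predictive"
)

// MetricSource keeps its reactive fields, and an optional Predictive block
// carries the forecasting parameters, so users stay inside the same
// PodAutoscaler CRD instead of learning a new controller.
type MetricSource struct {
    MetricSourceType MetricSourceType `json:"metricSourceType"`
    Path             string           `json:"path,omitempty"`
    TargetMetric     string           `json:"targetMetric,omitempty"`
    TargetValue      string           `json:"targetValue,omitempty"`

    // +optional
    Predictive *PredictiveSpec `json:"predictive,omitempty"`
}

// PredictiveSpec is a hypothetical holder for forecasting parameters.
type PredictiveSpec struct {
    // LookAheadMinutes: how far ahead to forecast, e.g. enough to cover the
    // 5-8 minute model cold start.
    LookAheadMinutes int32 `json:"lookAheadMinutes"`
    // HistoryWeeks: how many weeks of history to learn weekly patterns from.
    HistoryWeeks int32 `json:"historyWeeks"`
}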

References:

  • Google's Autopilot paper: https://dl.acm.org/doi/pdf/10.1145/3342195.3387524
  • katalyst iHPA API: https://github.com/kubewharf/katalyst-api/blob/main/pkg/apis/autoscaling/v1alpha2/ihpa.go
  • katalyst iHPA controller: https://github.com/kubewharf/katalyst-core/tree/main/pkg/controller/ihpa

Jeffwan avatar Aug 19 '25 12:08 Jeffwan

Hi @Jeffwan,

Thanks for the detailed feedback! After analyzing the katalyst iHPA implementation, I have some findings and would like to discuss the best path forward:

Current State Analysis

I examined the katalyst iHPA code (ihpa.go and the controller) and found that the current iHPA doesn't have time-series prediction capabilities yet. It's essentially a wrapper around the standard Kubernetes HPA, with enhanced integration for SPD (Service Profile Descriptor) and VirtualWorkload management.

Proposed Approach

Given your suggestion to leverage existing katalyst work, I'm thinking the most beneficial approach would be:

Phase 1: Enhance katalyst iHPA with Predictive Capabilities

Since katalyst is an open source project, I'd like to contribute the predictive autoscaling implementation directly to iHPA. This would benefit the entire cloud-native community and align perfectly with the goals of Issue #1418.

Proposed API Extension for katalyst iHPA:

// In katalyst-api/pkg/apis/autoscaling/v1alpha2/ihpa.go
type IntelligentHorizontalPodAutoscalerSpec struct {
    // Existing fields...
    Autoscaler AutoscalerSpec `json:"autoscaler"`
    
    // New: Predictive scaling configuration
    // +optional
    PredictiveConfig *PredictiveConfig `json:"predictiveConfig,omitempty"`
}

type PredictiveConfig struct {
    Enabled bool `json:"enabled"`
    LookAheadMinutes *int32 `json:"lookAheadMinutes,omitempty"`
    HistoryDays *int32 `json:"historyDays,omitempty"`
}

Phase 2: AIBrix Integration with Enhanced iHPA

Once katalyst iHPA has predictive capabilities, AIBrix could integrate it seamlessly:

AIBrix PodAutoscaler API Changes:

// In aibrix/api/autoscaling/v1alpha1/podautoscaler_types.go
type PodAutoscalerSpec struct {
    // Existing AIBrix fields
    ScaleTargetRef corev1.ObjectReference `json:"scaleTargetRef"`
    MinReplicas *int32 `json:"minReplicas,omitempty"`
    MaxReplicas int32 `json:"maxReplicas"`
    MetricsSources []MetricSource `json:"metricsSources,omitempty"`
    ScalingStrategy ScalingStrategyType `json:"scalingStrategy"`
    
    // New: Predictive scaling delegation to iHPA
    // +optional
    PredictiveScaling *PredictiveScaling `json:"predictiveScaling,omitempty"`
}

type PredictiveScaling struct {
    Enabled bool `json:"enabled"`
    LookAheadMinutes *int32 `json:"lookAheadMinutes,omitempty"`
    HistoryDays *int32 `json:"historyDays,omitempty"`
}

Predictive Autoscaling Logic (Netflix-Style)

Core Philosophy: "Better to over-provision than under-serve"

// Netflix-style predictive autoscaling implementation
type PredictiveScaler interface {
    // Predict required replica count based on historical patterns
    PredictReplicas(ctx context.Context, 
                   currentState WorkloadState,
                   historicalData []MetricPoint,
                   lookAheadMinutes int32) (int32, error)
}

// Main scaling decision logic - Netflix approach
func (r *PodAutoscalerController) calculateDesiredReplicas(
    ctx context.Context, pa *PodAutoscaler) (int32, string, error) {

    // Only go predictive when the feature is enabled and configured
    if pa.Spec.PredictiveScaling != nil && pa.Spec.PredictiveScaling.Enabled &&
        pa.Spec.PredictiveScaling.LookAheadMinutes != nil {
        // Netflix style: Trust the prediction algorithm
        predictedReplicas, err := r.predictiveScaler.PredictReplicas(
            ctx,
            r.getCurrentState(pa),
            r.getHistoricalMetrics(pa),
            *pa.Spec.PredictiveScaling.LookAheadMinutes,
        )

        if err != nil {
            // Only fall back when prediction completely fails
            return r.calculateReactiveReplicas(pa), "reactive-fallback", nil
        }

        // Netflix philosophy: Aggressive predictive expansion
        // Reactive supplements upward when current load exceeds prediction
        reactiveReplicas := r.calculateReactiveReplicas(pa)

        // Take the maximum - better to over-provision than under-serve
        return max(predictedReplicas, reactiveReplicas), "predictive", nil
    }

    // Pure reactive scaling (traditional HPA behavior)
    return r.calculateReactiveReplicas(pa), "reactive", nil
}

// Traditional HPA-style calculation for reactive scaling
func (r *PodAutoscalerController) calculateReactiveReplicas(pa *PodAutoscaler) int32 {
    currentMetrics := r.getCurrentMetrics(pa)
    targetValue := r.getTargetValue(pa)
    currentReplicas := pa.Status.CurrentReplicas
    
    // Standard HPA formula
    return int32(math.Ceil(float64(currentReplicas) * currentMetrics / targetValue))
}

Architecture & Usage Example

Netflix-Style Decision Flow:

┌─────────────────────┐
│ Historical Data     │
│ Collection          │
└─────────────────────┘
          │
          ▼
┌─────────────────────┐
│ Predictive          │ ──► Predicted Replica Count
│ Algorithm           │     (aggressive expansion)
└─────────────────────┘
          │
          ▼
┌─────────────────────┐
│ Algorithm Success?  │
└─────────────────────┘
        │
    ┌───┴───┐
    ▼       ▼
 Success  Failure
    │       │
    ▼       ▼
┌────────┐ ┌──────────────┐
│Use     │ │Fallback to   │
│Predict │ │Reactive Only │
└────────┘ └──────────────┘
    │
    ▼
┌─────────────────────┐
│ Reactive Check      │ ──► Current load calculation
│ (Supplementation)   │
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│ max(predicted,      │ ──► Netflix: Better to over-provision
│     reactive)       │
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│ Apply Safety Bounds │
│ (min/max replicas)  │
└─────────────────────┘

For LLM workloads in AIBrix:

apiVersion: autoscaling.aibrix.io/v1alpha1
kind: PodAutoscaler
metadata:
  name: llm-service-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-service
  minReplicas: 2          # Safety floor
  maxReplicas: 20         # Safety ceiling  
  scalingStrategy: APA    # AIBrix Predictive Algorithm
  metricsSources:
  - metricSourceType: pod
    protocolType: http
    path: /metrics
    targetMetric: "kv_cache_utilization"
    targetValue: "70"
  predictiveScaling:
    enabled: true
    lookAheadMinutes: 8    # ~8 minutes, enough to cover the LLM cold start
    historyDays: 14        # Capture weekly business patterns

Example Scenario - Morning Preparation:

Time: 09:30
• current_replicas: 5
• current_kv_cache_utilization: 45%
• predicted_replicas: 12 (algorithm forecasts 10am meeting rush)
• reactive_replicas: ceil(5 * 45% / 70%) = ceil(3.21) = 4 (current need)

Netflix Decision:
• Use prediction: 12 replicas (trust the algorithm)
• Reactive check: max(12, 4) = 12
• Action: Aggressive pre-scaling to 12 replicas

Result: LLM service ready before traffic spike hits

Example Scenario - Unexpected Load:

Time: 14:30
• current_replicas: 8 (from earlier prediction)
• current_kv_cache_utilization: 95% (viral content causes spike)
• predicted_replicas: 6 (normal afternoon pattern)
• reactive_replicas: ceil(8 * 95% / 70%) = ceil(10.86) = 11 (current overload)

Netflix Decision:
• Prediction suggests: 6 replicas
• Current load needs: 11 replicas
• max(6, 11) = 11 replicas
• Action: Reactive scaling supplements prediction

Result: System handles unexpected load gracefully

Benefits of This Netflix-Style Approach

  1. Simple & Effective: No complex confidence thresholds - trust the prediction
  2. Over-provision Philosophy: Better to waste some resources than lose service quality
  3. Predictive-Led: Proactive scaling prevents performance degradation
  4. Reactive Safety Net: Handles scenarios the prediction didn't anticipate
  5. Production Proven: Based on Netflix's battle-tested approach at scale
  6. LLM-Optimized: Accounts for long cold start times and capacity planning needs

Discussion Points

  1. Contribution Strategy: Does contributing predictive capabilities to katalyst iHPA align with the project's roadmap?
  2. Implementation Timeline: Should I prioritize the katalyst enhancement first?
  3. Algorithm Choice: For LLM workloads, should the predictive algorithm consider GPU utilization, model loading patterns, and request queuing in addition to basic metrics?

I believe this collaborative approach following Netflix's proven methodology could create significant value for both projects and the broader community. Looking forward to your thoughts!

Thanks!

Belyenochi avatar Aug 19 '25 13:08 Belyenochi

What is the update on this issue, is it still expected to stay a priority?

igor-susic1 avatar Sep 09 '25 12:09 igor-susic1

@igor-susic1 yes. @Belyenochi has had some errands to deal with recently; we'd like to ship this feature as fast as we can.

Jeffwan avatar Oct 13 '25 06:10 Jeffwan

@Jeffwan thanks for the reply. If any help is needed I would love to contribute; I myself would love to get hands-on with the feature as soon as possible.

igor-susic1 avatar Oct 13 '25 06:10 igor-susic1