Add Circuit Breaker Policy for HPA on Bad Metrics
🚀 Feature Description and Motivation
Currently, when the PodAutoscaler (HPA/KPA/APA) receives abnormal or invalid metrics (e.g., NaN, outliers, sudden spikes), or sees unexpected behavior such as a rising error rate, it may still continue scaling, which can destabilize the system. To improve resilience, we should introduce a circuit breaker mechanism in the autoscaler.
Introduce a configurable policy that defines how the autoscaler should behave when encountering bad or suspicious metrics:
Circuit breaker trigger conditions
- Invalid values (e.g., NaN, negative, impossible values)
- Abnormal fluctuations outside configured tolerances
Policy options once triggered
- Extend to maximum: Scale target replicas to the maximum defined in spec and hold there until metrics recover.
- Freeze at current state: Keep the current replica count unchanged until metrics return to normal.
- (Future) Fallback behavior: Use alternative metrics source or default values.
- webhook
spec:
  scalingStrategy:
    circuitBreaker:
      enabled: true
      trigger: invalidMetrics
      action: freeze # options: [freeze, max]
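To make the trigger conditions and the two actions concrete, here is a minimal Go sketch of the decision step. It is not the AIBrix implementation: `CircuitBreaker`, `isBadMetric`, `decideReplicas`, and the `MaxRelativeChange` tolerance are hypothetical names and assumptions introduced only for illustration. Only the `freeze` and `max` actions come from the proposal above.

```go
package main

import "math"

// Action mirrors the `action` values from the proposal (freeze, max).
type Action string

const (
	ActionFreeze Action = "freeze"
	ActionMax    Action = "max"
)

// CircuitBreaker is a hypothetical in-memory form of the proposed config.
type CircuitBreaker struct {
	Enabled bool
	Action  Action
	// MaxRelativeChange is an assumed tolerance: 0.5 allows a metric to move
	// by at most 50% between two samples. Zero disables the fluctuation check.
	MaxRelativeChange float64
}

// isBadMetric reports whether the current sample should trip the breaker:
// NaN, infinite, or negative values, or a jump beyond the tolerance.
func isBadMetric(previous, current, maxRelativeChange float64) bool {
	if math.IsNaN(current) || math.IsInf(current, 0) || current < 0 {
		return true
	}
	if maxRelativeChange > 0 && previous > 0 {
		return math.Abs(current-previous)/previous > maxRelativeChange
	}
	return false
}

// decideReplicas returns the replica count to apply. desiredReplicas is
// whatever the scaling algorithm computed; it is only used when the metric
// looks healthy or the breaker is disabled.
func decideReplicas(cb CircuitBreaker, previous, current float64,
	currentReplicas, desiredReplicas, maxReplicas int32) int32 {
	if !cb.Enabled || !isBadMetric(previous, current, cb.MaxRelativeChange) {
		return desiredReplicas
	}
	if cb.Action == ActionMax {
		return maxReplicas
	}
	// freeze: hold the current replica count (also used for unknown actions).
	return currentReplicas
}
```

Defaulting unknown actions to `freeze` is a design choice in this sketch: holding the current state seems the least surprising behavior when the configuration cannot be interpreted.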
Use Case
To protect services from being scaled into an unexpected state when metrics are bad.
Proposed Solution
No response
/assign
We currently define scalingStrategy as a simple string enum in the PodAutoscalerSpec: https://github.com/vllm-project/aibrix/blob/8b09568cc4a8de7971b1782b940ff8cbc626bd91/api/autoscaling/v1alpha1/podautoscaler_types.go#L70-L72
So if we change a field type from a simple string to a complex object, existing CRD instances will become invalid.
We can of course create a webhook that converts between the two, but I was thinking we could just add a new ScalingPolicy field (and maybe mark the old ScalingStrategy as deprecated; we can keep it IMO).
Something like this:
type PodAutoscalerSpec struct {
	// ... other fields ...

	// ScalingStrategy defines the strategy to use for scaling.
	// DEPRECATED: Use ScalingPolicy instead. This field is kept for backward compatibility.
	// +kubebuilder:validation:Enum={HPA,KPA,APA}
	// +optional
	ScalingStrategy *ScalingStrategyType `json:"scalingStrategy,omitempty"`

	// ScalingPolicy defines the strategy and policies to use for scaling, including circuit breaker configuration.
	// If both ScalingStrategy and ScalingPolicy are specified, ScalingPolicy takes precedence.
	// +optional
	ScalingPolicy *ScalingPolicyConfig `json:"scalingPolicy,omitempty"`
}

type ScalingPolicyConfig struct {
	// Type defines the scaling algorithm to use.
	// +kubebuilder:validation:Enum={HPA,KPA,APA}
	Type ScalingStrategyType `json:"type"`

	// CircuitBreaker defines the circuit breaker configuration for handling invalid or abnormal metrics.
	// +optional
	CircuitBreaker *CircuitBreakerConfig `json:"circuitBreaker,omitempty"`
}
Then from a user perspective it will look like:
# New format with circuit breaker
scalingPolicy:
  type: HPA
  circuitBreaker:
    enabled: true
    trigger: invalidMetrics
    action: freeze # options: [freeze, max]
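For the precedence rule ("ScalingPolicy takes precedence"), the controller could resolve the effective strategy roughly as sketched below. This is a standalone illustration, not code from the repo: the types are trimmed copies of the structs above, the `CircuitBreakerConfig` fields are guessed from the YAML example, and `effectiveStrategy` is a hypothetical helper name.

```go
package main

// ScalingStrategyType is redeclared here so the sketch compiles on its own.
type ScalingStrategyType string

const (
	HPA ScalingStrategyType = "HPA"
	KPA ScalingStrategyType = "KPA"
	APA ScalingStrategyType = "APA"
)

// CircuitBreakerConfig fields are assumptions taken from the YAML example.
type CircuitBreakerConfig struct {
	Enabled bool   `json:"enabled"`
	Trigger string `json:"trigger,omitempty"`
	Action  string `json:"action,omitempty"`
}

type ScalingPolicyConfig struct {
	Type           ScalingStrategyType   `json:"type"`
	CircuitBreaker *CircuitBreakerConfig `json:"circuitBreaker,omitempty"`
}

// PodAutoscalerSpec is trimmed to the two fields relevant to this sketch.
type PodAutoscalerSpec struct {
	ScalingStrategy *ScalingStrategyType `json:"scalingStrategy,omitempty"`
	ScalingPolicy   *ScalingPolicyConfig `json:"scalingPolicy,omitempty"`
}

// effectiveStrategy (hypothetical helper) prefers the new ScalingPolicy field
// and falls back to the deprecated ScalingStrategy. The boolean is false when
// neither field is set.
func effectiveStrategy(spec PodAutoscalerSpec) (ScalingStrategyType, bool) {
	if spec.ScalingPolicy != nil {
		return spec.ScalingPolicy.Type, true
	}
	if spec.ScalingStrategy != nil {
		return *spec.ScalingStrategy, true
	}
	return "", false
}
```

With a helper like this, existing objects that only set `scalingStrategy` keep working unchanged, and no conversion webhook is needed for the fallback path.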
We can extend scalingPolicy as needed.
@Jeffwan wdyt?