
Add Circuit Breaker Policy for HPA on Bad Metrics

Open Jeffwan opened this issue 3 months ago • 2 comments

🚀 Feature Description and Motivation

Currently, when the PodAutoscaler (HPA/KPA/APA) receives abnormal or invalid metrics (e.g., NaN, outliers, sudden spikes) or observes unexpected behavior such as a rising error rate, it may still continue scaling, which can destabilize the system. To improve resilience, we should introduce a circuit breaker mechanism in the autoscaler.

Introduce a configurable policy that defines how the autoscaler should behave when encountering bad or suspicious metrics:

Circuit breaker trigger conditions

  • Invalid values (e.g., NaN, negative, impossible values)
  • Abnormal fluctuations outside configured tolerances

Policy options once triggered

  • Extend to maximum: Scale target replicas to the maximum defined in spec and hold there until metrics recover.
  • Freeze at current state: Keep the current replica count unchanged until metrics return to normal.
  • (Future) Fallback behavior: Use alternative metrics source or default values.
  • webhook
spec:
  scalingStrategy:
    circuitBreaker:
      enabled: true
      trigger: invalidMetrics
      action: freeze # options: [freeze, max]

Use Case

To protect services from being driven into unexpected states by bad metrics.

Proposed Solution

No response

Jeffwan avatar Sep 22 '25 21:09 Jeffwan

/assign

omerap12 avatar Sep 26 '25 08:09 omerap12

We currently define scalingStrategy as a simple string enum in the PodAutoscalerSpec: https://github.com/vllm-project/aibrix/blob/8b09568cc4a8de7971b1782b940ff8cbc626bd91/api/autoscaling/v1alpha1/podautoscaler_types.go#L70-L72

So if we change the field type from a simple string to a complex object, existing CRD instances will become invalid. We could of course create a webhook that converts between the two, but I was thinking we could just add a new ScalingPolicy field (and maybe mark the old ScalingStrategy as deprecated; we can keep it, IMO). Something like this:

type PodAutoscalerSpec struct {
	// ... other fields ...

	// ScalingStrategy defines the strategy to use for scaling.
	// DEPRECATED: Use ScalingPolicy instead. This field is kept for backward compatibility.
	// +kubebuilder:validation:Enum={HPA,KPA,APA}
	// +optional
	ScalingStrategy *ScalingStrategyType `json:"scalingStrategy,omitempty"`

	// ScalingPolicy defines the strategy and policies to use for scaling, including circuit breaker configuration.
	// If both ScalingStrategy and ScalingPolicy are specified, ScalingPolicy takes precedence.
	// +optional
	ScalingPolicy *ScalingPolicyConfig `json:"scalingPolicy,omitempty"`
}

type ScalingPolicyConfig struct {
	// Type defines the scaling algorithm to use.
	// +kubebuilder:validation:Enum={HPA,KPA,APA}
	Type ScalingStrategyType `json:"type"`

	// CircuitBreaker defines the circuit breaker configuration for handling invalid or abnormal metrics.
	// +optional
	CircuitBreaker *CircuitBreakerConfig `json:"circuitBreaker,omitempty"`
}

Then, from a user's perspective, it will look like:

# New format with circuit breaker
scalingPolicy:
  type: HPA
  circuitBreaker:
    enabled: true
    trigger: invalidMetrics
    action: freeze # options: [freeze, max]
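
On the controller side, the precedence rule from the field comment (ScalingPolicy wins when both are set) could be resolved with a small helper. A minimal sketch, using trimmed-down copies of the types and a hypothetical `effectiveStrategy` function; the constant names are placeholders and may not match the ones in the repo:

```go
package main

import "fmt"

// Trimmed-down copies of the proposed types, for illustration only.
type ScalingStrategyType string

const (
	HPA ScalingStrategyType = "HPA"
	KPA ScalingStrategyType = "KPA"
	APA ScalingStrategyType = "APA"
)

type ScalingPolicyConfig struct {
	Type ScalingStrategyType
}

type PodAutoscalerSpec struct {
	ScalingStrategy *ScalingStrategyType // deprecated field
	ScalingPolicy   *ScalingPolicyConfig // new field
}

// effectiveStrategy (hypothetical helper) returns the strategy to use.
// ScalingPolicy takes precedence; the deprecated ScalingStrategy is the fallback.
// The boolean is false when neither field is set.
func effectiveStrategy(spec PodAutoscalerSpec) (ScalingStrategyType, bool) {
	if spec.ScalingPolicy != nil {
		return spec.ScalingPolicy.Type, true
	}
	if spec.ScalingStrategy != nil {
		return *spec.ScalingStrategy, true
	}
	return "", false
}

func main() {
	legacy := KPA
	spec := PodAutoscalerSpec{
		ScalingStrategy: &legacy,
		ScalingPolicy:   &ScalingPolicyConfig{Type: HPA},
	}
	s, ok := effectiveStrategy(spec)
	fmt.Println(s, ok) // prints "HPA true"
}
```

Existing manifests that only set scalingStrategy keep working through the fallback branch.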

We can extend scalingPolicy as needed. @Jeffwan wdyt?

omerap12 avatar Sep 27 '25 15:09 omerap12