
HPA reconcile keeps generating new recommendations

Open Jeffwan opened this issue 2 months ago • 4 comments

🐛 Describe the bug

  1. When we update the HPA configuration, the controller enqueues the objects and immediately generates new recommendations.

  2. The controller normally updates the CR object, which triggers a new reconciliation loop.

We should use a default freeze window to skip overly frequent updates.

I0930 22:50:10.750098       1 autoscaler.go:269] "Collecting metrics" source="default/podautoscaler-mock-llama2-7b" total pods=10 metrics available pods=10
I0930 22:50:10.750123       1 autoscaler.go:273] "Processing metrics snapshot" source="default/podautoscaler-mock-llama2-7b" values=[8,21,36,51,44,10,56.99999999999999,30,19,14.000000000000002]
I0930 22:50:10.750137       1 autoscaler.go:298] "Metrics aggregated" currentValue=43.68571428571429 trend=0 confidence=0 podCount=10
I0930 22:50:10.750143       1 autoscaler.go:320] "Computing scaling recommendation" source="default/podautoscaler-mock-llama2-7b" algorithm="apa"
I0930 22:50:10.750156       1 autoscaler.go:326] "Scaling recommendation computed" source="default/podautoscaler-mock-llama2-7b" algorithm="apa" recommendation={"DesiredReplicas":8,"Confidence":0,"Reason":"apa scaling based on current metrics","Algorithm":"apa","ScaleValid":true,"Metadata":{"current_value":43.68571428571429,"trend":0}}
I0930 22:50:10.759901       1 podautoscaler_controller.go:523] "Successfully rescaled" PodAutoscaler="default/podautoscaler-mock-llama2-7b" currentReplicas=10 desiredReplicas=8 reason="All metrics below target"
I0930 22:50:10.983530       1 autoscaler.go:269] "Collecting metrics" source="default/podautoscaler-mock-llama2-7b" total pods=10 metrics available pods=10
I0930 22:50:10.983570       1 autoscaler.go:273] "Processing metrics snapshot" source="default/podautoscaler-mock-llama2-7b" values=[12,56.99999999999999,72,71,72,54,33,35,7.000000000000001,5]
I0930 22:50:10.983597       1 autoscaler.go:298] "Metrics aggregated" currentValue=45.51428571428571 trend=0 confidence=0 podCount=10
I0930 22:50:10.983616       1 autoscaler.go:320] "Computing scaling recommendation" source="default/podautoscaler-mock-llama2-7b" algorithm="apa"
I0930 22:50:10.983658       1 autoscaler.go:326] "Scaling recommendation computed" source="default/podautoscaler-mock-llama2-7b" algorithm="apa" recommendation={"DesiredReplicas":7,"Confidence":0,"Reason":"apa scaling based on current metrics","Algorithm":"apa","ScaleValid":true,"Metadata":{"current_value":45.51428571428571,"trend":0}}
E0930 22:50:10.997355       1 controller.go:316] "msg"="Reconciler error" "error"="failed to apply scaling for Deployment/default/mock-llama2-7b: Operation cannot be fulfilled on deployments.apps \"mock-llama2-7b\": the object has been modified; please apply your changes to the latest version and try again" "PodAutoscaler"={"name":"podautoscaler-mock-llama2-7b","namespace":"default"} "controller"="podautoscaler" "controllerGroup"="autoscaling.aibrix.ai" "controllerKind"="PodAutoscaler" "name"="podautoscaler-mock-llama2-7b" "namespace"="default" "reconcileID"="34119e05-3292-485c-8a90-4b9e124143f2"
I0930 22:50:11.060594       1 autoscaler.go:269] "Collecting metrics" source="default/podautoscaler-mock-llama2-7b" total pods=10 metrics available pods=10
I0930 22:50:11.060615       1 autoscaler.go:273] "Processing metrics snapshot" source="default/podautoscaler-mock-llama2-7b" values=[82,62,20,60,34,31,78,60,5,84]
I0930 22:50:11.060628       1 autoscaler.go:298] "Metrics aggregated" currentValue=46.275000000000006 trend=0 confidence=0 podCount=10
I0930 22:50:11.060637       1 autoscaler.go:320] "Computing scaling recommendation" source="default/podautoscaler-mock-llama2-7b" algorithm="apa"
I0930 22:50:11.060652       1 autoscaler.go:326] "Scaling recommendation computed" source="default/podautoscaler-mock-llama2-7b" algorithm="apa" recommendation={"DesiredReplicas":7,"Confidence":0,"Reason":"apa scaling based on current metrics","Algorithm":"apa","ScaleValid":true,"Metadata":{"current_value":46.275000000000006,"trend":0}}
I0930 22:50:11.069311       1 podautoscaler_controller.go:523] "Successfully rescaled" PodAutoscaler="default/podautoscaler-mock-llama2-7b" currentReplicas=8 desiredReplicas=7 reason="All metrics below target"
I0930 22:50:11.139566       1 autoscaler.go:269] "Collecting metrics" source="default/podautoscaler-mock-llama2-7b" total pods=10 metrics available pods=10
I0930 22:50:11.139608       1 autoscaler.go:273] "Processing metrics snapshot" source="default/podautoscaler-mock-llama2-7b" values=[91,24,48,28.000000000000004,47,28.000000000000004,80,71,84,43]
I0930 22:50:11.139631       1 autoscaler.go:298] "Metrics aggregated" currentValue=46.62499999999999 trend=0 confidence=0 podCount=10
I0930 22:50:11.139643       1 autoscaler.go:320] "Computing scaling recommendation" source="default/podautoscaler-mock-llama2-7b" algorithm="apa"
I0930 22:50:11.139670       1 autoscaler.go:326] "Scaling recommendation computed" source="default/podautoscaler-mock-llama2-7b" algorithm="apa" recommendation={"DesiredReplicas":6,"Confidence":0,"Reason":"apa scaling based on current metrics","Algorithm":"apa","ScaleValid":true,"Metadata":{"current_value":46.62499999999999,"trend":0}}
E0930 22:50:11.150610       1 controller.go:316] "msg"="Reconciler error" "error"="failed to apply scaling for Deployment/default/mock-llama2-7b: Operation cannot be fulfilled on deployments.apps \"mock-llama2-7b\": the object has been modified; please apply your changes to the latest version and try again" "PodAutoscaler"={"name":"podautoscaler-mock-llama2-7b","namespace":"default"} "controller"="podautoscaler" "controllerGroup"="autoscaling.aibrix.ai" "controllerKind"="PodAutoscaler" "name"="podautoscaler-mock-llama2-7b" "namespace"="default" "reconcileID"="8b5a5f2b-7586-46c1-91e1-896bc81b29d4"
I0930 22:50:11.197372       1 autoscaler.go:269] "Collecting metrics" source="default/podautoscaler-mock-llama2-7b" total pods=10 metrics available pods=10
I0930 22:50:11.197393       1 autoscaler.go:273] "Processing metrics snapshot" source="default/podautoscaler-mock-llama2-7b" values=[26,35,59,71,17,70,98,80,68,17]
I0930 22:50:11.197405       1 autoscaler.go:298] "Metrics aggregated" currentValue=46.5875 trend=0 confidence=0 podCount=10
I0930 22:50:11.197414       1 autoscaler.go:320] "Computing scaling recommendation" source="default/podautoscaler-mock-llama2-7b" algorithm="apa"
I0930 22:50:11.197427       1 autoscaler.go:326] "Scaling recommendation computed" source="default/podautoscaler-mock-llama2-7b" algorithm="apa" recommendation={"DesiredReplicas":6,"Confidence":0,"Reason":"apa scaling based on current metrics","Algorithm":"apa","ScaleValid":true,"Metadata":{"current_value":46.5875,"trend":0}}
I0930 22:50:11.205956       1 podautoscaler_controller.go:523] "Successfully rescaled" PodAutoscaler="default/podautoscaler-mock-llama2-7b" currentReplicas=7 desiredReplicas=6 reason="All metrics below target"
I0930 22:50:11.254147       1 autoscaler.go:269] "Collecting metrics" source="default/podautoscaler-mock-llama2-7b" total pods=10 metrics available pods=10
I0930 22:50:11.254169       1 autoscaler.go:273] "Processing metrics snapshot" source="default/podautoscaler-mock-llama2-7b" values=[68,76,12,47,53,30,56.00000000000001,42,5,68]
I0930 22:50:11.254181       1 autoscaler.go:298] "Metrics aggregated" currentValue=45.5375 trend=0 confidence=0 podCount=10
I0930 22:50:11.254191       1 autoscaler.go:320] "Computing scaling recommendation" source="default/podautoscaler-mock-llama2-7b" algorithm="apa"
I0930 22:50:11.254205       1 autoscaler.go:326] "Scaling recommendation computed" source="default/podautoscaler-mock-llama2-7b" algorithm="apa" recommendation={"DesiredReplicas":5,"Confidence":0,"Reason":"apa scaling based on current metrics","Algorithm":"apa","ScaleValid":true,"Metadata":{"current_value":45.5375,"trend":0}}

Steps to Reproduce

Update the HPA configuration.

Expected behavior

Scaling should be stabilized instead of generating a new recommendation on every reconcile.

Environment

nightly

Jeffwan avatar Sep 30 '25 22:09 Jeffwan

/assign

will take a look at this one

googs1025 avatar Oct 09 '25 01:10 googs1025

@Jeffwan There is a simple way (similar to the built-in HPA resource in Kubernetes):

Add a stabilizationWindowSeconds field to PodAutoscalerSpec (default: 300s):

// StabilizationWindowSeconds is the number of seconds the autoscaler should wait
// before scaling (both up and down) again after a successful scale operation.
// This prevents rapid fluctuations in replica count.
// Defaults to 300 (5 minutes) if not specified.
// +optional
// +kubebuilder:default=300
// +kubebuilder:validation:Minimum=0
// +kubebuilder:validation:Maximum=3600
StabilizationWindowSeconds *int32 `json:"stabilizationWindowSeconds,omitempty"`

In the reconcile loop, before calling executeScalingPipeline, check if we’re within the stabilization window after the last successful scale:

googs1025 avatar Oct 09 '25 02:10 googs1025

Or we can expose it like the HPA `behavior` field, letting users set a separate cool-down window per direction:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 60 
  scaleDown:
    stabilizationWindowSeconds: 600 

googs1025 avatar Oct 09 '25 09:10 googs1025

apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  name: ss-pool-decode
  namespace: default
  annotations:
    autoscaling.aibrix.ai/storm-service-mode: "pool"
spec:
  scaleTargetRef:
    apiVersion: orchestration.aibrix.ai/v1alpha1
    kind: StormService
    name: ss-pool

  # Select the decode role within the StormService
  subTargetSelector:
    roleName: decode

  minReplicas: 3
  maxReplicas: 30
  scalingStrategy: APA

  metricsSources:
    - metricSourceType: pod
      protocolType: http
      port: "8000"
      path: /metrics
      targetMetric: "decode_batch_utilization"
      targetValue: "70"

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  
    scaleDown:
      stabilizationWindowSeconds: 600

googs1025 avatar Oct 09 '25 09:10 googs1025