flagger icon indicating copy to clipboard operation
flagger copied to clipboard

Prevent primary hpa collision for keda scaled objects when migrating from an hpa

Open jdgeisler opened this issue 7 months ago • 2 comments

This MR fixes https://github.com/fluxcd/flagger/issues/1646 where Flagger fails to create the primary hpa for a keda scaled object when you are migrating from a regular hpa to a scaled object in a canary.

Using the scaledobject.keda.sh/transfer-hpa-ownership: "true" annotation, a keda scaled object can take ownership of an already existing HPA. Currently, annotations are not copied over from the canary scaled object to the primary scaled object, so I am adding this functionality.

Additionally, we also need to add the Advanced.HorizontalPodAutoscalerConfig.Name field to the ScaledObjectSpec so that we can specify the HPA name that the scaled object will take ownership of. When creating the primary scaled object from the canary, we use the same hpa name but append -primary to the end of it.

Testing

With the above changes, we can then successfully use the transfer hpa ownership annotation and specify the HPA that the scaled objects will take ownership of. This prevents KEDA from failing to create the new primary hpa since it already exists. Instead it just uses and modifies the already existing one.

Applying the following diff to test in a kubernetes cluster:

fortio, fortio-server-deployment-2, Canary (flagger.app) has changed:
  # Source: fortio/templates/fortio.yaml
  apiVersion: flagger.app/v1beta1
  kind: Canary
  metadata:
    name: fortio-server-deployment-2
    namespace: fortio
  spec:
    analysis:
      interval: 30s
      maxWeight: 50
      metrics:
      - interval: 30s
        name: request-success-rate
        thresholdRange:
          min: 99
      stepWeight: 10
      threshold: 10
      webhooks:
      - metadata:
          async: "on"
          c: "6"
          labels: fortio-server-deployment-2
          load: Start
          nocatchup: "on"
          qps: "12"
          save: "on"
          stdclient: "on"
          t: 600s
          uniform: "on"
          url: http://fortio-server-deployment-2-canary.fortio.svc.cluster.local:8080/
        name: generateLoad
        timeout: 60s
        type: pre-rollout
        url: http://fortio-client.fortio.svc.cluster.local:8080/fortio/rest/run?jsonPath=.metadata
    progressDeadlineSeconds: 600
    autoscalerRef:
-     apiVersion: autoscaling/v2
-     kind: HorizontalPodAutoscaler
+     apiVersion: keda.sh/v1alpha1
+     kind: ScaledObject
      name: fortio-server-deployment-2
    targetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: fortio-server-deployment-2
    service:
      port: 8080
      targetPort: 8080
fortio, fortio-server-deployment-2, HorizontalPodAutoscaler (autoscaling) has been removed:
- # Source: fortio/templates/fortio.yaml
- apiVersion: autoscaling/v2
- kind: HorizontalPodAutoscaler
- metadata:
-   name: fortio-server-deployment-2
-   namespace: fortio
- spec:
-   maxReplicas: 4
-   metrics:
-   - resource:
-       name: cpu
-       target:
-         averageUtilization: 99
-         type: Utilization
-     type: Resource
-   minReplicas: 2
-   scaleTargetRef:
-     apiVersion: apps/v1
-     kind: Deployment
-     name: fortio-server-deployment-2
+
fortio, fortio-server-deployment-2, ScaledObject (keda.sh) has been added:
-
+ # Source: fortio/templates/fortio.yaml
+ apiVersion: keda.sh/v1alpha1
+ kind: ScaledObject
+ metadata:
+   name: fortio-server-deployment-2
+   namespace: fortio
+   annotations:
+     scaledobject.keda.sh/transfer-hpa-ownership: "true"
+ spec:
+   advanced:
+     horizontalPodAutoscalerConfig:
+       name: fortio-server-deployment-2
+   scaleTargetRef:
+     name: fortio-server-deployment-2
+   minReplicaCount: 2
+   maxReplicaCount: 4
+   triggers:
+   - type: prometheus
+     metadata:
+       serverAddress: http://kube-prometheus-stack-prometheus:9090/prometheus
+       threshold: '100'
+       query: sum(rate(istio_requests_total{destination_service_name="fortio-server-2"}[1m]))
+   - type: cpu
+     metricType: Utilization
+     metadata:
+       value: "99"

See that the primary scaled object successfully takes over ownership of the primary hpa and it no longer fails.

(⎈|arn:aws:eks:us-east-2:824163854805:cluster/rokumesh-auto-jgeisler-t7:fortio)jgeisler@PK4V290RH9 platform % k get so
NAME                                 SCALETARGETKIND      SCALETARGETNAME                      MIN   MAX   TRIGGERS     AUTHENTICATION   READY   ACTIVE    FALLBACK   PAUSED    AGE
fortio-server-deployment-2           apps/v1.Deployment   fortio-server-deployment-2           2     4     prometheus                    True    Unknown   False      True      170m
fortio-server-deployment-2-primary   apps/v1.Deployment   fortio-server-deployment-2-primary   2     4     prometheus                    True    False     False      Unknown   170m
(⎈|arn:aws:eks:us-east-2:824163854805:cluster/rokumesh-auto-jgeisler-t7:fortio)jgeisler@PK4V290RH9 platform % k get hpa
NAME                                 REFERENCE                                       TARGETS               MINPODS   MAXPODS   REPLICAS   AGE
fortio-server-deployment-2-primary   Deployment/fortio-server-deployment-2-primary   0/100 (avg), 4%/99%   2         4         2          13d

jdgeisler avatar Jul 10 '24 15:07 jdgeisler