
HPA scales up stable replicaset to max when doing canary deploys

Open markh123 opened this issue 1 year ago • 14 comments

Checklist:

  • [ ] I've included steps to reproduce the bug.
  • [ ] I've included the version of argo rollouts.

Describe the bug

We use canary deploys via Argo Rollouts to deploy services. In the services that use the Kubernetes Horizontal Pod Autoscaler with memory-based scaling (we don't see the same issue with CPU-based scaling), the stable ReplicaSet scales up to max replicas during each deploy and then scales back down after the deploy completes.

Looking at the metrics reported for the service via both kubectl describe hpa and kubectl get hpa during the scale-ups, I never see the reported metrics exceed their targets, nor do I see them exceed the targets in the corresponding Prometheus metrics:

Metrics:                                                  ( current / target )
  resource memory on pods  (as a percentage of request):  43% (486896088615m) / 70%
  resource cpu on pods  (as a percentage of request):     3% (43m) / 70% 

However, I still see HPA events scaling up the service due to memory:

  Normal  SuccessfulRescale  2m58s (x8 over 2d20h)  horizontal-pod-autoscaler  New size: 13; reason: memory resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  11s (x16 over 2d21h)   horizontal-pod-autoscaler  New size: 15; reason: memory resource utilization (percentage of request) above target

The HPA configuration works as expected during normal operation and only misbehaves during Argo Rollouts deploys, which is why I think this is likely a bug in how Argo Rollouts interacts with the HPA.

Note that the replica count doesn't always go all the way to max. If we increase the pods' memory requests and/or raise the memory scaling target, we can reduce the number of replicas added during deployment. However, this isn't a great solution: it adds cost, since we run machines with much more memory than we need just to dampen the problem.
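
One possible mitigation, untested in this report, is to dampen HPA scale-ups via the autoscaling/v2 behavior field instead of over-provisioning memory. A minimal sketch (fragment of an HPA spec; the values are illustrative):

spec:
  behavior:
    scaleUp:
      # Use the most conservative recommendation from the last 5 minutes and
      # add at most 2 pods per minute, instead of reacting to every spike.
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60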

To Reproduce

I haven't set up an isolated reproduction, but I think all that's needed is a service with a memory-based HPA that runs at roughly 50% memory utilization against a 70% target. Perform a canary deploy of that service and it should scale up during the deploy; a minimal manifest pair along those lines is sketched below.
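
A sketch of such a setup, with placeholder names and image (not verified as an exact reproduction):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: repro-service
spec:
  # replicas is omitted so the HPA owns the count
  selector:
    matchLabels:
      app: repro-service
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}
  template:
    metadata:
      labels:
        app: repro-service
    spec:
      containers:
        - name: app
          image: example/app:v2   # bump the tag to start a canary deploy
          resources:
            requests:
              memory: 512Mi       # sized so steady-state usage sits near 50%
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: repro-service
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: repro-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70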

Expected behavior

I expect the stable ReplicaSet not to scale up during the deploy unless an increase in traffic/utilization necessitates it.

Screenshots

The screenshot below shows the replica count during a deploy. The green line is the stable set and the yellow line is the canary set. You can see it scale up during the deployment and back down afterwards.

[screenshot: Screen Shot 2023-06-23 at 2 12 06 PM]

Version

v1.5.1

Logs

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

markh123 avatar Jun 23 '23 18:06 markh123

This issue is stale because it has been open 60 days with no activity.

github-actions[bot] avatar Aug 23 '23 02:08 github-actions[bot]

/remove stale (didn't work but same effect)

jandersen-plaid avatar Aug 31 '23 17:08 jandersen-plaid

This issue is stale because it has been open 60 days with no activity.

github-actions[bot] avatar Nov 01 '23 02:11 github-actions[bot]

  • Argo Rollouts version: 1.6.2
  • I have KEDA managing the HPA (most likely not relevant to the issue)
  • I also have OPA Gatekeeper mutations (they might be relevant; I will try to share the logs)

A similar issue is happening with my setup. It only happens when a traffic-routing configuration is in place and the HPA has memory-based autoscaling. I can only avoid it by setting the memory utilization threshold to an unrealistically high number, like 95% or 99%.

No matter what the average or maximum memory utilization is during the rollout, the HPA reports New size: X; reason: memory resource utilization (percentage of request) above target and scales both the stable and the canary sets to max, but only at the last step of the rollout.
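
For reference, plugging the numbers from the "last step" status pasted further down (15 replicas at 46% memory utilization against an 85% target) into the HPA formula from the Kubernetes documentation yields a scale-down recommendation, not a scale-up:

desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
                = ceil(15 * 46 / 85)
                = ceil(8.12)
                = 9   # fewer pods than current, the opposite of "above target"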

This is my HPA (managed by KEDA):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app.kubernetes.io/managed-by: keda-operator
    app.kubernetes.io/name: keda-hpa-xxx
    app.kubernetes.io/part-of: xxx
    app.kubernetes.io/version: 2.12.1
    scaledobject.keda.sh/name: xxx
  name: keda-hpa-xxx
  namespace: xxxxx
  ownerReferences:
    - apiVersion: keda.sh/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: ScaledObject
      name: xxx
spec:
  maxReplicas: 40
  metrics:
    - external:
        metric:
          name: s1-cron-....
          selector:
            matchLabels:
              scaledobject.keda.sh/name: xxx
        target:
          averageValue: '1'
          type: AverageValue
      type: External
    - resource:
        name: memory
        target:
          averageUtilization: 85
          type: Utilization
      type: Resource
    - resource:
        name: cpu
        target:
          averageUtilization: 50
          type: Utilization
      type: Resource
  minReplicas: 5
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: xxx
status:
  conditions:
    - lastTransitionTime: '2023-09-14T13:36:39Z'
      message: recommended size matches current size
      reason: ReadyForNewScale
      status: 'True'
      type: AbleToScale
    - lastTransitionTime: '2024-01-02T15:34:09Z'
      message: >-
        the HPA was able to successfully calculate a replica count from cpu
        resource utilization (percentage of request)
      reason: ValidMetricFound
      status: 'True'
      type: ScalingActive
    - lastTransitionTime: '2024-01-03T11:00:26Z'
      message: the desired count is within the acceptable range
      reason: DesiredWithinRange
      status: 'False'
      type: ScalingLimited
  currentMetrics:
    - external:
        current:
          averageValue: 200m
          value: '0'
        metric:
          name: s1-cron-....
          selector:
            matchLabels:
              scaledobject.keda.sh/name: ...
      type: External
    - resource:
        current:
          averageUtilization: 52
          averageValue: 275596902400m
        name: memory
      type: Resource
    - resource:
        current:
          averageUtilization: 49
          averageValue: 249m
        name: cpu
      type: Resource
  currentReplicas: 5
  desiredReplicas: 5
  lastScaleTime: '2024-01-03T10:40:37Z'

The scaling up continues until max; after some time it goes back down and the rollout completes, or I need to promote to full at the last step to avoid scaling to max.
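
(For context, "promote to full" refers to the kubectl plugin command below; the rollout name and namespace are placeholders:)

# Skip the remaining canary steps and shift all traffic to the new version.
kubectl argo rollouts promote <rollout-name> -n <namespace> --full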

This is how my HPA looks during the last step:

[screenshot: HPA state during the last step]

During the scale-up at the last step, this is what my HPA says.

Scaling config:

    - resource:
        name: memory
        target:
          averageUtilization: 85
          type: Utilization

Status:

status:
  conditions:
    - lastTransitionTime: '2023-09-14T13:36:39Z'
      message: >-
        recent recommendations were higher than current one, applying the
        highest recent recommendation
      reason: ScaleDownStabilized
      status: 'True'
      type: AbleToScale
    - lastTransitionTime: '2024-01-02T15:34:09Z'
      message: >-
        the HPA was able to successfully calculate a replica count from memory
        resource utilization (percentage of request)
      reason: ValidMetricFound
      status: 'True'
      type: ScalingActive
    - lastTransitionTime: '2024-01-03T11:00:26Z'
      message: the desired count is within the acceptable range
      reason: DesiredWithinRange
      status: 'False'
      type: ScalingLimited
  currentMetrics:
    - external:
        current:
          averageValue: 67m
          value: '0'
        metric:
          name: s1-cron-...
          selector:
            matchLabels:
              scaledobject.keda.sh/name: xxx
      type: External
    - resource:
        current:
          averageUtilization: 46
          averageValue: 244952268800m
        name: memory
      type: Resource
    - resource:
        current:
          averageUtilization: 3
          averageValue: 17m
        name: cpu
      type: Resource
  currentReplicas: 15
  desiredReplicas: 15
  lastScaleTime: '2024-01-03T12:33:13Z'

Right after the rollout completes, the HPA starts scaling the ReplicaSet down:

[screenshot: HPA scaling the ReplicaSet down after the rollout]

hasan-tayyar-besik avatar Jan 03 '24 12:01 hasan-tayyar-besik

I have upgraded my Argo Rollouts Helm chart from 2.32.2 to 2.34.0 and the app version from 1.6.2 to 1.6.4. The issue persists.

hasan-tayyar-besik avatar Jan 03 '24 13:01 hasan-tayyar-besik

After deleting all my Gatekeeper policies I still get the same issue.

In the Argo Rollouts logs, this was a bit confusing to me:

2024-01-03T16:10:10+01:00	time="2024-01-03T15:10:10Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":11,\"availableReplicas\":11,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:10:08Z\",\"lastUpdateTime\":\"2024-01-03T15:10:08Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:10:10Z\",\"lastUpdateTime\":\"2024-01-03T15:10:10Z\",\"message\":\"Rollout is healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"True\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:10:10Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" has successfully progressed.\",\"reason\":\"NewReplicaSetAvailable\",\"status\":\"True\",\"type\":\"Progressing\"}],\"readyReplicas\":11,\"replicas\":11}}" generation=1065 namespace=my-namespace resourceVersion=751073974 rollout=app-name

...
2024-01-03T16:09:47+01:00	time="2024-01-03T15:09:47Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":13,\"availableReplicas\":12,\"readyReplicas\":12,\"replicas\":13}}" generation=1065 namespace=my-namespace resourceVersion=751073492 rollout=app-name

...
2024-01-03T16:09:37+01:00	time="2024-01-03T15:09:37Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":14,\"availableReplicas\":12,\"readyReplicas\":12,\"replicas\":14}}" generation=1065 namespace=my-namespace resourceVersion=751073328 rollout=app-name

...
2024-01-03T16:09:12+01:00	time="2024-01-03T15:09:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":20,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:09:12Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":20,\"updatedReplicas\":11}}" generation=1065 namespace=my-namespace resourceVersion=751072932 rollout=app-name

...
2024-01-03T16:09:12+01:00	time="2024-01-03T15:09:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":19,\"replicas\":19}}" generation=1065 namespace=my-namespace resourceVersion=751072919 rollout=app-name

...
2024-01-03T16:08:42+01:00	time="2024-01-03T15:08:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":18,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:08:42Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":18,\"updatedReplicas\":10}}" generation=1064 namespace=my-namespace resourceVersion=751072501 rollout=app-name

...
2024-01-03T16:08:12+01:00	time="2024-01-03T15:08:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":17,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:08:12Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":17,\"updatedReplicas\":9}}" generation=1063 namespace=my-namespace resourceVersion=751072060 rollout=app-name

...
2024-01-03T16:08:12+01:00	time="2024-01-03T15:08:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":16,\"replicas\":16}}" generation=1063 namespace=my-namespace resourceVersion=751072045 rollout=app-name

...
2024-01-03T16:07:42+01:00	time="2024-01-03T15:07:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":15,\"message\":\"waiting for all steps to complete\",\"replicas\":15,\"updatedReplicas\":8}}" generation=1062 namespace=my-namespace resourceVersion=751071598 rollout=app-name

...
2024-01-03T16:07:42+01:00	time="2024-01-03T15:07:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":14,\"replicas\":14}}" generation=1062 namespace=my-namespace resourceVersion=751071572 rollout=app-name

...
2024-01-03T16:07:12+01:00	time="2024-01-03T15:07:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":13,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:58Z\",\"lastUpdateTime\":\"2024-01-03T15:06:58Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:07:12Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"waiting for all steps to complete\",\"replicas\":13,\"updatedReplicas\":7}}" generation=1061 namespace=my-namespace resourceVersion=751071170 rollout=app-name

...
2024-01-03T16:07:12+01:00	time="2024-01-03T15:07:12Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":12,\"replicas\":12}}" generation=1061 namespace=my-namespace resourceVersion=751071151 rollout=app-name

...
2024-01-03T16:06:42+01:00	time="2024-01-03T15:06:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":11,\"conditions\":[{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-01-03T15:02:19Z\",\"lastUpdateTime\":\"2024-01-03T15:02:19Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:02Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-01-03T15:06:42Z\",\"lastUpdateTime\":\"2024-01-03T15:06:42Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-01-03T15:06:02Z\",\"lastUpdateTime\":\"2024-01-03T15:06:42Z\",\"message\":\"ReplicaSet \\\"app-name-123450\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"message\":\"updated replicas are still becoming available\",\"replicas\":11,\"updatedReplicas\":6}}" generation=1060 namespace=my-namespace resourceVersion=751070699 rollout=app-name

...
2024-01-03T16:06:42+01:00	time="2024-01-03T15:06:42Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":10,\"replicas\":10}}" generation=1060 namespace=my-namespace resourceVersion=751070688 rollout=app-name

Logs as a whole

Explore-logs-2024-01-03 16_12_34.txt

hasan-tayyar-besik avatar Jan 03 '24 15:01 hasan-tayyar-besik

I deleted all the OPA Gatekeeper mutations that update objects; now we only have validations. KEDA is still in place and the behaviour is the same.

hasan-tayyar-besik avatar Jan 08 '24 16:01 hasan-tayyar-besik

I am facing the same issue with:

  • Argo Rollouts: v1.6.0
  • ArgoCD: v2.7.2
  • KEDA: 2.11.2

  Normal  SuccessfulRescale  15m    horizontal-pod-autoscaler  New size: 18; reason: memory resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  7m35s  horizontal-pod-autoscaler  New size: 20; reason: memory resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  91s    horizontal-pod-autoscaler  New size: 22; reason: memory resource utilization (percentage of request) above target

@markh123 - were you able to figure out any workaround for this issue?

hansel-christopher1 avatar May 02 '24 03:05 hansel-christopher1

We're having the exact same issue. For a period of time after the Rollout starts, with memory autoscaling configured, the HPA emits an event saying New size: XX; reason: memory resource utilization (percentage of request) above target.

Is there any workaround for this?

atmcarmo avatar Jul 03 '24 16:07 atmcarmo

I'm still facing this issue on v1.6.6

laivu266 avatar Jul 04 '24 09:07 laivu266

We are facing the same issue here. Are we sure this is an argo-rollouts issue and not a KEDA issue? Should this issue also be opened on KEDA's side? What can we do to help debug this problem?

diogofilipe098 avatar Aug 08 '24 14:08 diogofilipe098

Hello, I think KEDA isn't related at all, as KEDA only exposes the metric to the HPA controller, and the HPA controller operates on the /scale subresource (also, the original message uses CPU and memory, which aren't related to KEDA). IMHO the issue lies with the rollouts controller, as it's responsible for updating the underlying ReplicaSets.
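
To see exactly what the HPA controller reads and writes, the Rollout's scale subresource can be queried directly (namespace and rollout name are placeholders):

# The HPA controller scales a Rollout through this subresource,
# not through the underlying ReplicaSets.
kubectl get --raw "/apis/argoproj.io/v1alpha1/namespaces/<namespace>/rollouts/<rollout-name>/scale" | jq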

JorTurFer avatar Aug 12 '24 15:08 JorTurFer