HPA scaling while in scale down delay window causes perpetual "progressing" state on rollout

Open yohanb opened this issue 5 months ago • 0 comments

Checklist:

  • [x] I've included steps to reproduce the bug.
  • [x] I've included the version of argo rollouts.

Describe the bug

I’ve noticed a bug in the Rollout behaviour when these specific conditions are met:

  • the old revision is still active (within the scale down delay window)
  • the HPA is changed for the Rollout (e.g. scaling from 10 to 20 pods)
  • a canary rollout is triggered

It seems like the HPA only scales the stable ReplicaSet and not the old revision. When a new canary rollout is triggered, the Rollout gets stuck in a perpetual Progressing state with the message "more replicas need to be updated". Looking at the code, this seems to be because UpdatedReplicas doesn’t match spec.replicas.
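
To make the suspected cause concrete, here is a rough Python model of the check described above. It is only a sketch of my reading of the behaviour, not the controller's actual code: `RolloutState` and `rollout_phase` are hypothetical names, and the exact comparison the controller performs is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RolloutState:
    """Hypothetical, minimal stand-in for the two fields involved."""
    spec_replicas: int     # spec.replicas, the value the HPA writes to
    updated_replicas: int  # status.updatedReplicas, pods on the newest revision


def rollout_phase(state: RolloutState) -> Tuple[str, str]:
    """Return (phase, message) under the assumed rule from this report."""
    # Assumption: any mismatch keeps the Rollout in Progressing.
    if state.updated_replicas != state.spec_replicas:
        return "Progressing", "more replicas need to be updated"
    return "Healthy", ""


# The HPA raised spec.replicas to 2, but only 1 updated pod is counted.
print(rollout_phase(RolloutState(spec_replicas=2, updated_replicas=1)))
```

If the updated replica count never catches up with the value set by the HPA, this check never returns anything other than Progressing, which matches the symptom.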

To Reproduce

  1. Create a Rollout with a scale down delay and an attached HorizontalPodAutoscaler:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  revisionHistoryLimit: 1
  rollbackWindow:
    revisions: 1
  strategy:
    canary:
      scaleDownDelaySeconds: 3600 # 1 hour
      scaleDownDelayRevisionLimit: 1
...
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  maxReplicas: 5
  metrics:
    - resource:
        name: cpu
        target:
          averageUtilization: 80
          type: Utilization
      type: Resource
  minReplicas: 1
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: app
  2. Run a canary rollout to completion so that an old revision is still running within the scaleDownDelaySeconds window
  3. Trigger a replica count change through the HorizontalPodAutoscaler, for example by changing minReplicas to 2. Notice that the HPA only affects the ReplicaSet of the latest revision and not the previous one
  4. Trigger another canary rollout
  5. Notice the Rollout is in a perpetual Progressing state with the message "more replicas need to be updated"

Expected behavior

I think the HorizontalPodAutoscaler should scale all ReplicaSets so they stay in sync and, if a rollback is ever performed, the previous revision is able to handle the load. If that's not possible, it should at least not block the rollout progression.
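
As an illustration of the difference between what I observe and what I would expect, here is a small Python sketch. It is a toy model only: the revision names and both function names are hypothetical, and it does not reflect how the controller is implemented.

```python
from typing import Dict


def scale_stable_only(replica_sets: Dict[str, int], stable: str, desired: int) -> Dict[str, int]:
    """Observed behaviour as reported: only the stable ReplicaSet is resized."""
    scaled = dict(replica_sets)
    scaled[stable] = desired
    return scaled


def scale_all_active(replica_sets: Dict[str, int], desired: int) -> Dict[str, int]:
    """Requested behaviour: every ReplicaSet that still runs pods follows the HPA."""
    return {name: (desired if count > 0 else 0) for name, count in replica_sets.items()}


# rev-1 is the old revision still inside the scale down delay window,
# rev-2 is the current stable revision (hypothetical names).
before = {"rev-1": 1, "rev-2": 1}
print(scale_stable_only(before, stable="rev-2", desired=2))
print(scale_all_active(before, desired=2))
```

In the first case the old revision stays at its previous size and would be undersized on rollback; in the second, both revisions carry the replica count requested by the HPA.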

Screenshots

image

Version

v1.7.2+59e5bd3

Logs

None for the moment. Will try to reproduce and post them.

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

yohanb · Sep 24 '24 14:09