HPA scaling while in scale down delay window causes perpetual "progressing" state on rollout
Checklist:
- [x] I've included steps to reproduce the bug.
- [x] I've included the version of argo rollouts.
Describe the bug
I’ve noticed a bug in the `Rollout` behaviour when these specific conditions are met:
- the old revision is still active (within the scale down delay window)
- the HPA replica count is changed for the rollout (ex: scaled from 10 to 20 pods)
- a canary rollout is triggered

It seems like the HPA only scales the stable replica set and not the old revision. When a new canary rollout is triggered, this causes it to be in a perpetual `Progressing` state with the message `"more replicas need to be updated"`. Looking at the code, it seems to be because `UpdatedReplicas` doesn’t match `spec.replicas`.
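
To make the suspected mechanism concrete, here is a minimal sketch of the kind of completeness check I believe is involved; the names and structure are my own simplification, not the actual argo-rollouts source:

```go
// Minimal sketch of the suspected completeness check (my own simplification,
// not the actual argo-rollouts source). The Progressing condition keeps
// reporting "more replicas need to be updated" while the number of
// up-to-date pods is below the desired replica count.
package main

import "fmt"

// rolloutFullyUpdated approximates the comparison between
// status.updatedReplicas and spec.replicas described above.
func rolloutFullyUpdated(desiredReplicas, updatedReplicas int32) bool {
	return updatedReplicas >= desiredReplicas
}

func main() {
	// The HPA bumped spec.replicas from 10 to 20, but only the stable
	// ReplicaSet was scaled; the old revision kept its former size, so the
	// counts never converge after the next canary update.
	fmt.Println(rolloutFullyUpdated(20, 10)) // false -> stuck in Progressing
}
```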
To Reproduce
- Create a `Rollout` with a scale down delay and an attached `HorizontalPodAutoscaler`:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  revisionHistoryLimit: 1
  rollbackWindow:
    revisions: 1
  strategy:
    canary:
      scaleDownDelaySeconds: 3600 # 1 hour
      scaleDownDelayRevisionLimit: 1
  ...
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: app
```
- Run a canary rollout to completion in order to have a running old revision within the `scaleDownDelaySeconds` window
- Trigger a replica count change with the `HorizontalPodAutoscaler`. For example, change the `minReplicas` to 2. Notice the HPA only affects the latest revision’s `ReplicaSet` and not the previous one
- Trigger another canary rollout
- Notice the `Rollout` is in a perpetual `Progressing` state with the message `"more replicas need to be updated"`
Expected behavior
I think the `HorizontalPodAutoscaler` should scale all `ReplicaSets` so they are in sync, and if a rollback is ever performed, the previous revision will be able to handle the load. If that's not possible, then it should at least not block the rollout progression.
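
To illustrate the first option, a rough sketch of the behaviour I have in mind is shown below; the types and helper are hypothetical, not existing argo-rollouts code:

```go
// Hypothetical sketch of the expected behaviour (not existing argo-rollouts
// code): when the desired replica count changes, ReplicaSets that are still
// inside the scale-down-delay window are resized as well, so a rollback to
// the previous revision would have enough capacity.
package main

import "fmt"

type replicaSetInfo struct {
	name              string
	replicas          int32
	inScaleDownWindow bool // old revision kept alive by scaleDownDelaySeconds
}

// syncDelayedReplicaSets returns the sizes a controller could apply so that
// every ReplicaSet still in the delay window matches the desired count.
func syncDelayedReplicaSets(desired int32, sets []replicaSetInfo) map[string]int32 {
	resized := map[string]int32{}
	for _, rs := range sets {
		if rs.inScaleDownWindow && rs.replicas != desired {
			resized[rs.name] = desired
		}
	}
	return resized
}

func main() {
	sets := []replicaSetInfo{
		{name: "app-old", replicas: 10, inScaleDownWindow: true},
		{name: "app-stable", replicas: 20, inScaleDownWindow: false},
	}
	// With spec.replicas bumped to 20 by the HPA, the old revision would be
	// brought in sync instead of being left at 10.
	fmt.Println(syncDelayedReplicaSets(20, sets)) // map[app-old:20]
}
```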
Screenshots
Version
v1.7.2+59e5bd3
Logs
None for the moment. Will try to reproduce and post them.
```shell
# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>
```
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.