HPA scaling while in scale down delay window causes perpetual "progressing" state on rollout
Checklist:
- [x] I've included steps to reproduce the bug.
- [x] I've included the version of argo rollouts.
Describe the bug
I’ve noticed a bug in the `Rollout` behaviour when these specific conditions are met:
- the old revision is still active (within the scale down delay window)
- the HPA replica count is changed for the rollout (ex: scaled from 10 to 20 pods)
- a canary rollout is triggered

It seems like the HPA only scales the stable replica set and not the old revision. When a new canary rollout is triggered, this causes it to be in a perpetual `Progressing` state with the message `"more replicas need to be updated"`. Looking at the code, it seems to be because `UpdatedReplicas` doesn’t match `spec.replicas`.
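
To make the suspected mechanism concrete, here is a minimal sketch of the kind of completeness check I believe is involved; the names and structure are my own simplification, not the actual argo-rollouts source:

```go
// Minimal sketch of the suspected completeness check (my own simplification,
// not the actual argo-rollouts source). The Progressing condition keeps
// reporting "more replicas need to be updated" while the number of
// up-to-date pods is below the desired replica count.
package main

import "fmt"

// rolloutFullyUpdated approximates the comparison between
// status.updatedReplicas and spec.replicas described above.
func rolloutFullyUpdated(desiredReplicas, updatedReplicas int32) bool {
	return updatedReplicas >= desiredReplicas
}

func main() {
	// The HPA bumped spec.replicas from 10 to 20, but only the stable
	// ReplicaSet was scaled; the old revision kept its former size, so the
	// counts never converge after the next canary update.
	fmt.Println(rolloutFullyUpdated(20, 10)) // false -> stuck in Progressing
}
```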
To Reproduce
- Create a `Rollout` with a scale down delay and an attached `HorizontalPodAutoscaler`:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  revisionHistoryLimit: 1
  rollbackWindow:
    revisions: 1
  strategy:
    canary:
      scaleDownDelaySeconds: 3600 # 1 hour
      scaleDownDelayRevisionLimit: 1
  ...
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: app
```
- Run a canary rollout to completion in order to have a running old revision within the `scaleDownDelaySeconds` window
- Trigger a replica count change with the `HorizontalPodAutoscaler`. For example, change the `minReplicas` to 2. Notice the HPA only affects the latest revision’s `ReplicaSet` and not the previous one
- Trigger another canary rollout
- Notice the `Rollout` is in a perpetual `Progressing` state with the message `"more replicas need to be updated"`
Expected behavior
I think the `HorizontalPodAutoscaler` should scale all `ReplicaSets` so they are in sync, and if a rollback is ever performed, the previous revision will be able to handle the load. If that's not possible, then it should at least not block the rollout progression.
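
To illustrate the first option, a rough sketch of the behaviour I have in mind is shown below; the types and helper are hypothetical, not existing argo-rollouts code:

```go
// Hypothetical sketch of the expected behaviour (not existing argo-rollouts
// code): when the desired replica count changes, ReplicaSets that are still
// inside the scale-down-delay window are resized as well, so a rollback to
// the previous revision would have enough capacity.
package main

import "fmt"

type replicaSetInfo struct {
	name              string
	replicas          int32
	inScaleDownWindow bool // old revision kept alive by scaleDownDelaySeconds
}

// syncDelayedReplicaSets returns the sizes a controller could apply so that
// every ReplicaSet still in the delay window matches the desired count.
func syncDelayedReplicaSets(desired int32, sets []replicaSetInfo) map[string]int32 {
	resized := map[string]int32{}
	for _, rs := range sets {
		if rs.inScaleDownWindow && rs.replicas != desired {
			resized[rs.name] = desired
		}
	}
	return resized
}

func main() {
	sets := []replicaSetInfo{
		{name: "app-old", replicas: 10, inScaleDownWindow: true},
		{name: "app-stable", replicas: 20, inScaleDownWindow: false},
	}
	// With spec.replicas bumped to 20 by the HPA, the old revision would be
	// brought in sync instead of being left at 10.
	fmt.Println(syncDelayedReplicaSets(20, sets)) // map[app-old:20]
}
```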
Screenshots
Version
v1.7.2+59e5bd3
Logs
None for the moment. Will try to reproduce and post them.
```shell
# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>
```
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.