argo-rollouts
ProgressDeadlineExceeded: ReplicaSet has timed out progressing
Checklist:
- [X] I've included steps to reproduce the bug.
- [X] I've included the version of argo rollouts.
Describe the bug
We have a number of Rollouts with analysis templates attached. The Rollout reports that the ReplicaSet has timed out progressing, but the ReplicaSet itself reports that it is healthy. Restarting the ReplicaSet seems to fix the problem. We also used to set revisionHistoryLimit: 5, and deleting one of the unused ReplicaSets would sometimes get the Rollout to see that everything was fine.
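For reference, the workaround is roughly the following (this assumes the kubectl-argo-rollouts plugin is installed; all names are placeholders):

```shell
# Retry the Rollout after it aborts with ProgressDeadlineExceeded:
kubectl argo rollouts retry rollout XXXXX -n XXXXX

# Or restart the pods, which also clears the stale condition for us:
kubectl argo rollouts restart XXXXX -n XXXXX

# Back when we had revisionHistoryLimit: 5, deleting an unused old
# ReplicaSet would sometimes also recover the Rollout:
kubectl delete replicaset <old-replicaset-name> -n XXXXX
```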
To Reproduce
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: XXXXX
  namespace: XXXXX
spec:
  progressDeadlineAbort: true
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app: XXXXX
  strategy:
    canary:
      analysis:
        analysisRunMetadata: {}
        args:
          - name: service-name
            value: XXXXX-service-v1
        startingStep: 1
        templates:
          - templateName: XXXXX
      canaryMetadata:
        labels:
          rollout-status: canary
      canaryService: XXXXX-canary
      maxSurge: 25%
      maxUnavailable: 0
      stableMetadata:
        labels:
          rollout-status: stable
      stableService: XXXXXX-service
      steps:
        - setWeight: 10
        - pause: {}
      trafficRouting:
        nginx:
          additionalIngressAnnotations:
            canary-by-header: X-Canary
            canary-by-header-value: 'true'
          stableIngress: XXXXXX
  template:
    metadata:
      labels:
        app: XXXXX
    spec:
      containers:
        - image: 'gcr.io/XXXXXXXXX'
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - '20'
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /health-liveness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: XXXXX
          ports:
            - containerPort: 8080
              name: http-server
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /health-readiness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          resources:
            limits:
              memory: 2Gi
            requests:
              memory: 1Gi
```
Expected behavior
The Rollout should reflect the ReplicaSet's current state.
Screenshots
From the Rollout:
```yaml
conditions:
  - lastTransitionTime: '2024-03-19T04:12:57Z'
    lastUpdateTime: '2024-03-19T04:12:57Z'
    message: Rollout has minimum availability
    reason: AvailableReason
    status: 'True'
    type: Available
  - lastTransitionTime: '2024-03-19T15:44:55Z'
    lastUpdateTime: '2024-03-19T15:44:55Z'
    message: Rollout is not healthy
    reason: RolloutHealthy
    status: 'False'
    type: Healthy
  - lastTransitionTime: '2024-03-21T00:07:38Z'
    lastUpdateTime: '2024-03-21T00:07:38Z'
    message: Rollout is paused
    reason: RolloutPaused
    status: 'False'
    type: Paused
  - lastTransitionTime: '2024-03-21T00:07:59Z'
    lastUpdateTime: '2024-03-21T00:07:59Z'
    message: RolloutCompleted
    reason: RolloutCompleted
    status: 'True'
    type: Completed
  - lastTransitionTime: '2024-03-21T00:18:30Z'
    lastUpdateTime: '2024-03-21T00:18:30Z'
    message: ReplicaSet "XXXXX-6645b789db" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: 'False'
    type: Progressing
```
From the ReplicaSet:
```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  annotations:
    rollout.argoproj.io/desired-replicas: '2'
    rollout.argoproj.io/ephemeral-metadata: >-
      ....
  name: XXXXX-6645b789db
  namespace: XXXXX
  ....
status:
  availableReplicas: 2
  fullyLabeledReplicas: 2
  observedGeneration: 4
  readyReplicas: 2
  replicas: 2
```
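Roughly how we compared the two views (placeholder names; the jsonpath filter is just one way to pull the condition out):

```shell
# The Rollout's Progressing condition says the ReplicaSet timed out...
kubectl get rollout XXXXX -n XXXXX \
  -o jsonpath='{.status.conditions[?(@.type=="Progressing")]}'

# ...while the very ReplicaSet it names is fully ready and available:
kubectl get replicaset XXXXX-6645b789db -n XXXXX -o jsonpath='{.status}'
```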
Version
```
image: 'quay.io/argoproj/argo-rollouts:v1.6.6'
```
Logs
time=\"2024-03-21T00:30:26Z\" level=warning msg=\"ReplicaSet \\\"XXXXX-6645b789db\\\" has timed out progressing.\" event_reason=RolloutAborted namespace=XXX rollout=XXXXX
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
We have seen this happen with the blue-green strategy as well (no analysis templates were used).
"time="2024-05-01T04:09:27Z" level=error msg="rollout syncHandler error: failed to reconcileBlueGreenReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileBlueGreenStableReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset xxx-xxx: Operation cannot be fulfilled on replicasets.apps \"xxx-xxx\": the object has been modified; please apply your changes to the latest version and try again" namespace=xxx-xxx rollout=xxx-xxx",
"time="2024-05-01T04:09:27Z" level=info msg="rollout syncHandler queue retries: 40 : key \"xxx-xxx/xxxx\"" namespace=xxx-xxx rollout=xxx-xxx",
It seems like the informer cache is not updated soon enough, so the controller keeps trying to update a stale copy of the ReplicaSet. This happens only for specific Rollout resources.
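The "object has been modified" message is the standard optimistic-concurrency conflict. A minimal client-go sketch of the usual mitigation, re-reading the object and retrying via retry.RetryOnConflict; the helper below is hypothetical, not the controller's actual code:

```go
package conflictretry

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// scaleReplicaSet is a hypothetical helper. It re-reads the ReplicaSet from
// the API server before every attempt, so the update carries the latest
// resourceVersion rather than a stale informer copy, and retries when the
// write loses the race to another writer.
func scaleReplicaSet(ctx context.Context, client kubernetes.Interface, namespace, name string, replicas int32) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// A live GET bypasses the (possibly stale) informer cache.
		rs, err := client.AppsV1().ReplicaSets(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		rs.Spec.Replicas = &replicas
		_, err = client.AppsV1().ReplicaSets(namespace).Update(ctx, rs, metav1.UpdateOptions{})
		return err
	})
}
```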
We are also seeing this with a blue-green Rollout that has an AnalysisTemplate set. Restarting resolves the issue, but it resurfaces at irregular intervals. My tentative assumption is that it happens when the Rollout did progress to all pods being ready at some point, but some pods were then killed (evicted, or terminated by an imminent node shutdown; we are using spot nodes), and that killing happened after the Rollout's progress deadline had already elapsed.
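If that theory holds, a possible mitigation (not a fix) is to loosen the deadline and stop auto-aborting on it; the values below are illustrative only:

```yaml
spec:
  # Illustrative values: give evicted/rescheduled pods more time before the
  # deadline trips, and keep the Rollout from aborting itself when it does.
  progressDeadlineSeconds: 1800
  progressDeadlineAbort: false
```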