argo-rollouts
ProgressDeadlineExceeded: ReplicaSet has timed out progressing
Checklist:
- [X] I've included steps to reproduce the bug.
- [X] I've included the version of argo rollouts.
Describe the bug
We have a number of Rollouts with analysis templates attached. The Rollout reports that the ReplicaSet has timed out progressing, but the ReplicaSet itself reports that it is healthy. Restarting the ReplicaSet seems to fix the problem. We also used to set revisionHistoryLimit: 5, and deleting one of the unused ReplicaSets would sometimes get the Rollout to see that everything was fine.
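For reference, the workaround is roughly the following (this assumes the kubectl-argo-rollouts plugin is installed; all names are placeholders):

```shell
# Retry the Rollout after it aborts with ProgressDeadlineExceeded:
kubectl argo rollouts retry rollout XXXXX -n XXXXX

# Or restart the pods, which also clears the stale condition for us:
kubectl argo rollouts restart XXXXX -n XXXXX

# Back when we had revisionHistoryLimit: 5, deleting an unused old
# ReplicaSet would sometimes also recover the Rollout:
kubectl delete replicaset <old-replicaset-name> -n XXXXX
```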
To Reproduce
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: XXXXX
  namespace: XXXXX
spec:
  progressDeadlineAbort: true
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app: XXXXX
  strategy:
    canary:
      analysis:
        analysisRunMetadata: {}
        args:
          - name: service-name
            value: XXXXX-service-v1
        startingStep: 1
        templates:
          - templateName: XXXXX
      canaryMetadata:
        labels:
          rollout-status: canary
      canaryService: XXXXX-canary
      maxSurge: 25%
      maxUnavailable: 0
      stableMetadata:
        labels:
          rollout-status: stable
      stableService: XXXXXX-service
      steps:
        - setWeight: 10
        - pause: {}
      trafficRouting:
        nginx:
          additionalIngressAnnotations:
            canary-by-header: X-Canary
            canary-by-header-value: 'true'
          stableIngress: XXXXXX
  template:
    metadata:
      labels:
        app: XXXXX
    spec:
      containers:
        - image: 'gcr.io/XXXXXXXXX'
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - '20'
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /health-liveness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: XXXXX
          ports:
            - containerPort: 8080
              name: http-server
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /health-readiness
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          resources:
            limits:
              memory: 2Gi
            requests:
              memory: 1Gi
```
Expected behavior
The Rollout should reflect the ReplicaSet's current state.
Screenshots
From the Rollout:
```yaml
conditions:
  - lastTransitionTime: '2024-03-19T04:12:57Z'
    lastUpdateTime: '2024-03-19T04:12:57Z'
    message: Rollout has minimum availability
    reason: AvailableReason
    status: 'True'
    type: Available
  - lastTransitionTime: '2024-03-19T15:44:55Z'
    lastUpdateTime: '2024-03-19T15:44:55Z'
    message: Rollout is not healthy
    reason: RolloutHealthy
    status: 'False'
    type: Healthy
  - lastTransitionTime: '2024-03-21T00:07:38Z'
    lastUpdateTime: '2024-03-21T00:07:38Z'
    message: Rollout is paused
    reason: RolloutPaused
    status: 'False'
    type: Paused
  - lastTransitionTime: '2024-03-21T00:07:59Z'
    lastUpdateTime: '2024-03-21T00:07:59Z'
    message: RolloutCompleted
    reason: RolloutCompleted
    status: 'True'
    type: Completed
  - lastTransitionTime: '2024-03-21T00:18:30Z'
    lastUpdateTime: '2024-03-21T00:18:30Z'
    message: ReplicaSet "XXXXX-6645b789db" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: 'False'
    type: Progressing
```
From the ReplicaSet:
```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  annotations:
    rollout.argoproj.io/desired-replicas: '2'
    rollout.argoproj.io/ephemeral-metadata: >-
      ....
  name: XXXXX-6645b789db
  namespace: XXXXX
  ....
status:
  availableReplicas: 2
  fullyLabeledReplicas: 2
  observedGeneration: 4
  readyReplicas: 2
  replicas: 2
```
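Roughly how we compared the two views (placeholder names; the jsonpath filter is just one way to pull the condition out):

```shell
# The Rollout's Progressing condition says the ReplicaSet timed out...
kubectl get rollout XXXXX -n XXXXX \
  -o jsonpath='{.status.conditions[?(@.type=="Progressing")]}'

# ...while the very ReplicaSet it names is fully ready and available:
kubectl get replicaset XXXXX-6645b789db -n XXXXX -o jsonpath='{.status}'
```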
Version
```
image: 'quay.io/argoproj/argo-rollouts:v1.6.6'
```
Logs
time=\"2024-03-21T00:30:26Z\" level=warning msg=\"ReplicaSet \\\"XXXXX-6645b789db\\\" has timed out progressing.\" event_reason=RolloutAborted namespace=XXX rollout=XXXXX
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
We have seen this happen with the blue-green strategy as well (no analysis templates were used).
"time="2024-05-01T04:09:27Z" level=error msg="rollout syncHandler error: failed to reconcileBlueGreenReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileBlueGreenStableReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset xxx-xxx: Operation cannot be fulfilled on replicasets.apps \"xxx-xxx\": the object has been modified; please apply your changes to the latest version and try again" namespace=xxx-xxx rollout=xxx-xxx",
"time="2024-05-01T04:09:27Z" level=info msg="rollout syncHandler queue retries: 40 : key \"xxx-xxx/xxxx\"" namespace=xxx-xxx rollout=xxx-xxx",
It seems like the informer cache is not updated soon enough, so the controller keeps trying to update a stale copy of the ReplicaSet. This happens only for specific Rollout resources.
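The "object has been modified" message is the standard optimistic-concurrency conflict. A minimal client-go sketch of the usual mitigation, re-reading the object and retrying via retry.RetryOnConflict; the helper below is hypothetical, not the controller's actual code:

```go
package conflictretry

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// scaleReplicaSet is a hypothetical helper. It re-reads the ReplicaSet from
// the API server before every attempt, so the update carries the latest
// resourceVersion rather than a stale informer copy, and retries when the
// write loses the race to another writer.
func scaleReplicaSet(ctx context.Context, client kubernetes.Interface, namespace, name string, replicas int32) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// A live GET bypasses the (possibly stale) informer cache.
		rs, err := client.AppsV1().ReplicaSets(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		rs.Spec.Replicas = &replicas
		_, err = client.AppsV1().ReplicaSets(namespace).Update(ctx, rs, metav1.UpdateOptions{})
		return err
	})
}
```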
We are also seeing this with a blue-green Rollout that has an AnalysisTemplate set. Restarting resolves the issue, but it resurfaces at irregular intervals. My tentative assumption is that it happens when the Rollout did progress to all pods being ready at some point, but some pods were then killed (evicted, or terminated by an imminent node shutdown; we are using spot nodes), and that killing happened after the Rollout's progress deadline had already elapsed.
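If that theory holds, a possible mitigation (not a fix) is to loosen the deadline and stop auto-aborting on it; the values below are illustrative only:

```yaml
spec:
  # Illustrative values: give evicted/rescheduled pods more time before the
  # deadline trips, and keep the Rollout from aborting itself when it does.
  progressDeadlineSeconds: 1800
  progressDeadlineAbort: false
```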