argo-rollouts icon indicating copy to clipboard operation
argo-rollouts copied to clipboard

After abort, rollout does not update Istio DestinationRule

Open mksha opened this issue 3 years ago β€’ 42 comments

Summary

What happened/what you expected to happen?

When we abort the rollout then it should update the Istio destination rule so in destinationrule we have subsets with correct rollouts-pod-template-hash label.

But currently when we abort the rollout it does not update the destination rule, causing cannery endpoint to fail.

Diagnostics

What version of Argo Rollouts are you running? 0.3.0

# Paste the logs from the rollout controller
canary subset is not updated with correct rollouts-pod-template-hash label value.

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>
logs does not have any reference for destinationrule update after abort action

Message from the maintainers:

Impacted by this bug? Give it a πŸ‘. We prioritize the issues with the most πŸ‘.

mksha avatar Feb 02 '22 13:02 mksha

Hi,@mksha, could you confirm if this happens to v0.3.0? Have you tried with v1.1.1 or the latest argo-rollouts image?

huikang avatar Feb 02 '22 14:02 huikang

yes, we are using 1.1.1

mksha avatar Feb 02 '22 14:02 mksha

@mksha , could you paste here the output from get rollout <rollout name> and get desinationrule <name> -o yaml? Thanks.

huikang avatar Feb 02 '22 15:02 huikang

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  labels:
    region: us-east-1
    system_risk_class: "1"
  managedFields:
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:rollout.argoproj.io/revision: {}
      f:spec:
        f:replicas: {}
      f:status:
        .: {}
        f:HPAReplicas: {}
        f:availableReplicas: {}
        f:blueGreen: {}
        f:canary:
          .: {}
          f:weights:
            .: {}
            f:canary:
              .: {}
              f:podTemplateHash: {}
              f:weight: {}
            f:stable:
              .: {}
              f:podTemplateHash: {}
              f:weight: {}
        f:conditions: {}
        f:currentPodHash: {}
        f:currentStepHash: {}
        f:currentStepIndex: {}
        f:observedGeneration: {}
        f:phase: {}
        f:readyReplicas: {}
        f:replicas: {}
        f:selector: {}
        f:stableRS: {}
        f:updatedReplicas: {}
    manager: rollouts-controller
    operation: Update
    time: "2022-02-02T15:10:44Z"
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/version: {}
          f:app_component: {}
          f:app_contacts: {}
          f:app_environment: {}
          f:application: {}
          f:argocd.argoproj.io/instance: {}
          f:business_unit: {}
          f:created_by: {}
          f:function: {}
          f:helm.sh/chart: {}
          f:network_environment: {}
          f:region: {}
          f:system_risk_class: {}
      f:spec:
        .: {}
        f:minReadySeconds: {}
        f:paused: {}
        f:progressDeadlineSeconds: {}
        f:revisionHistoryLimit: {}
        f:selector:
          .: {}
          f:matchLabels:
            .: {}
            f:app.kubernetes.io/component: {}
            f:app.kubernetes.io/instance: {}
            f:app.kubernetes.io/managed-by: {}
            f:app.kubernetes.io/name: {}
            f:app_component: {}
            f:app_environment: {}
            f:application: {}
            f:business_unit: {}
            f:function: {}
            f:network_environment: {}
            f:region: {}
        f:strategy:
          .: {}
          f:canary:
            .: {}
            f:antiAffinity:
              .: {}
              f:preferredDuringSchedulingIgnoredDuringExecution:
                .: {}
                f:weight: {}
            f:canaryMetadata:
              .: {}
              f:annotations:
                .: {}
                f:role: {}
              f:labels:
                .: {}
                f:role: {}
            f:stableMetadata:
              .: {}
              f:annotations:
                .: {}
                f:role: {}
              f:labels:
                .: {}
                f:role: {}
            f:steps: {}
            f:trafficRouting:
              .: {}
              f:istio:
                .: {}
                f:destinationRule:
                  .: {}
                  f:canarySubsetName: {}
                  f:name: {}
                  f:stableSubsetName: {}
                f:virtualServices: {}
        f:template:
          .: {}
          f:metadata:
            .: {}
            f:labels:
              .: {}
              f:app.kubernetes.io/component: {}
              f:app.kubernetes.io/instance: {}
              f:app.kubernetes.io/managed-by: {}
              f:app.kubernetes.io/name: {}
              f:app.kubernetes.io/version: {}
              f:app_component: {}
              f:app_contacts: {}
              f:app_environment: {}
              f:application: {}
              f:business_unit: {}
              f:created_by: {}
              f:function: {}
              f:helm.sh/chart: {}
              f:network_environment: {}
              f:region: {}
              f:system_risk_class: {}
          f:spec:
            .: {}
            f:affinity:
              .: {}
              f:nodeAffinity:
                .: {}
                f:preferredDuringSchedulingIgnoredDuringExecution: {}
              f:podAntiAffinity:
                .: {}
                f:preferredDuringSchedulingIgnoredDuringExecution: {}
            f:containers: {}
            f:imagePullSecrets: {}
            f:serviceAccountName: {}
            f:volumes: {}
    manager: argocd-application-controller
    operation: Update
    time: "2022-02-02T15:10:49Z"
  name: testapp
  namespace: test-ns
spec:
  minReadySeconds: 60
  paused: false
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app.kubernetes.io/component: testapp
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: testapp
      app_component: testapp
      app_environment: dev
      application: testapp
      business_unit: bu
      function: appserver
      network_environment: dev
      region: us-east-1
  strategy:
    canary:
      antiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          weight: 1
      canaryMetadata:
        annotations:
          role: canary
        labels:
          role: canary
      stableMetadata:
        annotations:
          role: stable
        labels:
          role: stable
      steps:
      - setWeight: 10
      - pause: {}
      - setWeight: 50
      - pause: {}
      - setWeight: 100
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: canary
            name: testapp
            stableSubsetName: stable
          virtualServices:
          - name: testapp-internal
            routes:
            - NewSessionDoc
            - NewSessionDocInternal
          - name: testapp-external
            routes:
            - NewSessionDoc
  template:
    metadata:
      labels:
        app.kubernetes.io/component: testapp
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: testapp
        app.kubernetes.io/version: 09c502fab0a1043740c21566dec377ce9f17e3cb-9
        app_component: testapp
        app_environment: dev
        application: testapp
        business_unit: bu
        function: appserver
        helm.sh/chart: webapp-5.0.1-alpha.9
        network_environment: dev
        region: us-east-1
        system_risk_class: "1"
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: workload
                operator: In
                values:
                - test
            weight: 50
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/test
                operator: In
                values:
                - ""
            weight: 1
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - testapp
                - key: app.kubernetes.io/version
                  operator: In
                  values:
                  - 09c502fab0a1043740c21566dec377ce9f17e3cb-9
              topologyKey: topology.kubernetes.io/zone
            weight: 51
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - testapp
                - key: app.kubernetes.io/version
                  operator: In
                  values:
                  - 09c502fab0a1043740c21566dec377ce9f17e3cb-9
              topologyKey: kubernetes.io/hostname
            weight: 50
      containers:
      - env:
        - name: JAVA_MS
          value: -Xms1g
        - name: JAVA_MX
          value: -Xmx1g
        - name: PROFILE_ENV
          value: dev
        image: testapp:09c502fab0a1043740c21566dec377ce9f17e3cb-9
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /keepalive.html
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 150
          successThreshold: 1
        name: testapp
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /keepalive.html
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 150
          successThreshold: 1
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 1
            memory: 1.5Gi
        volumeMounts:
        - mountPath: /web/company/docs/
          name: pv-volume
      imagePullSecrets:
      - name: credreg
      serviceAccountName: testapp
      volumes:
      - name: pv-volume
        nfs:
          path: /web/company/docs/
          server: company.com
status:
  HPAReplicas: 1
  availableReplicas: 1
  blueGreen: {}
  canary:
    weights:
      canary:
        podTemplateHash: 649fdf8c7d
        weight: 0
      stable:
        podTemplateHash: 649fdf8c7d
        weight: 100
  conditions:
  - lastTransitionTime: "2022-02-02T15:08:35Z"
    lastUpdateTime: "2022-02-02T15:08:35Z"
    message: Rollout is paused
    reason: RolloutPaused
    status: "False"
    type: Paused
  - lastTransitionTime: "2022-02-02T15:10:44Z"
    lastUpdateTime: "2022-02-02T15:10:44Z"
    message: Rollout has minimum availability
    reason: AvailableReason
    status: "True"
    type: Available
  - lastTransitionTime: "2022-02-02T15:13:19Z"
    lastUpdateTime: "2022-02-02T15:13:19Z"
    message: RolloutCompleted
    reason: RolloutCompleted
    status: "True"
    type: Completed
  - lastTransitionTime: "2022-02-02T15:08:35Z"
    lastUpdateTime: "2022-02-02T15:13:19Z"
    message: ReplicaSet "testapp-649fdf8c7d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  currentPodHash: 649fdf8c7d
  currentStepHash: 5cd664bc9d
  currentStepIndex: 5
  observedGeneration: "115"
  phase: Healthy
  readyReplicas: 1
  replicas: 1
  selector: app.kubernetes.io/component=testapp,app.kubernetes.io/managed-by=Helm,app.kubernetes.io/name=testapp,app_component=testapp,app_environment=dev,application=testapp,business_unit=bu,function=appserver,network_environment=dev,region=us-east-1
  stableRS: 649fdf8c7d
  updatedReplicas: 1

mksha avatar Feb 02 '22 15:02 mksha

@huikang

mksha avatar Feb 02 '22 15:02 mksha

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  annotations:
    argo-rollouts.argoproj.io/managed-by-rollouts: testapp
  creationTimestamp: "2021-11-29T13:48:24Z"
  generation: 147
  labels:
    app.kubernetes.io/component: testapp
  managedFields:
  - apiVersion: networking.istio.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/version: {}
          f:app_component: {}
          f:app_contacts: {}
          f:app_environment: {}
          f:application: {}
          f:argocd.argoproj.io/instance: {}
          f:business_unit: {}
          f:created_by: {}
          f:function: {}
          f:helm.sh/chart: {}
          f:network_environment: {}
          f:region: {}
          f:system_risk_class: {}
      f:spec:
        .: {}
        f:host: {}
        f:trafficPolicy:
          .: {}
          f:loadBalancer:
            .: {}
            f:consistentHash:
              .: {}
              f:httpCookie:
                .: {}
                f:name: {}
                f:path: {}
                f:ttl: {}
    manager: argocd-application-controller
    operation: Update
    time: "2022-02-02T15:10:44Z"
  - apiVersion: networking.istio.io/v1alpha3
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:argo-rollouts.argoproj.io/managed-by-rollouts: {}
      f:spec:
        f:subsets: {}
    manager: rollouts-controller
    operation: Update
    time: "2022-02-02T15:10:44Z"
  name: testapp
  namespace: default
  resourceVersion: "1013555439"
  selfLink: /apis/networking.istio.io/v1beta1/namespaces/default/destinationrules/testapp
  uid: 4d21d1d1-e039-415d-942a-b7ce99dbdd93
spec:
  host: testapp
  subsets:
  - labels:
      app.kubernetes.io/component: testapp
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: testapp
      app_component: testapp
      app_environment: dev
      application: testapp
      business_unit: pdo
      function: appserver
      network_environment: dev
      region: us-east-1
      rollouts-pod-template-hash: 56b5f6b98d
    name: canary
  - labels:
      app.kubernetes.io/component: testapp
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: testapp
      app_component: testapp
      app_environment: dev
      application: testapp
      business_unit: pdo
      function: appserver
      network_environment: dev
      region: us-east-1
      rollouts-pod-template-hash: 649fdf8c7d
    name: stable
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:
          name: testapp
          path: /
          ttl: 1m0s

mksha avatar Feb 02 '22 15:02 mksha

@huikang let me know if you need more details. thanks

mksha avatar Feb 02 '22 15:02 mksha

Hi, @mksha , thanks for the detailed info. It would be even better if you can provide the output of kubectl argo rollouts get rollout <rollout name> Thanks.

huikang avatar Feb 02 '22 16:02 huikang

@jgwest @alexec

mksha avatar Feb 02 '22 16:02 mksha

Also how about the status of the virtual service testapp-external?

huikang avatar Feb 02 '22 16:02 huikang

I couldn't reproduce the error with the latest master branch: after abort an update, the virtualservice is set with 100 weght to the stable subset

http:
  - name: primary
    route:
    - destination:
        host: istio-rollout
        subset: stable
      weight: 100
    - destination:
        host: istio-rollout
        subset: canary
      weight: 0

The destination has the correct pod labels:

spec:
  host: istio-rollout
  subsets:
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: 8c48ddb95
    name: canary
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: f4dcdc666
    name: stable
# rollout status
  canary:
    weights:
      canary:
        podTemplateHash: 8c48ddb95
        weight: 0
      stable:
        podTemplateHash: f4dcdc666
        weight: 100

And the stable pod is still running

get replicaset
NAME                      DESIRED   CURRENT   READY   AGE
istio-rollout-8c48ddb95   0         0         0       7m20s
istio-rollout-f4dcdc666   1         1         1       7m36s

huikang avatar Feb 02 '22 16:02 huikang

@mksha , in your case, I noticed the following in the rollout status

  canary:
    weights:
      canary:
        podTemplateHash: 649fdf8c7d
        weight: 0
      stable:
        podTemplateHash: 649fdf8c7d
        weight: 100

The canary pod is supposed to have a different hash value. I remember there was some bug related to this fixed after v1.1.1. Could you try using the latest version?

huikang avatar Feb 02 '22 16:02 huikang

Name:            testapp
Namespace:       default
Status:          βœ– Degraded
Message:         RolloutAborted: Rollout aborted update to revision 11
Strategy:        Canary
  Step:          0/5
  SetWeight:     0
  ActualWeight:  0
Images:          testapp:09c502fab0a1043740c21566dec377ce9f17e3cb-9 (stable)
Replicas:
  Desired:       1
  Current:       1
  Updated:       0
  Ready:         1
  Available:     1

NAME                                   KIND        STATUS        AGE    INFO
⟳ gdocument                            Rollout     βœ– Degraded    65d    
β”œβ”€β”€# revision:11                                                        
β”‚  └──⧉testapp-75dfb84574           ReplicaSet  β€’ ScaledDown  27m    canary,delay:passed
β”œβ”€β”€# revision:10                                                        
β”‚  └──⧉testapp-649fdf8c7d           ReplicaSet  βœ” Healthy     4h3m   stable
β”‚     └──░testapp-649fdf8c7d-wf79v  Pod         βœ” Running     107m   ready:2/2
└──# revision:9                                                         
   └──⧉testapp-56b5f6b98d           ReplicaSet  β€’ ScaledDown  4h13m  

mksha avatar Feb 02 '22 16:02 mksha

@huikang there you go

mksha avatar Feb 02 '22 16:02 mksha

after the Aborting rollout

mksha avatar Feb 02 '22 16:02 mksha

@mksha , could you try the latest image? I think the issue has been fixed there.

huikang avatar Feb 02 '22 19:02 huikang

HI @huikang , after the abort in destinationrule both subsets have same rollouts-pod-template-hash: f4dcdc666, bec canary pod is not there to server the canary traffic, so if destination rules does not point to stable for both stable and canary then canary requests will fail.

mksha avatar Feb 03 '22 09:02 mksha

@huikang do you mean trying image created from master branch ? bec I am already using v1.1.1

mksha avatar Feb 03 '22 09:02 mksha

canary:
    weights:
      canary:
        podTemplateHash: 5865cf698b
        weight: 0
      stable:
        podTemplateHash: 649fdf8c7d
        weight: 100

mksha avatar Feb 03 '22 09:02 mksha

from the rollout that was aborted using image v1.1.1

mksha avatar Feb 03 '22 09:02 mksha

But still destination rule is not updated , after canary pod deletion.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  annotations:
    argo-rollouts.argoproj.io/managed-by-rollouts: testapp
  name: testapp
spec:
  host: testapp
  subsets:
    - labels:
        app.kubernetes.io/component: testapp
        rollouts-pod-template-hash: 5865cf698b
      name: canary
    - labels:
        app.kubernetes.io/component: testapp
        rollouts-pod-template-hash: 649fdf8c7d
      name: stable

mksha avatar Feb 03 '22 09:02 mksha

@huikang after the abort both subsets should point to stable hash

mksha avatar Feb 03 '22 09:02 mksha

or the pod hash label should be removed completely after abort

mksha avatar Feb 03 '22 09:02 mksha

I just figured out that Before rollout starting and after rollout completed both canary and stable subsets should point to stable pod hash template

example

weights:
      canary:
        podTemplateHash: 74c7887f8
        weight: 0
      stable:
        podTemplateHash: 74c7887f8
        weight: 100

But in our case after abort, its

canary:
    weights:
      canary:
        podTemplateHash: 6d947796cd
        weight: 0
      stable:
        podTemplateHash: 74c7887f8
        weight: 100

that is what is being copied to destination rule that is why its causing issues.

mksha avatar Feb 03 '22 11:02 mksha

Tested with the latest version and got the following after abort a canary update

# Rollout status
status:
  canary:
    currentBackgroundAnalysisRunStatus:
      message: Run Terminated
      name: istio-rollout-8c48ddb95-2
      status: Successful
    weights:
      canary:
        podTemplateHash: 8c48ddb95
        weight: 0
      stable:
        podTemplateHash: f4dcdc666
        weight: 100

The destinationrul matches the status :

spec:
  host: istio-rollout
  subsets:
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: 8c48ddb95
    name: canary
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: f4dcdc666
    name: stable

huikang avatar Feb 03 '22 18:02 huikang

@huikang which means its wrong, bec in that case canary subset will point to the canary replicaset that does not have any pod running meaning canary requests will fail.

Ideally when we do the canary deployment in production, then some amount of traffic will be going to canary version of app and we will be performing some analysis and if analysis failed, meaning rollout aborted in that case, we will have canary subset pointing to canary replica set that does not have any pod meaning all clients using canary version will see issues.

So we need to have same behavior as we have at the end of full promotion where both canary and stable subset points to stable replicaset.

mksha avatar Feb 04 '22 04:02 mksha

@huikang which means its wrong, bec in that case canary subset will point to the canary replicaset that does not have any pod running meaning canary requests will fail.

Although the canary of the rollout status points to the replicaset with 0 pods, the virtualservice have 0 weight to the canary rs; so no traffic will be sent to the replicaset.

So we need to have same behavior as we have at the end of full promotion where both canary and stable subset points to stable replicaset.

I agree that the status should reset to the status where canary and stable points to the stable replicaset after update is aborted.

huikang avatar Feb 04 '22 04:02 huikang

even if virtual service is having weight 0 for the canary, there is a case when you have session affinity enabled then in that case your traffic will still go to canary replica set.becuase in virtual service you have that rule defined for our canary session affinity cookie.

Apart from that, let me know if i can help to fix this bug, i already found the place when we are updating the hash but not able to find why its diff that full promotion. https://github.com/argoproj/argo-rollouts/blob/5f0f8b4558467f5a1a09e7d08057ccf9973c01c7/rollout/trafficrouting.go#L130-L179

if you take a look at above code in all case it calls updatehash, but we need to find a place where its session canaryHash to stableHash in case of full promotion.

mksha avatar Feb 04 '22 06:02 mksha

@huikang

mksha avatar Feb 04 '22 11:02 mksha

I think the root cause is https://github.com/argoproj/argo-rollouts/blob/18677425dc2447a3c43683b1bdac3d031e4de6d6/rollout/trafficrouting.go#L126-L128

Even after the update is aborted, c.newRS still points to the 0-replica rs, which leaves the newRS hash in the status.canary.weights.canary. Maybe we can set newRS to stable RS if newRS is being deleted at https://github.com/argoproj/argo-rollouts/blob/0309eb6b4a2176995be1842a0f8d6bfb7f6d9943/rollout/replicaset.go#L144-L147

I might miss some cases. @jessesuen , could you provide any guidance?

huikang avatar Feb 04 '22 17:02 huikang