argo-rollouts
After abort, rollout does not update Istio DestinationRule
Summary
What happened/what you expected to happen?
When we abort the rollout, it should update the Istio DestinationRule so that the subsets in the DestinationRule carry the correct rollouts-pod-template-hash label.
But currently, when we abort the rollout, it does not update the destination rule, causing the canary endpoint to fail.
Diagnostics
What version of Argo Rollouts are you running? 0.3.0
# Paste the logs from the rollout controller
The canary subset is not updated with the correct rollouts-pod-template-hash label value.
# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts
# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>
The logs do not contain any reference to a DestinationRule update after the abort action.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
Hi @mksha, could you confirm whether this happens on v0.3.0? Have you tried v1.1.1 or the latest argo-rollouts image?
yes, we are using 1.1.1
@mksha, could you paste here the output from get rollout <rollout name> and get destinationrule <name> -o yaml? Thanks.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  labels:
    region: us-east-1
    system_risk_class: "1"
  managedFields:
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:rollout.argoproj.io/revision: {}
      f:spec:
        f:replicas: {}
      f:status:
        .: {}
        f:HPAReplicas: {}
        f:availableReplicas: {}
        f:blueGreen: {}
        f:canary:
          .: {}
          f:weights:
            .: {}
            f:canary:
              .: {}
              f:podTemplateHash: {}
              f:weight: {}
            f:stable:
              .: {}
              f:podTemplateHash: {}
              f:weight: {}
        f:conditions: {}
        f:currentPodHash: {}
        f:currentStepHash: {}
        f:currentStepIndex: {}
        f:observedGeneration: {}
        f:phase: {}
        f:readyReplicas: {}
        f:replicas: {}
        f:selector: {}
        f:stableRS: {}
        f:updatedReplicas: {}
    manager: rollouts-controller
    operation: Update
    time: "2022-02-02T15:10:44Z"
  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/version: {}
          f:app_component: {}
          f:app_contacts: {}
          f:app_environment: {}
          f:application: {}
          f:argocd.argoproj.io/instance: {}
          f:business_unit: {}
          f:created_by: {}
          f:function: {}
          f:helm.sh/chart: {}
          f:network_environment: {}
          f:region: {}
          f:system_risk_class: {}
      f:spec:
        .: {}
        f:minReadySeconds: {}
        f:paused: {}
        f:progressDeadlineSeconds: {}
        f:revisionHistoryLimit: {}
        f:selector:
          .: {}
          f:matchLabels:
            .: {}
            f:app.kubernetes.io/component: {}
            f:app.kubernetes.io/instance: {}
            f:app.kubernetes.io/managed-by: {}
            f:app.kubernetes.io/name: {}
            f:app_component: {}
            f:app_environment: {}
            f:application: {}
            f:business_unit: {}
            f:function: {}
            f:network_environment: {}
            f:region: {}
        f:strategy:
          .: {}
          f:canary:
            .: {}
            f:antiAffinity:
              .: {}
              f:preferredDuringSchedulingIgnoredDuringExecution:
                .: {}
                f:weight: {}
            f:canaryMetadata:
              .: {}
              f:annotations:
                .: {}
                f:role: {}
              f:labels:
                .: {}
                f:role: {}
            f:stableMetadata:
              .: {}
              f:annotations:
                .: {}
                f:role: {}
              f:labels:
                .: {}
                f:role: {}
            f:steps: {}
            f:trafficRouting:
              .: {}
              f:istio:
                .: {}
                f:destinationRule:
                  .: {}
                  f:canarySubsetName: {}
                  f:name: {}
                  f:stableSubsetName: {}
                f:virtualServices: {}
        f:template:
          .: {}
          f:metadata:
            .: {}
            f:labels:
              .: {}
              f:app.kubernetes.io/component: {}
              f:app.kubernetes.io/instance: {}
              f:app.kubernetes.io/managed-by: {}
              f:app.kubernetes.io/name: {}
              f:app.kubernetes.io/version: {}
              f:app_component: {}
              f:app_contacts: {}
              f:app_environment: {}
              f:application: {}
              f:business_unit: {}
              f:created_by: {}
              f:function: {}
              f:helm.sh/chart: {}
              f:network_environment: {}
              f:region: {}
              f:system_risk_class: {}
          f:spec:
            .: {}
            f:affinity:
              .: {}
              f:nodeAffinity:
                .: {}
                f:preferredDuringSchedulingIgnoredDuringExecution: {}
              f:podAntiAffinity:
                .: {}
                f:preferredDuringSchedulingIgnoredDuringExecution: {}
            f:containers: {}
            f:imagePullSecrets: {}
            f:serviceAccountName: {}
            f:volumes: {}
    manager: argocd-application-controller
    operation: Update
    time: "2022-02-02T15:10:49Z"
  name: testapp
  namespace: test-ns
spec:
  minReadySeconds: 60
  paused: false
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app.kubernetes.io/component: testapp
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: testapp
      app_component: testapp
      app_environment: dev
      application: testapp
      business_unit: bu
      function: appserver
      network_environment: dev
      region: us-east-1
  strategy:
    canary:
      antiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          weight: 1
      canaryMetadata:
        annotations:
          role: canary
        labels:
          role: canary
      stableMetadata:
        annotations:
          role: stable
        labels:
          role: stable
      steps:
      - setWeight: 10
      - pause: {}
      - setWeight: 50
      - pause: {}
      - setWeight: 100
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: canary
            name: testapp
            stableSubsetName: stable
          virtualServices:
          - name: testapp-internal
            routes:
            - NewSessionDoc
            - NewSessionDocInternal
          - name: testapp-external
            routes:
            - NewSessionDoc
  template:
    metadata:
      labels:
        app.kubernetes.io/component: testapp
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: testapp
        app.kubernetes.io/version: 09c502fab0a1043740c21566dec377ce9f17e3cb-9
        app_component: testapp
        app_environment: dev
        application: testapp
        business_unit: bu
        function: appserver
        helm.sh/chart: webapp-5.0.1-alpha.9
        network_environment: dev
        region: us-east-1
        system_risk_class: "1"
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: workload
                operator: In
                values:
                - test
            weight: 50
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/test
                operator: In
                values:
                - ""
            weight: 1
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - testapp
                - key: app.kubernetes.io/version
                  operator: In
                  values:
                  - 09c502fab0a1043740c21566dec377ce9f17e3cb-9
              topologyKey: topology.kubernetes.io/zone
            weight: 51
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - testapp
                - key: app.kubernetes.io/version
                  operator: In
                  values:
                  - 09c502fab0a1043740c21566dec377ce9f17e3cb-9
              topologyKey: kubernetes.io/hostname
            weight: 50
      containers:
      - env:
        - name: JAVA_MS
          value: -Xms1g
        - name: JAVA_MX
          value: -Xmx1g
        - name: PROFILE_ENV
          value: dev
        image: testapp:09c502fab0a1043740c21566dec377ce9f17e3cb-9
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /keepalive.html
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 150
          successThreshold: 1
        name: testapp
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /keepalive.html
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 150
          successThreshold: 1
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 1
            memory: 1.5Gi
        volumeMounts:
        - mountPath: /web/company/docs/
          name: pv-volume
      imagePullSecrets:
      - name: credreg
      serviceAccountName: testapp
      volumes:
      - name: pv-volume
        nfs:
          path: /web/company/docs/
          server: company.com
status:
  HPAReplicas: 1
  availableReplicas: 1
  blueGreen: {}
  canary:
    weights:
      canary:
        podTemplateHash: 649fdf8c7d
        weight: 0
      stable:
        podTemplateHash: 649fdf8c7d
        weight: 100
  conditions:
  - lastTransitionTime: "2022-02-02T15:08:35Z"
    lastUpdateTime: "2022-02-02T15:08:35Z"
    message: Rollout is paused
    reason: RolloutPaused
    status: "False"
    type: Paused
  - lastTransitionTime: "2022-02-02T15:10:44Z"
    lastUpdateTime: "2022-02-02T15:10:44Z"
    message: Rollout has minimum availability
    reason: AvailableReason
    status: "True"
    type: Available
  - lastTransitionTime: "2022-02-02T15:13:19Z"
    lastUpdateTime: "2022-02-02T15:13:19Z"
    message: RolloutCompleted
    reason: RolloutCompleted
    status: "True"
    type: Completed
  - lastTransitionTime: "2022-02-02T15:08:35Z"
    lastUpdateTime: "2022-02-02T15:13:19Z"
    message: ReplicaSet "testapp-649fdf8c7d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  currentPodHash: 649fdf8c7d
  currentStepHash: 5cd664bc9d
  currentStepIndex: 5
  observedGeneration: "115"
  phase: Healthy
  readyReplicas: 1
  replicas: 1
  selector: app.kubernetes.io/component=testapp,app.kubernetes.io/managed-by=Helm,app.kubernetes.io/name=testapp,app_component=testapp,app_environment=dev,application=testapp,business_unit=bu,function=appserver,network_environment=dev,region=us-east-1
  stableRS: 649fdf8c7d
  updatedReplicas: 1
@huikang
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  annotations:
    argo-rollouts.argoproj.io/managed-by-rollouts: testapp
  creationTimestamp: "2021-11-29T13:48:24Z"
  generation: 147
  labels:
    app.kubernetes.io/component: testapp
  managedFields:
  - apiVersion: networking.istio.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/version: {}
          f:app_component: {}
          f:app_contacts: {}
          f:app_environment: {}
          f:application: {}
          f:argocd.argoproj.io/instance: {}
          f:business_unit: {}
          f:created_by: {}
          f:function: {}
          f:helm.sh/chart: {}
          f:network_environment: {}
          f:region: {}
          f:system_risk_class: {}
      f:spec:
        .: {}
        f:host: {}
        f:trafficPolicy:
          .: {}
          f:loadBalancer:
            .: {}
            f:consistentHash:
              .: {}
              f:httpCookie:
                .: {}
                f:name: {}
                f:path: {}
                f:ttl: {}
    manager: argocd-application-controller
    operation: Update
    time: "2022-02-02T15:10:44Z"
  - apiVersion: networking.istio.io/v1alpha3
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:argo-rollouts.argoproj.io/managed-by-rollouts: {}
      f:spec:
        f:subsets: {}
    manager: rollouts-controller
    operation: Update
    time: "2022-02-02T15:10:44Z"
  name: testapp
  namespace: default
  resourceVersion: "1013555439"
  selfLink: /apis/networking.istio.io/v1beta1/namespaces/default/destinationrules/testapp
  uid: 4d21d1d1-e039-415d-942a-b7ce99dbdd93
spec:
  host: testapp
  subsets:
  - labels:
      app.kubernetes.io/component: testapp
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: testapp
      app_component: testapp
      app_environment: dev
      application: testapp
      business_unit: pdo
      function: appserver
      network_environment: dev
      region: us-east-1
      rollouts-pod-template-hash: 56b5f6b98d
    name: canary
  - labels:
      app.kubernetes.io/component: testapp
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: testapp
      app_component: testapp
      app_environment: dev
      application: testapp
      business_unit: pdo
      function: appserver
      network_environment: dev
      region: us-east-1
      rollouts-pod-template-hash: 649fdf8c7d
    name: stable
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:
          name: testapp
          path: /
          ttl: 1m0s
@huikang let me know if you need more details. thanks
Hi @mksha, thanks for the detailed info. It would be even better if you could provide the output of kubectl argo rollouts get rollout <rollout name>. Thanks.
@jgwest @alexec
Also, how about the status of the virtual service testapp-external?
I couldn't reproduce the error with the latest master branch: after aborting an update, the VirtualService is set with 100 weight to the stable subset:
http:
- name: primary
  route:
  - destination:
      host: istio-rollout
      subset: stable
    weight: 100
  - destination:
      host: istio-rollout
      subset: canary
    weight: 0
The destination rule has the correct pod labels:
spec:
  host: istio-rollout
  subsets:
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: 8c48ddb95
    name: canary
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: f4dcdc666
    name: stable
# rollout status
canary:
  weights:
    canary:
      podTemplateHash: 8c48ddb95
      weight: 0
    stable:
      podTemplateHash: f4dcdc666
      weight: 100
And the stable pod is still running
get replicaset
NAME                      DESIRED   CURRENT   READY   AGE
istio-rollout-8c48ddb95   0         0         0       7m20s
istio-rollout-f4dcdc666   1         1         1       7m36s
@mksha, in your case, I noticed the following in the rollout status:
canary:
  weights:
    canary:
      podTemplateHash: 649fdf8c7d
      weight: 0
    stable:
      podTemplateHash: 649fdf8c7d
      weight: 100
The canary pod is supposed to have a different hash value. I remember there was a bug related to this that was fixed after v1.1.1. Could you try using the latest version?
Name:            testapp
Namespace:       default
Status:          ✖ Degraded
Message:         RolloutAborted: Rollout aborted update to revision 11
Strategy:        Canary
  Step:          0/5
  SetWeight:     0
  ActualWeight:  0
Images:          testapp:09c502fab0a1043740c21566dec377ce9f17e3cb-9 (stable)
Replicas:
  Desired:       1
  Current:       1
  Updated:       0
  Ready:         1
  Available:     1

NAME                                 KIND        STATUS        AGE    INFO
⟳ gdocument                          Rollout     ✖ Degraded    65d
├──# revision:11
│  └──⧉ testapp-75dfb84574           ReplicaSet  • ScaledDown  27m    canary,delay:passed
├──# revision:10
│  └──⧉ testapp-649fdf8c7d           ReplicaSet  ✔ Healthy     4h3m   stable
│     └──□ testapp-649fdf8c7d-wf79v  Pod         ✔ Running     107m   ready:2/2
└──# revision:9
   └──⧉ testapp-56b5f6b98d           ReplicaSet  • ScaledDown  4h13m
@huikang there you go, this is after aborting the rollout.
@mksha, could you try the latest image? I think the issue has been fixed there.
Hi @huikang, after the abort both subsets in the DestinationRule should have the same rollouts-pod-template-hash (f4dcdc666 in your example), because the canary pod is no longer there to serve canary traffic. If the destination rule does not point both the stable and canary subsets to stable, canary requests will fail.
@huikang do you mean trying an image built from the master branch? Because I am already using v1.1.1.
canary:
  weights:
    canary:
      podTemplateHash: 5865cf698b
      weight: 0
    stable:
      podTemplateHash: 649fdf8c7d
      weight: 100
This is from the rollout that was aborted using image v1.1.1. But the destination rule is still not updated after the canary pod is deleted:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  annotations:
    argo-rollouts.argoproj.io/managed-by-rollouts: testapp
  name: testapp
spec:
  host: testapp
  subsets:
  - labels:
      app.kubernetes.io/component: testapp
      rollouts-pod-template-hash: 5865cf698b
    name: canary
  - labels:
      app.kubernetes.io/component: testapp
      rollouts-pod-template-hash: 649fdf8c7d
    name: stable
@huikang after the abort both subsets should point to the stable hash,
or the pod hash label should be removed completely after the abort.
I just figured out that before a rollout starts and after a rollout completes, both the canary and stable subsets should point to the stable pod template hash.
For example:
weights:
  canary:
    podTemplateHash: 74c7887f8
    weight: 0
  stable:
    podTemplateHash: 74c7887f8
    weight: 100
But in our case after the abort, it is:
canary:
  weights:
    canary:
      podTemplateHash: 6d947796cd
      weight: 0
    stable:
      podTemplateHash: 74c7887f8
      weight: 100
That is what is being copied to the destination rule, and that is why it is causing issues.
Tested with the latest version and got the following after aborting a canary update:
# Rollout status
status:
  canary:
    currentBackgroundAnalysisRunStatus:
      message: Run Terminated
      name: istio-rollout-8c48ddb95-2
      status: Successful
    weights:
      canary:
        podTemplateHash: 8c48ddb95
        weight: 0
      stable:
        podTemplateHash: f4dcdc666
        weight: 100
The destinationrule matches the status:
spec:
  host: istio-rollout
  subsets:
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: 8c48ddb95
    name: canary
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: f4dcdc666
    name: stable
@huikang which means it is wrong, because in that case the canary subset will point to the canary ReplicaSet that does not have any pods running, meaning canary requests will fail.
Ideally, when we do a canary deployment in production, some traffic goes to the canary version of the app while we run analysis on it. If the analysis fails, the rollout is aborted, and in that case we are left with a canary subset pointing to a canary ReplicaSet that has no pods, meaning all clients routed to the canary version will see issues.
So we need the same behavior as at the end of a full promotion, where both the canary and stable subsets point to the stable ReplicaSet.
@huikang which means it is wrong, because in that case the canary subset will point to the canary ReplicaSet that does not have any pods running, meaning canary requests will fail.
Although the canary in the rollout status points to the ReplicaSet with 0 pods, the VirtualService has weight 0 for the canary subset, so no traffic will be sent to that ReplicaSet.
So we need the same behavior as at the end of a full promotion, where both the canary and stable subsets point to the stable ReplicaSet.
I agree that the status should reset so that both canary and stable point to the stable ReplicaSet after the update is aborted.
Even if the VirtualService has weight 0 for the canary, there is a case where session affinity is enabled; then traffic will still go to the canary ReplicaSet, because the VirtualService has a rule defined for the canary session-affinity cookie.
Apart from that, let me know if I can help fix this bug. I already found the place where we are updating the hash, but I am not able to find why it differs from a full promotion: https://github.com/argoproj/argo-rollouts/blob/5f0f8b4558467f5a1a09e7d08057ccf9973c01c7/rollout/trafficrouting.go#L130-L179
If you look at the code above, it calls UpdateHash in every case, but we need to find the place where it sets canaryHash to stableHash in the case of a full promotion (a rough sketch of the intended behavior follows below).
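To make that intended behavior concrete, here is a minimal, self-contained Go sketch. This is not the actual argo-rollouts code: rolloutState and subsetHashes are hypothetical simplifications of the inputs the traffic-routing reconciler passes to UpdateHash, shown only to illustrate the rule being asked for (no live canary means both subsets fall back to the stable hash).

package main

import "fmt"

// rolloutState is a hypothetical, simplified model of what the controller
// knows when it updates the DestinationRule subset hashes.
type rolloutState struct {
	canaryHash    string // rollouts-pod-template-hash of the canary (new) ReplicaSet
	stableHash    string // rollouts-pod-template-hash of the stable ReplicaSet
	fullyPromoted bool   // the canary has become the new stable
	aborted       bool   // the update was aborted
}

// subsetHashes returns the hashes that should be written into the
// DestinationRule's canary and stable subsets. The rule proposed in this
// thread: when there is no live canary (fully promoted, aborted, or no new
// ReplicaSet), the canary subset should fall back to the stable hash so
// requests pinned to the canary subset (e.g. via a session-affinity cookie)
// still reach running pods.
func subsetHashes(s rolloutState) (canary, stable string) {
	if s.fullyPromoted || s.aborted || s.canaryHash == "" {
		return s.stableHash, s.stableHash
	}
	return s.canaryHash, s.stableHash
}

func main() {
	aborted := rolloutState{canaryHash: "5865cf698b", stableHash: "649fdf8c7d", aborted: true}
	c, s := subsetHashes(aborted)
	// Prints: canary subset -> 649fdf8c7d, stable subset -> 649fdf8c7d
	fmt.Printf("canary subset -> %s, stable subset -> %s\n", c, s)
}

With a rule like this, the aborted example above would end up with 649fdf8c7d in both subsets, which is the behavior requested in this issue.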
@huikang
I think the root cause is https://github.com/argoproj/argo-rollouts/blob/18677425dc2447a3c43683b1bdac3d031e4de6d6/rollout/trafficrouting.go#L126-L128
Even after the update is aborted, c.newRS still points to the 0-replica ReplicaSet, which leaves the newRS hash in status.canary.weights.canary.
Maybe we can set newRS to the stable RS if newRS is being deleted, at https://github.com/argoproj/argo-rollouts/blob/0309eb6b4a2176995be1842a0f8d6bfb7f6d9943/rollout/replicaset.go#L144-L147
I might be missing some cases. @jessesuen, could you provide any guidance?
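For the "set newRS to the stable RS when it is being deleted" idea, here is a similarly hedged, self-contained sketch. replicaSet and effectiveNewRS are hypothetical stand-ins for the controller's real types and helpers, not the actual implementation, and (as noted above) a real fix would need to cover cases this simplification ignores.

package main

import "fmt"

// replicaSet is a hypothetical stand-in for the few ReplicaSet fields this
// discussion cares about.
type replicaSet struct {
	name            string
	podTemplateHash string
	replicas        int32
	beingDeleted    bool // e.g. its post-abort scale-down deadline has passed
}

// effectiveNewRS sketches the proposal above: if the new (canary) ReplicaSet
// is being deleted or already scaled to zero after an abort, treat the stable
// ReplicaSet as the "new" one, so the hash written into
// status.canary.weights.canary collapses to the stable hash.
func effectiveNewRS(newRS, stableRS *replicaSet, aborted bool) *replicaSet {
	if newRS == nil || (aborted && (newRS.beingDeleted || newRS.replicas == 0)) {
		return stableRS
	}
	return newRS
}

func main() {
	canary := &replicaSet{name: "testapp-5865cf698b", podTemplateHash: "5865cf698b", beingDeleted: true}
	stable := &replicaSet{name: "testapp-649fdf8c7d", podTemplateHash: "649fdf8c7d", replicas: 1}
	rs := effectiveNewRS(canary, stable, true)
	// Prints: hash used for the canary subset: 649fdf8c7d
	fmt.Println("hash used for the canary subset:", rs.podTemplateHash)
}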