In the "Degraded" state, canary-service has no endpoints, so TraefikService can't route traffic to the stable ReplicaSet
Describe the bug
In the "Degraded" phase, after the rollout is rolled back to the previous version, the selector of the canary Service is not reset to the previous (stable) ReplicaSet hash; it keeps pointing at the new ReplicaSet hash. As a result, the canary Service has no endpoints, and the TraefikService can't serve traffic to the stable version because one of its two backing Services is invalid.
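To illustrate why a stale selector leaves the Service without endpoints, here is a minimal Python model of label-selector matching. It does not talk to a cluster. The hashes and pod IPs are taken from the kubectl describe output further down; the assumption that only the stable ReplicaSet still has pods after the abort is mine.

```python
# Simplified model of Service label-selector matching (no cluster access).

def matches(selector, labels):
    """A pod matches when every selector key/value is present in its labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints(selector, pods):
    """Return the IPs of the pods selected by the Service selector."""
    return [pod["ip"] for pod in pods if matches(selector, pod["labels"])]


# Assumption: after the abort only the stable ReplicaSet still has pods.
pods = [
    {"ip": "192.168.189.142",
     "labels": {"app": "colorapi", "rollouts-pod-template-hash": "5c559c684b"}},
    {"ip": "192.168.223.178",
     "labels": {"app": "colorapi", "rollouts-pod-template-hash": "5c559c684b"}},
]

stable_selector = {"app": "colorapi", "rollouts-pod-template-hash": "5c559c684b"}
canary_selector = {"app": "colorapi", "rollouts-pod-template-hash": "75f9cc4fb4"}

print("stable:", endpoints(stable_selector, pods))
print("canary:", endpoints(canary_selector, pods))
```

With these inputs the stable selector resolves to both pod IPs, while the canary selector, still carrying the new hash, resolves to an empty list.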
To Reproduce
- Use the Traefik provider for trafficRouting
- Deploy a new version that ends up in the "Degraded" state after the canary steps
- Send requests to the app, or check it, once it has been rolled back to the previous stable version
Expected behavior
The previous, stable version of the app can receive requests. The selector of canary-service is reset to the hash of the previous ReplicaSet.
Version
The latest release, 1.2.1, does not have Traefik support, so I built from the latest commits.
My yaml files
Rollout
strategy:
  canary:
    canaryService: colorapi-canary
    stableService: colorapi-stable
    trafficRouting:
      traefik:
        weightedTraefikServiceName: colorapi
    analysis:
      templates:
      - templateName: web
    steps:
    - setWeight: 10
    - pause:
        duration: 50s
    - setWeight: 50
    - pause:
        duration: 50s
TraefikService
apiVersion: traefik.containo.us/v1alpha1
kind: TraefikService
metadata:
  name: colorapi
  namespace: devops-services
spec:
  weighted:
    services:
    - name: colorapi-canary
      port: 5000
    - name: colorapi-stable
      port: 5000
IngressRoute
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  annotations:
    kubernetes.io/ingress.class: traefik
  creationTimestamp: null
  name: colorapi
  namespace: devops-services
spec:
  entryPoints: []
  routes:
  - kind: Rule
    match: Host("colorapi.test.local.com") && PathPrefix("/")
    middlewares: []
    priority: 0
    services:
    - kind: TraefikService
      name: colorapi
After the Degraded state, the canary Service doesn't have any endpoints. Output of kubectl describe svc colorapi-canary:
Name:              colorapi-canary
Namespace:         devops-services
Labels:            app=colorapi
                   app.kubernetes.io/instance=colorapi
Annotations:       argo-rollouts.argoproj.io/managed-by-rollouts: colorapi
Selector:          app=colorapi,rollouts-pod-template-hash=75f9cc4fb4
Type:              ClusterIP
IP Families:       <none>
IP:                192.168.18.218
IPs:               <none>
Port:              web  5000/TCP
TargetPort:        5000/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>
colorapi-stable:
Name:              colorapi-stable
Namespace:         devops-services
Labels:            app=colorapi
                   app.kubernetes.io/instance=colorapi
Annotations:       argo-rollouts.argoproj.io/managed-by-rollouts: colorapi
Selector:          app=colorapi,rollouts-pod-template-hash=5c559c684b
Type:              ClusterIP
IP Families:       <none>
IP:                192.168.29.202
IPs:               <none>
Port:              web  5000/TCP
TargetPort:        5000/TCP
Endpoints:         192.168.189.142:5000,192.168.223.178:5000
Session Affinity:  None
Events:            <none>
The stable Service has endpoints, but that is not enough for requests to reach the old version: I get 404 Not Found.
Logs from traefik:
{"level":"error","msg":"Error while building TraefikService: subset not found for devops-services/colorapi-canary","providerName":"kubernetescrd","serviceName":"colorapi","time":"2022-06-24T11:52:19Z"}
{"level":"error","msg":"Error while building TraefikService: subset not found for devops-services/colorapi-canary","providerName":"kubernetescrd","serviceName":"colorapi","time":"2022-06-24T11:54:12Z"}
The error "subset not found for devops-services/colorapi-canary" is raised because no endpoints match the devops-services/colorapi-canary Service. That is why I get the 404. Help please!
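The failure mode can be sketched roughly like this. This is not Traefik's code, only a simplified Python model of the behaviour the log suggests: if any child Service of the weighted TraefikService has no endpoints, building the whole TraefikService fails, so the healthy stable Service is never used. The function and class names are hypothetical.

```python
class SubsetNotFound(Exception):
    """Raised when a child Service has no endpoints (hypothetical name)."""


def build_weighted_service(namespace, children, endpoints_by_service):
    """Resolve every child Service; fail as a whole if one has no endpoints."""
    backends = {}
    for name in children:
        addresses = endpoints_by_service.get(name, [])
        if not addresses:
            raise SubsetNotFound(f"subset not found for {namespace}/{name}")
        backends[name] = addresses
    return backends


# Endpoints as reported by kubectl describe in this issue.
endpoints_by_service = {
    "colorapi-canary": [],
    "colorapi-stable": ["192.168.189.142:5000", "192.168.223.178:5000"],
}

try:
    build_weighted_service(
        "devops-services",
        ["colorapi-canary", "colorapi-stable"],
        endpoints_by_service,
    )
except SubsetNotFound as err:
    print("Error while building TraefikService:", err)
```

In this model the stable endpoints are present, but the build aborts on the canary entry first, which matches the 404 I see.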
Hi @PhilippPlotnikov, could you please look at this issue?
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.
@PhilippPlotnikov help pls
@PhilippPlotnikov any help on this?
Hi 👋🏻
I just suffered the same problem that @ayanhamza described. Did someone find a solution for this? It seems pretty random. I'm using the latest version (v1.6.6).
@perenesenko @PhilippPlotnikov Sorry for the mention, but this seems quite weird!
I checked the logs of the argo-rollouts pods, and I found an infinite loop when rolling back the canary.
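The flapping is easy to miss in the raw output below, so here is a minimal Python sketch that counts the "Switched selector" messages and reports any Service whose selector is switched in both directions. The function names are my own, and the sample lines are copied from the log excerpt that follows.

```python
import re
from collections import Counter

SWITCH = re.compile(
    r"Switched selector for service '([^']+)' from '([0-9a-z]+)' to '([0-9a-z]+)'"
)


def count_switches(lines):
    """Count selector switches per (service, from_hash, to_hash)."""
    counts = Counter()
    for line in lines:
        if "Event(" in line:
            # The Event line repeats the same switch message; skip it.
            continue
        match = SWITCH.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


def flapping_services(counts):
    """Services whose selector was switched both A -> B and B -> A."""
    return sorted({svc for (svc, a, b) in counts if (svc, b, a) in counts})


sample = """\
"2024-03-22T16:35:49.112Z","Started syncing rollout"
"2024-03-22T16:35:49.112Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.113Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Reconciliation completed"
""".splitlines()

counts = count_switches(sample)
print(flapping_services(counts))
```

Run against the full log, the two directions should show roughly equal counts, one pair per reconciliation. Full log excerpt: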
"2024-03-22T16:35:49.112Z","Started syncing rollout"
"2024-03-22T16:35:49.112Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.112Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:49.112Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:49.113Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.113Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:49.113Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:49.113Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:49.113Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:49.113Z","No status changes. Skipping patch"
"2024-03-22T16:35:49.113Z","Reconciliation completed"
"2024-03-22T16:35:49.113Z","Started syncing rollout"
"2024-03-22T16:35:49.113Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.113Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:49.113Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:49.113Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.113Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:49.113Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:49.113Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:49.113Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:49.113Z","No status changes. Skipping patch"
"2024-03-22T16:35:49.113Z","Reconciliation completed"
"2024-03-22T16:35:49.113Z","Started syncing rollout"
"2024-03-22T16:35:49.113Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.113Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:49.113Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:49.113Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.113Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:49.113Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:49.113Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:49.113Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:49.113Z","No status changes. Skipping patch"
"2024-03-22T16:35:49.113Z","Reconciliation completed"
"2024-03-22T16:35:49.113Z","Started syncing rollout"
"2024-03-22T16:35:50.114Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.114Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:50.114Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:50.115Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:50.115Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:50.115Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:50.115Z","No status changes. Skipping patch"
"2024-03-22T16:35:50.115Z","Reconciliation completed"
"2024-03-22T16:35:50.115Z","Started syncing rollout"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:50.115Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:50.115Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:50.115Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:50.115Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:50.115Z","No status changes. Skipping patch"
"2024-03-22T16:35:50.115Z","Reconciliation completed"
"2024-03-22T16:35:50.115Z","Started syncing rollout"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:50.115Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:50.115Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:50.115Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:50.115Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:50.115Z","No status changes. Skipping patch"
"2024-03-22T16:35:50.115Z","Reconciliation completed"
"2024-03-22T16:35:50.115Z","Started syncing rollout"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:50.115Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:50.115Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:50.115Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:50.115Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:50.115Z","No status changes. Skipping patch"
"2024-03-22T16:35:50.115Z","Reconciliation completed"
"2024-03-22T16:35:50.115Z","Started syncing rollout"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:50.115Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:50.115Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:50.115Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:50.115Z","No status changes. Skipping patch"
"2024-03-22T16:35:50.115Z","Reconciliation completed"
"2024-03-22T16:35:50.115Z","Started syncing rollout"