In the "Degraded" state, the canary service has no endpoints, so TraefikService can't route traffic to the stable ReplicaSet

Open ayanhamza opened this issue 3 years ago • 8 comments

Describe the bug

I found that in the "Degraded" phase, after a rollback to the previous version, the selector of the canary service does not return to the previous ReplicaSet hash; it still points at the new ReplicaSet hash. As a result, the canary service has no endpoints, and TraefikService can't serve traffic to the stable version because one of its services is not valid.

To Reproduce

  1. Use the Traefik provider for trafficRouting
  2. Get the new version of the app into the "Degraded" state after the canary steps
  3. Send requests, or check the app once it has rolled back to the previous stable version

Expected behavior

The previous, stable version of the app can receive requests. The selectors of canary-service.yml return to the hash of the previous RS.

Version

The latest release, 1.2.1, does not have Traefik support, so I built from the latest commits.

My yaml files

Rollout

  strategy:
    canary:
      canaryService: colorapi-canary
      stableService: colorapi-stable
      trafficRouting:
        traefik:
          weightedTraefikServiceName: colorapi
      analysis:
        templates:
        - templateName: web
      steps:
      - setWeight: 10
      - pause:
          duration: 50s
      - setWeight: 50
      - pause:
          duration: 50s

TraefikService


apiVersion: traefik.containo.us/v1alpha1
kind: TraefikService
metadata:
  name: colorapi
  namespace: devops-services
spec:
  weighted:
    services:
      - name: colorapi-canary
        port: 5000
      - name: colorapi-stable
        port: 5000

IngressRoute

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  annotations:
    kubernetes.io/ingress.class: traefik
  creationTimestamp: null
  name: colorapi
  namespace: devops-services
spec:
  entryPoints: []
  routes:
  - kind: Rule
    match: Host("colorapi.test.local.com") && PathPrefix("/")
    middlewares: []
    priority: 0
    services:
    - kind: TraefikService
      name: colorapi

After the Degraded state, the canary service doesn't have any endpoints (kubectl describe svc colorapi-canary):

Name:              colorapi-canary
Namespace:         devops-services
Labels:            app=colorapi
                   app.kubernetes.io/instance=colorapi
Annotations:       argo-rollouts.argoproj.io/managed-by-rollouts: colorapi
Selector:          app=colorapi,rollouts-pod-template-hash=75f9cc4fb4
Type:              ClusterIP
IP Families:       <none>
IP:                192.168.18.218
IPs:               <none>
Port:              web  5000/TCP
TargetPort:        5000/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

colorapi-stable:

Name:              colorapi-stable
Namespace:         devops-services
Labels:            app=colorapi
                   app.kubernetes.io/instance=colorapi
Annotations:       argo-rollouts.argoproj.io/managed-by-rollouts: colorapi
Selector:          app=colorapi,rollouts-pod-template-hash=5c559c684b
Type:              ClusterIP
IP Families:       <none>
IP:                192.168.29.202
IPs:               <none>
Port:              web  5000/TCP
TargetPort:        5000/TCP
Endpoints:         192.168.189.142:5000,192.168.223.178:5000
Session Affinity:  None
Events:            <none>

The stable service has endpoints, but that is not enough for the old version to receive requests. I get 404 Not Found.

Logs from traefik:

{"level":"error","msg":"Error while building TraefikService: subset not found for devops-services/colorapi-canary","providerName":"kubernetescrd","serviceName":"colorapi","time":"2022-06-24T11:52:19Z"}
{"level":"error","msg":"Error while building TraefikService: subset not found for devops-services/colorapi-canary","providerName":"kubernetescrd","serviceName":"colorapi","time":"2022-06-24T11:54:12Z"}

The error "subset not found for devops-services/colorapi-canary" is raised because no endpoints match devops-services/colorapi-canary. That's why I get the 404. Help pls!!
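To illustrate why the canary service ends up empty, here is a minimal Python sketch that models label-selector matching with the two hashes from the `kubectl describe` output above. It is an illustration only, not Kubernetes, Argo Rollouts, or Traefik code: the `endpoints_for` helper is hypothetical, and the pod list assumes that after the rollback only pods carrying the stable hash are still running.

```python
# Hypothetical model of Service label-selector matching (not a Kubernetes client).

def endpoints_for(selector, pods):
    """Return addresses of pods whose labels satisfy every selector key/value."""
    return [
        pod["address"]
        for pod in pods
        if all(pod["labels"].get(key) == value for key, value in selector.items())
    ]

# Assumed state after the rollback: only pods with the stable hash remain.
pods = [
    {"address": "192.168.189.142:5000",
     "labels": {"app": "colorapi", "rollouts-pod-template-hash": "5c559c684b"}},
    {"address": "192.168.223.178:5000",
     "labels": {"app": "colorapi", "rollouts-pod-template-hash": "5c559c684b"}},
]

# Selectors copied from the two `kubectl describe svc` outputs.
canary_selector = {"app": "colorapi", "rollouts-pod-template-hash": "75f9cc4fb4"}
stable_selector = {"app": "colorapi", "rollouts-pod-template-hash": "5c559c684b"}

canary_endpoints = endpoints_for(canary_selector, pods)
stable_endpoints = endpoints_for(stable_selector, pods)

print("canary:", canary_endpoints)
print("stable:", stable_endpoints)
```

Under that assumption the canary selector matches no pod, which is consistent with the `Endpoints: <none>` line and with Traefik reporting a missing subset for colorapi-canary.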

ayanhamza, Jun 28 '22 04:06

Hi @PhilippPlotnikov, could you please look into this issue?

perenesenko, Jun 28 '22 18:06

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot], Oct 17 '22 04:10

@PhilippPlotnikov help pls

ayanhamza, Oct 17 '22 05:10

This issue is stale because it has been open 60 days with no activity.

github-actions[bot], Dec 18 '22 02:12

@PhilippPlotnikov any help on this?

idurgakalyan, Dec 23 '22 15:12

This issue is stale because it has been open 60 days with no activity.

github-actions[bot], Feb 22 '23 02:02

Hi 👋🏻

I just suffered the same problem that @ayanhamza described. Did someone find a solution for this? It seems pretty random. I'm using the latest version (v1.6.6).

alopezsanchez, Mar 22 '24 19:03

@perenesenko @PhilippPlotnikov Sorry for the mention, but this seems quite weird! I checked the logs of the argo-rollouts pods, and I found an infinite loop when rolling back the canary.

"2024-03-22T16:35:49.112Z","Started syncing rollout"
"2024-03-22T16:35:49.112Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.112Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:49.112Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:49.113Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.113Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:49.113Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:49.113Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:49.113Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:49.113Z","No status changes. Skipping patch"
"2024-03-22T16:35:49.113Z","Reconciliation completed"
"2024-03-22T16:35:49.113Z","Started syncing rollout"
"2024-03-22T16:35:49.113Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.113Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:49.113Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:49.113Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.113Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:49.113Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:49.113Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:49.113Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:49.113Z","No status changes. Skipping patch"
"2024-03-22T16:35:49.113Z","Reconciliation completed"
"2024-03-22T16:35:49.113Z","Started syncing rollout"
"2024-03-22T16:35:49.113Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.113Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:49.113Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:49.113Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.113Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:49.113Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:49.113Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:49.113Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:49.113Z","No status changes. Skipping patch"
"2024-03-22T16:35:49.113Z","Reconciliation completed"
"2024-03-22T16:35:49.113Z","Started syncing rollout"
"2024-03-22T16:35:50.114Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.114Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:50.114Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:50.115Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:50.115Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:50.115Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:50.115Z","No status changes. Skipping patch"
"2024-03-22T16:35:50.115Z","Reconciliation completed"
"2024-03-22T16:35:50.115Z","Started syncing rollout"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:50.115Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:50.115Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:50.115Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:50.115Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:50.115Z","No status changes. Skipping patch"
"2024-03-22T16:35:50.115Z","Reconciliation completed"
"2024-03-22T16:35:50.115Z","Started syncing rollout"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:50.115Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:50.115Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:50.115Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:50.115Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:50.115Z","No status changes. Skipping patch"
"2024-03-22T16:35:50.115Z","Reconciliation completed"
"2024-03-22T16:35:50.115Z","Started syncing rollout"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:50.115Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:50.115Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:50.115Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:50.115Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:50.115Z","No status changes. Skipping patch"
"2024-03-22T16:35:50.115Z","Reconciliation completed"
"2024-03-22T16:35:50.115Z","Started syncing rollout"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:50.115Z","Reconciling TrafficRouting with type 'Traefik'"
"2024-03-22T16:35:50.115Z","Event(v1.ObjectReference{Kind:""Rollout"", Namespace:""default"", Name:""my-service"", UID:""32819b8d-815c-45d1-96bc-5b6f1a4ca575"", APIVersion:""argoproj.io/v1alpha1"", ResourceVersion:""216543709"", FieldPath:""""}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:50.115Z","Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false"
"2024-03-22T16:35:50.115Z","Scale down new rs 'my-service-75cbf669bb' on abort (30s)"
"2024-03-22T16:35:50.115Z","New rs 'my-service-75cbf669bb' has scaledown deadline annotation: 2024-03-22T16:36:15Z"
"2024-03-22T16:35:50.115Z","RS 'my-service-75cbf669bb' has not reached the scaleDownTime"
"2024-03-22T16:35:50.115Z","No status changes. Skipping patch"
"2024-03-22T16:35:50.115Z","Reconciliation completed"
"2024-03-22T16:35:50.115Z","Started syncing rollout"

extract-2024-03-22T20_53_04.086Z.csv
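A loop like the one in the log above can be spotted mechanically. The following Python sketch is only an illustration, assuming the log is exported in the same two-column quoted format as the extract; the `count_switches` and `is_flapping` helpers are hypothetical names, not part of Argo Rollouts. It counts the "Switched selector" controller lines (skipping the duplicated Event lines) and reports flapping when the same service is switched in both directions.

```python
import re
from collections import Counter

# Matches only the controller's own "Switched selector" lines, not the Event(...) copies.
SWITCH = re.compile(
    r'''^"[^"]+","Switched selector for service '([^']+)' from '([^']+)' to '([^']+)'"$'''
)

def count_switches(lines):
    """Count (service, from_hash, to_hash) selector switches in the log lines."""
    counts = Counter()
    for line in lines:
        match = SWITCH.match(line.strip())
        if match:
            counts[match.groups()] += 1
    return counts

def is_flapping(counts):
    """True if some service was switched A -> B and also B -> A."""
    return any((svc, dst, src) in counts for (svc, src, dst) in counts)

# A few lines taken from the log extract above.
sample = '''
"2024-03-22T16:35:49.112Z","Started syncing rollout"
"2024-03-22T16:35:49.112Z","Switched selector for service 'my-service-canary' from '7b4766677b' to '75cbf669bb'"
"2024-03-22T16:35:49.112Z","Found 1 TrafficRouting Reconcilers"
"2024-03-22T16:35:49.113Z","Switched selector for service 'my-service-canary' from '75cbf669bb' to '7b4766677b'"
"2024-03-22T16:35:49.113Z","Reconciliation completed"
'''.strip().splitlines()

counts = count_switches(sample)
for (service, src, dst), n in counts.items():
    print(f"{service}: {src} -> {dst} x{n}")
print("flapping:", is_flapping(counts))
```

Running it over the full CSV extract instead of `sample` would give the number of switches in each direction, which may help when reporting how often the selector flips within a single sync.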

alopezsanchez, Mar 22 '24 20:03