
Ingress changes resulting in 502s


I was recently made aware of the following behavior and am wondering if it is intentional:

When you create an Ingress resource and then change one of its rules so that the Service it pointed to is no longer referenced by the Ingress, the change results in 502s while the NEG is de-provisioned.

This can be easily reproduced on GKE Autopilot as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: whoami
spec:
  replicas: 1
  selector:
    matchLabels:
      app: whoami
  template:
    metadata:
      labels:
        app: whoami
    spec:
      containers:
      - name: whoami
        image: traefik/whoami
---
apiVersion: v1
kind: Service
metadata:
  name: whoami
spec:
  ports:
  - name: http
    targetPort: 80
    port: 80
  selector:
    app: whoami
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  ports:
  - name: http
    targetPort: 80
    port: 80
  selector:
    app: nginx
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test
spec:
  rules:
  - host: foo.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: nginx
            port:
              number: 80
        path: /
        pathType: Prefix
  - host: bar.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: whoami
            port:
              number: 80
        path: /
        pathType: Prefix

If we change the backend of foo.mydomain.com above to whoami (see the sketch below), nginx becomes unused and we get 502s for about a minute. Unfortunately it is not possible to use a pre-existing NEG to work around this. The only workaround is to create a "dummy" rule to keep the NEG around and remove it later, once the changes have propagated and every new request is served by the new backend.
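
For reference, this is roughly what the modified Ingress looks like after that change (a sketch based on the manifest above; only the foo.mydomain.com backend is different, so the nginx Service is no longer referenced by any rule):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test
spec:
  rules:
  - host: foo.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: whoami # changed from nginx, so the nginx NEG gets deleted
            port:
              number: 80
        path: /
        pathType: Prefix
  - host: bar.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: whoami
            port:
              number: 80
        path: /
        pathType: Prefix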

Is this expected behavior? It would be great if the NEG were not removed immediately but kept around for a few minutes to avoid 502s.

trevex · Jul 22 '22

/kind bug

swetharepakula · Aug 29 '22

Hi @trevex,

Thank you for creating the issue! We need a little more information to understand what the issue could be. Are you seeing that the NEG is being prematurely removed from the BackendService, or that the NEG is being deleted before the BackendService is updated with the correct NEGs?

Thanks, Swetha

swetharepakula · Aug 29 '22

Hi @swetharepakula,

Basically the latter: the NEG is deleted before the BackendService is updated with the correct NEGs (or before the changes have fully propagated to GFE and become active).

trevex · Aug 30 '22

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Nov 28 '22

/remove-lifecycle stale

kundan2707 · Dec 07 '22

@trevex, sorry for the delayed response. It sounds like you are asking for the ability to migrate from one service to another without any downtime. Unfortunately, with how the controllers work, this is not possible. When using Ingress, the NEG lifecycle is tied to the service being referenced by an Ingress. From the NEG controller's perspective, if a service is no longer referenced by any Ingress, it deletes the NEG. This is expected behavior, since this situation cannot be distinguished from simply removing a service from an Ingress. There is then a race between the Ingress controller updating the BackendService and the NEG controller deleting the NEG.

If you would like to keep both around during a transition period, the method you mentioned is the best approach: keep a dummy path so that the NEG controller does not delete the NEG while your Ingress gets updated (sketched below). However, you run the risk of temporarily exposing that third path. Then, after confirming that the BackendService is updated as expected, you can remove the dummy rule so that the old NEG is deleted.
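
As a sketch only (the dummy.mydomain.com host below is just a placeholder for illustration, not anything the controller requires), a transitional Ingress could keep the nginx Service referenced while traffic moves over:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test
spec:
  rules:
  - host: foo.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: whoami # new backend for foo
            port:
              number: 80
        path: /
        pathType: Prefix
  - host: bar.mydomain.com
    http:
      paths:
      - backend:
          service:
            name: whoami
            port:
              number: 80
        path: /
        pathType: Prefix
  - host: dummy.mydomain.com # temporary rule that keeps the nginx NEG alive during the transition
    http:
      paths:
      - backend:
          service:
            name: nginx
            port:
              number: 80
        path: /
        pathType: Prefix

Once the BackendService is confirmed to be updated and foo.mydomain.com is served by whoami, the dummy rule can be removed and the old nginx NEG will be deleted.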

swetharepakula · Jan 23 '23

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Apr 23 '23

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · May 23 '23

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Jun 22 '23

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to the triage robot's /close not-planned comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Jun 22 '23