ingress-gce icon indicating copy to clipboard operation
ingress-gce copied to clipboard

Unable to avoid unhealthy backend / 502s on rolling deployments

Open rocketraman opened this issue 2 years ago • 47 comments

I have a GCE ingress in front of an HPA-managed deployment (at this time, with a single replica).

On a rolling deployment, I sometimes run into the backend being marked as unhealthy and resulting 502 errors, usually for about 15-20 seconds.

According to the pod events, the neg-readiness-reflector appears to mark cloud.google.com/load-balancer-neg-ready to True for the pod before it is actually ready:

Normal   LoadBalancerNegNotReady            18m                neg-readiness-reflector                Waiting for pod to become healthy in at least one of the NEG(s): [k8s1-600f13cf-default-my-svc-8080-f82bf741]
Normal   LoadBalancerNegWithoutHealthCheck  16m                neg-readiness-reflector                Pod is in NEG "Key{\"k8s1-600f13cf-default-my-svc-8080-f82bf741\", zone: \"europe-west1-c\"}". NEG is not attached to any Backend Service with health checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.
Warning  Unhealthy                          16m                kubelet                                Readiness probe failed: Get "http://10.129.128.130:8080/healthz": dial tcp 10.129.128.130:8080: connect: connection refused

While in this state, the previous pod terminates, but the load balancer does not route requests to the new pod, resulting in 502s.

I do have the deployment strategy set that should not allow this but I guess the neg being set to Ready is subverting this:

  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

My deployment does also define a readiness probe as can be seen in the events above.

I do also have a health check configuration defined for the backend:

apiVersion: v1
kind: Service
metadata:
  name: my-svc
  labels:
    app.kubernetes.io/name: mysvc
  annotations:
    cloud.google.com/backend-config: '{"ports": {"8080":"my-backendconfig"}}'
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: mysvc
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080
---
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: my-backendconfig
spec:
  timeoutSec: 45
  connectionDraining:
    drainingTimeoutSec: 0
  healthCheck:
    checkIntervalSec: 5
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 2
    type: HTTP
    requestPath: /healthz
    port: 8080

I found this stackoverflow in which the user works around the issue with delaying the pod stop with a sleep on the lifecycle.preStop, but that seems more like a hack than a proper solution to this issue: https://stackoverflow.com/questions/71127572/neg-is-not-attached-to-any-backendservice-with-health-checking.

rocketraman avatar May 19 '22 17:05 rocketraman

May also be related to https://github.com/kubernetes/ingress-gce/issues/1656.

rocketraman avatar May 19 '22 18:05 rocketraman

The NEG is not attached to any Backend Service with health checking message seems to be a red herring. The problem happens on every update, and I haven't seen that message again.

  Normal   Started                  7m49s                  kubelet                                Started container my-server
  Warning  Unhealthy                7m34s (x3 over 7m44s)  kubelet                                Readiness probe failed: Get "http://10.129.128.66:8080/healthz": dial tcp 10.129.128.66:8080: connect: connection refused
  Normal   LoadBalancerNegReady     7m31s                  neg-readiness-reflector                Pod has become Healthy in NEG "Key{\"k8s1-600f13cf-default-my-svc-8080-f82bf741\", zone: \"europe-west1-b\"}" attached to BackendService "Key{\"k8s1-600f13cf-default-my-svc-8080-f82bf741\"}". Marking condition "cloud.google.com/load-balancer-neg-ready" to True. 

As soon as the LoadBalancerNegReady event occurred, the 502s started happening, and kept happening (intermittently) for about 15-20 seconds.

rocketraman avatar May 19 '22 23:05 rocketraman

/kind support

A clarification question. Are you seeing that the requests are going to the pod that has been deleted when you get 502s or are seeing that the requests are going to the new pod?

I am wondering if the issue you are facing is that the old pod on termination has not been removed from the NEG yet, so there is a latency between the pod being removed in Kubernetes and the pod being removed from the NEG.

swetharepakula avatar May 20 '22 21:05 swetharepakula

A clarification question. Are you seeing that the requests are going to the pod that has been deleted when you get 502s or are seeing that the requests are going to the new pod?

I am wondering if the issue you are facing is that the old pod on termination has not been removed from the NEG yet, so there is a latency between the pod being removed in Kubernetes and the pod being removed from the NEG.

I can see that the 502 responses are interspersed with valid responses from the new pod, so I think you are right -- the load balancer 502 errors are because the load balancer is still sending requests to the pod that has terminated in k8s.

I thought the issue was with my setting of unhealthyThreshold: 10 on the health check, overriding the default of 2 causing the system to take too long to detect the terminated pod is unhealthy, but that wasn't the case. I've reset all configurations to what I believe are the defaults and still the 502s are unavoidable on a deployment update.

rocketraman avatar May 20 '22 22:05 rocketraman

Forcing the stopping container to stick around for a few extra seconds via lifecycle.preStop hook that sleeps for 30 seconds seems to be the only valid workaround I've found so far. This does also confirm the problem is with requests being sent to the terminating pod, not the new pod.

rocketraman avatar May 20 '22 23:05 rocketraman

Even lifecycle.preStop does not appear to be a 100% consistent workaround -- today during a deployment I saw two 502 errors while pinging the endpoint every 2s, even though the old pod was still in "Terminating" state due to lifecycle.preStop.

The new pod did have the error message I originally mentioned above:

  Normal   LoadBalancerNegWithoutHealthCheck  2m14s                  neg-readiness-reflector                Pod is in NEG "Key{\"k8s1-600f13cf-default-my-svc-8080-f82bf741\", zone: \"europe-west1-b\"}". NEG is not attached to any Backend
Service with health checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.

in which the "cloud.google.com/load-balancer-neg-ready" condition appears to be marked True before it should be.

So, to summarize, it seems there are actually two problems contributing to 502 errors during rolling deployments:

  1. The load balancer is hitting the old pod even though it is being shut down (using lifecycle.preStop does seem to workaround this issue). This is consistent and easily reproduced.
  2. The load balancer is hitting the new pod even though it is not ready yet due to the "NEG is not attached to any Backend Service with health checking." error. This is intermittent and more difficult to reproduce, but not uncommon.

rocketraman avatar May 26 '22 14:05 rocketraman

@rocketraman, as this pod changes are occurring, are there any node or zone changes that are occurring?

swetharepakula avatar May 27 '22 19:05 swetharepakula

@swetharepakula Yes, this cluster is an autopilot cluster. When I deploy this update, the deployment generally requires a new node to be created to host the new pod.

rocketraman avatar May 27 '22 19:05 rocketraman

In addition, usually a few minutes later the pod is migrated back to the original node, and the new node is shutdown. That process rarely completes without some 502s as well.

rocketraman avatar May 27 '22 19:05 rocketraman

@rocketraman, is that node being created in a new zone? Is the autopilot cluster a regional cluster?

swetharepakula avatar May 27 '22 20:05 swetharepakula

is that node being created in a new zone? Is the autopilot cluster a regional cluster?

@swetharepakula I haven't checked specifically on the zone of the new node. It is a regional cluster (my understanding is that all autopilot clusters are regional clusters), so I presume autopilot is indeed balancing the new node across zones.

rocketraman avatar May 27 '22 20:05 rocketraman

Yes, I just confirmed the new node is in a different zone than the existing node that hosts that pod.

rocketraman avatar May 27 '22 20:05 rocketraman

So I believe the following is what is happening:

  1. The load balancer is hitting the old pod even though it is being shut down (using lifecycle.preStop does seem to workaround this issue). This is consistent and easily reproduced.

Neg controller does immediately respond to endpoint changes. However, there can be latency due to the time it takes for a detach operation to complete. In the time it takes the detach operation to complete, the load balancer may still route traffic to the terminating pod. There are a few options to mitigate this:

  1. The lifecycle.preStop as you are already using.
  2. use terminationGracePeriodSeconds so that your application will continue to accept traffic for a little longer.
  3. Adjust health checks to be sensitive enough to recognize that the endpoint is gone. This option will only reduce the frequency of 502s but won't necessarily eliminate them.

Since you are already doing (1), that is probably the easiest approach.

2.The load balancer is hitting the new pod even though it is not ready yet due to the "NEG is not attached to any Backend Service with health checking." error. This is intermittent and more difficult to reproduce, but not uncommon.

This is a race between the Ingress controller and a workload being scheduled/started on a new node. The NEG controller sees the update before the Ingress controller and adds the endpoint. However if this is on a node in a new zone, the NEG controller creates a new NEG in the new zone. The Ingress controller then needs to attach that NEG to the backend service. However if the Ingress controller doesn't finish that before the workloads are scheduled on the node, those new pods will have their readiness gates switched to ready immediately since the NEG will not be in any BackendService.

For non auto-pilot clusters our recommendation is to reduce the number of zone changes and try to run workloads in every zone the cluster is in. Since this is an autopilot cluster, your options may be limited in this regard.

At this time we are still looking into how to make the experience better for both of these cases.

swetharepakula avatar May 27 '22 23:05 swetharepakula

  • use terminationGracePeriodSeconds so that your application will continue to accept traffic for a little longer.

FYI I actually don't think this works -- at least it didn't when I tried it. I actually wouldn't expect it to: terminationGracePeriodSeconds tells k8s that if the workload takes longer to stop, to give it that extra time before forcefully terminating the pod. Changing the value to a larger value doesn't actually delay termination if the pod exits quickly on its own, which is what we want in this case. I believe the GKE docs on this are incorrect, and should recommend the lifecycle.preStop solution instead (with a corresponding increase in terminationGracePeriodSeconds if the default isn't high enough to deal with the normal shutdown plus preStop sleep time).

For non auto-pilot clusters our recommendation is to reduce the number of zone changes and try to run workloads in every zone the cluster is in. Since this is an autopilot cluster, your options may be limited in this regard.

That is too bad, but it does suggest a work-around. Scale up the deployment so that there is at least one pod per zone. That does raise system cost significantly when only one pod is required to meet load/redundancy requirements.

Will this become the tracking issue for improving this behavior, or can you reference me the issue I should follow?

rocketraman avatar May 30 '22 18:05 rocketraman

The workaround I suggested with terminationGracePeriodSeconds is not the documented way to use the field. Typically it gives pod more time to shutdown so it can gracefully exist as you describe it. To use the workaround I suggested, you would have to modify the application shutdown logic to not immediately start shutting but continue accepting requests but failing the healthcheck for a period of time. So your terminationGracePeriodSeconds would timeToShutdown + GCE Programming latency and your application would have to wait that GCE Programming Latency before beginning shutdown operations. In comparison the lifecycle.preStop is probably easier configure.

For now we will keep this issue open to communicate updates on this front. There isn't another issue open to track this work.

swetharepakula avatar Jun 22 '22 21:06 swetharepakula

I'm also getting 502 errors during availability zone changes. I added the cloud.google.com/neg: '{"ingress":false}' annotation to my service so that the load balancer forwards to all instance groups in the cluster and I think it solves the problem relatively cleanly.

Are there drawbacks to this workaround? Any reason it wasn't suggested in this thread before?

talzion12 avatar Sep 14 '22 08:09 talzion12

The only way I found to consistently work around this issue is to set a minReadySeconds on the deployment, to force the old pods to stick around for a while after the new pods are created.

brianstorti avatar Oct 20 '22 09:10 brianstorti

I think you missed this important message which shouldn't be occurring

Normal LoadBalancerNegWithoutHealthCheck 16m neg-readiness-reflector Pod is in NEG "Key{"k8s1-600f13cf-default-my-svc-8080-f82bf741", zone: "europe-west1-c"}". NEG is not attached to any Backend Service with health checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.

This can occur If the attachement of the Pod to a NEG is done, but the backend isn't yet linked to this NEG. This is not managed by the reflector https://github.com/kubernetes/ingress-gce/blob/master/pkg/neg/readiness/reflector.go#L184

The result is that the old pod is detached and you have a wonderful backend service without any NEG and as a result 502 errors.

I also added minReadySeconds as a safety of this event to wait a bit before removing the old pod, but it isn't a 100% guaranteed success, especially if something is exceptionally slow at pod startup, as I have two backend services linked to my service, I often encounter this issue 😕.

saez0pub avatar Dec 07 '22 16:12 saez0pub

@swetharepakula what are the recommended steps to reliably recover a system experiencing this issue, separate from the preventative configuration you recommended? is deleting the service, ingress, and deployment sufficient, then reapplying those k8s resources sufficient? I'm not seeing intermittent 502s but rather persistent once the system enters this failure mode.

thankfully this is a development cluster and I have the option to completely destroy/deprovision and reprovision/deploy, but that's obviously not a viable option in a production cluster

revero-doug avatar Jan 23 '23 16:01 revero-doug

@talzion12 , in your approach, it means you have switched to using Instance Groups which is not our recommended or default stack. NEGs provide you with a container native solution, while the instance group solution is susceptible to the double hop problem. NEGs are our recommended approach.

@saez0pub , The neg controller and the Ingress controllers operate in parallel, so there can be a race when a new NEG is created but it has not been added to the backend service yet. We have released a fix as part of ingress-gce v1.20 that should ensure that NEGs are created are sooner (as soon as the node in the new zone is ready) which should hopefully give more time for the Ingress Controller to add the NEG to the backend service before the workloads are scheduled in the new zone.

@revero-doug , this does not sound like the a 502 due to a rolling development. Can you expand more on what the symptoms are, and what is occurring in the cluster? Since it is a different kind of an issue, can you open a new issue with those details. Thanks!

swetharepakula avatar Jan 23 '23 22:01 swetharepakula

@swetharepakula This started happening very frequently in one of our clusters after upgrading to 1.24. The k8s/gce-ingress have not been previously mentioned in this issue -- is it possible this race was introduced between 1.21 - 1.24?

The version mapping in the readme has not been updated in some time...is there any other way to determine what gce-ingress version we are using? Or what future k8s version the fix in v1.20 will be tied to?

bgroupe avatar Feb 06 '23 23:02 bgroupe

Hello, my version is 1.23.14-gke.1800.

As I'm using cloud native load balancing and Cluster IP, the NEG is created at each deployment. Expecting the neg creation at the node creation is not sufficient. So This race condition is very likely to happen.

saez0pub avatar Feb 20 '23 13:02 saez0pub

@swetharepakula I understand but I'd rather have slightly worse performance with the double hop than downtime with NEG unless I'm missing some other considerations.

talzion12 avatar Feb 20 '23 15:02 talzion12

You may find this GCP documentation helpful, as it describes the problems here are some possible solutions: https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#traffic_does_not_reach_endpoints

derek-gfs avatar Mar 01 '23 12:03 derek-gfs

Indeed, but this is not a stop problem, it is a start problem, kube starts to delete the old pod while the gclb has not attached the new pod to the NEG. It takes a random time to auto fix, so 60 seconds is not always sufficient.

There is obviously a problem in the ingress management. I'm praying each time I deploy, kubernetes just brings me downtime 💢 .

Can't Google estimate the rate of occurences of LoadBalancerNegWithoutHealthCheck in their gke ?

saez0pub avatar Mar 04 '23 07:03 saez0pub

Is there any update on when we might see a fix for this problem hitting GCP itself? I'm seeing 502's during rolling upgrades of pods on an autopilot cluster, even though the new pod is fully warmed up (I'm using argo rollouts to enable blue-green deployments). The preStop config does help but does not eliminate the problem. I've tried to update my number of replicas to 3 to ensure that there are replicas running in every NEG, but that also doesn't resolve the issue. I'm seeing failed_to_pick_backend errors in the load balancer logs when the 502's occur.

denismccarthykerry avatar May 25 '23 22:05 denismccarthykerry

@denismccarthykerry did you try to add a minReadySeconds of 30 seconds to your deployment ? This is in general a good workaround to avoid the 502 errors.

To solve the issue, you need to ensure you are using container native load balancer https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing Take care of the special conditions where you may need to force the configurations detailed in this doc: https://cloud.google.com/kubernetes-engine/docs/concepts/ingress#container-native_load_balancing

HTH

GuillaumeMorini avatar May 26 '23 08:05 GuillaumeMorini

We have minReadySeconds set and are using container-native load balancing, and we still see 502s every now and then.

brianstorti avatar May 26 '23 13:05 brianstorti

@denismccarthykerry did you try to add a minReadySeconds of 30 seconds to your deployment ? This is in general a good workaround to avoid the 502 errors.

To solve the issue, you need to ensure you are using container native load balancer https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing Take care of the special conditions where you may need to force the configurations detailed in this doc: https://cloud.google.com/kubernetes-engine/docs/concepts/ingress#container-native_load_balancing

HTH

Thanks for the reply Guillaume. I've tried minReadySeconds (my services were already annotated with cloud.google.com/neg: '{"ingress": true}'as they fulfilled the criteria for container-native loadbalancing) but I get the same issue. I'm baffled. I've gone through everything I can find online to resolve this, but nothing seems to work. The preStop hook at least reduces the window during which I get 502's to maybe 5 seconds, but I still get plenty of them during that time.

If I knew it was a known bug at least I could rest a little easier that it's not something I'm doing wrong myself...

denismccarthykerry avatar May 26 '23 22:05 denismccarthykerry

@denismccarthykerry Is connectionDraining.drainingTimeoutSec necessary to solve the issue? My understanding of that is that would affect existing requests, but not new ones, so it would surprise me if that has an affect on solving the 502s, unless you were only seeing the 502s for requests started before the deployment began. Though the docs do have the statement "The load balancer does not send new requests to the removed backend." I thought that was already the case without this setting as well.

I'm also a bit surprised that adding healthCheck.checkIntervalSec = 15 solves anything, as the default health check actually has a shorter interval of 5s, and so your config should actually take longer to detect a terminated node, which is the opposite of what you want here. See comment https://github.com/kubernetes/ingress-gce/issues/1718#issuecomment-1140097570.

rocketraman avatar May 26 '23 23:05 rocketraman

You're right, that was not it. It did make it appear that the issue was resolved, but only because the health check itself was not configured correctly, so the deployment never routed traffic to the new node - so obviously I didn't get the 502's on switchover. I've deleted the comment so as not to mislead people. Looks like this one is still unresolved...

denismccarthykerry avatar May 27 '23 07:05 denismccarthykerry