
Document how to avoid 502s

Open bowei opened this issue 8 years ago • 32 comments

From @esseti on September 20, 2017 9:22

Hello, I have a problem with the Ingress: the 502 page pops up when there are "several" requests. I have JMeter spinning 10 threads 20 times, and I get the 502 more than 50 times over 2,000 calls in total (less than 0.5%).

Reading the README, it says that this error is probably due to:

The loadbalancer is probably bootstrapping itself.

But the load balancer is already there, so does that mean all the pods serving that URL are busy? Is there a way to avoid the 502 while waiting for a pod to become free?

If not, is there a way to customize the 502 page? I expose APIs in JSON format, and I would like to return a JSON error rather than an HTML page.

Copied from original issue: kubernetes/ingress-nginx#1396

bowei avatar Oct 11 '17 17:10 bowei

From @nicksardo on September 26, 2017 16:12

https://serverfault.com/questions/849230/is-there-a-way-to-use-customized-502-page-for-load-balancer This is a question for GCP, not the ingress controller. Though, I suggest you investigate why you're getting 502s.

bowei avatar Oct 11 '17 17:10 bowei

From @esseti on September 29, 2017 9:53

Regarding the frequent 502s, I found out that it's due to how long the LB keeps the connection alive versus the keepalive timeout the container uses. It's explained here (point 3): https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340 — in short, the backend's keepalive timeout must be longer than the load balancer's 600-second keepalive, hence the recommended keepalive_timeout of 650 in nginx.

In my case I also added a 5s timeout to the probe; I'm not sure, but that solved the 502s (I'm using uWSGI).

bowei avatar Oct 11 '17 17:10 bowei

I also get Google's 502 HTML error page and would like to understand why, and how to avoid it or customize the response. The backend pods have been running without restarts, but still maybe 1 in 1,000 requests returns a 502. Using GKE with an Ingress that sends traffic to an API pod running nginx.

montanaflynn avatar Oct 27 '17 00:10 montanaflynn

@montanaflynn have you tried this (point 3)? https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340 (I also had to increase the probe timeout; my comments are on that page.) I've solved the problem with the 502s. I'm still not able to modify the error page, but that isn't part of Kubernetes.

esseti avatar Oct 27 '17 12:10 esseti

Using GKE 1.7.8 (Google Cloud).

I'm getting these too, and the backend services show 2/2 cluster health and green. Using Ingress with kube-lego and gce for TLS provisioning. One app served by the Ingress (name-based virtual hosts from a single Ingress) works fine, but the other app returns a 502 on every other request, preventing QA. There were zero issues with either app during initial QA behind a LoadBalancer service. I changed to NodePort services, added an Ingress with TLS in front of them, and am now plagued with 502 errors.

I updated the liveness and readiness probes and confirmed 200 responses both at the probe URI and at /, but I'm still getting these errors.
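For reference, a minimal sketch of the kind of probe settings being discussed, assuming an app that serves HTTP 200 on / (paths, ports, and timings here are illustrative, not taken from this setup; the GCE controller can derive its LB health check from the readiness probe):

readinessProbe:
  httpGet:
    path: /            # must return HTTP 200, or the backend is marked unhealthy
    port: 80
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5    # the 5s probe timeout mentioned earlier in the thread
livenessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 15
  periodSeconds: 20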

Redacted Ingress config:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: staging-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "kubernetes-ingress-stg"
    kubernetes.io/tls-acme: "true"
    kubernetes.io/ingress.class: "gce"
spec:
  tls:
  - hosts:
    - eval.redacted-site1.com
    - eval.redacted-site2.com
    secretName: legacy-tls
  rules:
  - host: eval.redacted-site1.com
    http:
      paths:
      - path: /*
        backend:
          serviceName: site1-app
          servicePort: 80
  - host: eval.redacted-site2.com
    http:
      paths:
      - path: /*
        backend:
          serviceName: site2-app
          servicePort: 80

mikesparr avatar Oct 28 '17 03:10 mikesparr

@esseti I tried increasing the timeout as suggested but still get 502s.

Also, like @mikesparr, we're using TLS with the Ingress (not ACME) and NodePort services.

montanaflynn avatar Oct 28 '17 03:10 montanaflynn

I increased the keepalive too, per the recommendation, and it didn't fix it.

mikesparr avatar Oct 28 '17 06:10 mikesparr

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Jan 26 '18 06:01 fejta-bot

We're also having these issues. Using NodePort on our services, and a TLS Ingress with kube-lego.

We noticed the 502s right after this message showed up in kubectl get events:

2m          17d          2637     ingressID                                  Ingress                                                     Normal    Service                 loadbalancer-controller                            default backend set to serviceID

Any ideas how to figure out what is causing these 502s?

lunemec avatar Feb 23 '18 09:02 lunemec

/remove-lifecycle stale

lunemec avatar Feb 23 '18 09:02 lunemec

I'm seeing these too. Using 1.8.6 with Kubefed, trying to set up a federated Ingress (which I think I got set up), but I now keep getting 502s, and there's nothing to log/debug except Stackdriver, which shows the 502s.

My backends occasionally show "unhealthy", though, for some reason...

Even though I've fulfilled this requirement: "Services exposed through an Ingress must serve a response with HTTP 200 status to the GET requests on / path. This is used for health checking. If your application does not serve HTTP 200 on /, the backend will be marked unhealthy and will not get traffic." https://cloud.google.com/kubernetes-engine/docs/tutorials/http-balancer

gylu avatar Mar 30 '18 23:03 gylu

I increased the VM size and the errors subsided. It appears the containers would crash due to memory limits; then, as new ones spun up, the health check failed and the Ingress served up 502s.

No smoking gun, but that solved it for me. You may be under-provisioned.
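For what it's worth, explicit resource requests and limits make this failure mode visible as OOMKilled restarts rather than mystery crashes. A sketch with placeholder values (tune them to your app's footprint):

resources:
  requests:
    cpu: 250m          # the scheduler reserves this much per pod
    memory: 512Mi
  limits:
    memory: 1Gi        # the container is OOM-killed above this, visible in pod status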


mikesparr avatar Mar 31 '18 05:03 mikesparr

The simple way to avoid 502s: set up a cluster that hosts only your app and does not use preemptible nodes or node-pool autoscaling, and schedule app downtime for node upgrades.

If you want to avoid 502s and also want cluster autoscaling, preemptible nodes, or zero downtime, you probably need to switch to the nginx ingress controller. Its L7 load balancer lives in the cluster and can respond faster and more proactively to events occurring in the cluster. The built-in retry logic also helps.

The GCE ingress controller creates an L7 load balancer that communicates with a Kubernetes NodePort service. If you use the default settings for your service, externalTrafficPolicy will be set to Cluster, meaning every node will forward requests to the nodes that host the pods backing the service (which I'll just call your app). If you leave externalTrafficPolicy=Cluster, any node can cause 502s and timeouts, even if it is not running your app. Examples:

  - you take an unrelated node down for an upgrade, even if you cordon/drain it properly;
  - a random node in your cluster crashes (as @mikesparr noted, this could be a node out of memory);
  - you use preemptible nodes;
  - you have some nodes that are over-provisioned (traffic forwarding can be CPU starved?).

One final note about leaving externalTrafficPolicy=Cluster: the backend for your app will show as available with N instances, where N is the number of nodes in your cluster, and it will show as available in every zone where you have nodes. This is misleading because requests have to be served by the nodes hosting the pods, which can be an arbitrarily small subset of the cluster nodes. Maybe this situation will improve with network endpoint group support in the GCE ingress controller?

Setting externalTrafficPolicy=Local on the NodePort service will prevent nodes from forwarding traffic to other nodes. Health checks will fail for any node not hosting your app, and your load balancer backend will show the proper number of nodes serving your app and the zones they're in. This removes the 502s caused by unrelated nodes, but you will still get them if a node hosting your app is unhealthy. We've also just introduced a new failure mode: if a pod dies on a node, your service is down on that node. You can fix this by ensuring at least two pods run on each node where your service lives. You may also want to set strategy.rollingUpdate.maxUnavailable=0 on your deployment so it creates new pods before deleting old ones. The GCE ingress controller on GKE adds a minute to whatever health check interval I set, which is too slow to detect a dead node.
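A sketch of the two settings just described (the names, ports, and replica counts are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: site1-app
spec:
  type: NodePort
  externalTrafficPolicy: Local     # nodes without a local pod fail the LB health check
  selector:
    app: site1-app
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: site1-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # bring new pods up before taking old ones down
      maxSurge: 1
  selector:
    matchLabels:
      app: site1-app
  template:
    metadata:
      labels:
        app: site1-app
    spec:
      containers:
      - name: app
        image: example/app:latest  # placeholder image
        ports:
        - containerPort: 8080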

petercgrant avatar Jun 20 '18 17:06 petercgrant

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Sep 18 '18 17:09 fejta-bot

/remove-lifecycle stale

metral avatar Sep 18 '18 17:09 metral

This thread was helpful for (fingers crossed, for now) eliminating my 502s. Just in case they come back, though: is it possible to customize the response body? I've dug quite a bit without luck, so I'm guessing no, but I'm asking here to be extra sure (I noticed the title of this issue originally referenced personalizing 502s as well).

wminshew avatar Oct 11 '18 03:10 wminshew

/lifecycle frozen

bowei avatar Nov 06 '18 19:11 bowei

I believe some of these issues will be solved by the new Network Endpoint Group load balancing (https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing) due to the removed network hop through kube-proxy.
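For anyone trying it, the documented opt-in is a Service annotation. A sketch below, assuming a VPC-native GKE cluster (the service name and ports are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: site1-app
  annotations:
    cloud.google.com/neg: '{"ingress": true}'   # LB targets pod IPs directly, skipping the node hop
spec:
  selector:
    app: site1-app
  ports:
  - port: 80
    targetPort: 8080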

acasademont avatar Dec 03 '18 20:12 acasademont

Would you be able to deploy container-native load balancing alongside the nginx-ingress-controller?

Arconapalus avatar Dec 12 '18 22:12 Arconapalus

Hi, where exactly can I set the two NGINX settings described?

keepalive_timeout 650;
keepalive_requests 10000;

I have an Ingress based on nginx-ingress-controller. How exactly can I pass these to the NGINX used in the image?

stefanotto avatar Feb 12 '19 19:02 stefanotto

I believe you must create a ConfigMap; that is what overrides the Nginx settings. I'm currently using the GCE ingress, but if memory serves, that is what you need to add.
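A sketch of that ConfigMap for the kubernetes/ingress-nginx controller (the name and namespace must match whatever the controller's --configmap flag points at; the nginxinc controller linked below uses different keys):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-controller   # must match the controller's --configmap flag
  namespace: ingress-nginx
data:
  keep-alive: "650"                # seconds; rendered as keepalive_timeout
  keep-alive-requests: "10000"     # rendered as keepalive_requests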


mikesparr avatar Feb 12 '19 20:02 mikesparr

Here's an example that might get you started, Stefan: https://github.com/nginxinc/kubernetes-ingress/blob/master/docs/configmap-and-annotations.md


mikesparr avatar Feb 12 '19 20:02 mikesparr

Perfect. Thank you so much @mikesparr

stefanotto avatar Feb 12 '19 21:02 stefanotto

Is the solution here to move to the nginx ingress controller? This seems like a good workaround. Are there any downsides to doing this? Will it still work with a NodePort service?
It seems a little crazy to me that this isn't fixed in the GCE controller.

keperry avatar Mar 13 '19 21:03 keperry

Normally the 502s come from a failed health check, and I've found that tuning initialDelaySeconds on the readiness probe, etc., to provide ample time for the Docker build/deploy reduced them a lot. Furthermore, I Dockerized some legacy PHP (from a merger) that uses sessions, so the health check, in combination with out-of-memory in the pod forcing a destroy/rebuild, is the main cause.
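A sketch of that readiness-probe tuning (the delay and thresholds are app-dependent placeholders):

readinessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 30   # allow slow-starting containers to boot before the first check
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3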

The health check intervals are probably the first place to tweak, but it's trial and error depending on your app. I'm directing all our projects to leverage Docker's multi-stage builds (Docker 17.05 and later); in Node apps we saw image size drop from 225MB to 71MB, further speeding up deployments and minimizing health check timeout risk. Golang images are under 10MB in some cases, so they are awesome. ;-)

Hope that helps.


mikesparr avatar Mar 13 '19 22:03 mikesparr

Hi, I'm still experiencing this issue, although I added the upstream-keepalive-requests and upstream-keepalive-timeout settings to the ConfigMap. I actually did not set up the k8s cluster myself; a ConfigMap named nginx-ingress-controller was already present, so I just assumed it was used by the nginx ingress. But now I'm not so sure it is. How can I make sure the ConfigMap is in use and the settings are actually picked up by Nginx? Sadly, I find the documentation far from clear on how the ConfigMap is connected to the nginx-ingress load balancer.

stefanotto avatar Apr 11 '19 18:04 stefanotto

There are some things you could try:

  1. use NEG (Network Endpoint Groups)
  2. graceful shutdown using a preStop hook (see the sketch below)
  3. turn HTTP keep-alive off
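A sketch of item 2, with placeholder names and timings; the sleep must finish before the grace period expires:

spec:
  terminationGracePeriodSeconds: 60   # must exceed the preStop sleep
  containers:
  - name: app
    image: example/app:latest         # placeholder image
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 30"]   # keep serving while the LB stops sending new traffic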

axot avatar Apr 12 '19 01:04 axot

Ref: https://github.com/kubernetes/ingress-gce/issues/769

freehan avatar Jul 26 '19 16:07 freehan

I sent feedback suggesting that the requirement for containers to return 200 at / be documented here: https://cloud.google.com/kubernetes-engine/docs/how-to/load-balance-ingress. That would improve understanding on that point, I think. Perhaps more people can do the same so it gets added there.

VGerris avatar May 14 '20 09:05 VGerris

Changing externalTrafficPolicy from Local to Cluster fixed seemingly random 502 errors with low- or single-replica deployments for one project.


mikesparr avatar May 14 '20 14:05 mikesparr