contour Provide configuration option for envoy panic threshold

Currently, contour simply sets the panic threshold to 0, which means that if all backend pods report as unhealthy, incoming traffic will fail with a 503. For my use case of contour, this is undesirable. Keeping the default panic threshold at 0 and providing an override option would enable my use case, and that of any other contour user who does not want to return 503s in a panic scenario.

See related issue: https://github.com/projectcontour/contour/issues/579

Nov 02 '19 19:11 nwohlgemuth

Hey @nwohlgemuth would you want to set a default for ALL clusters or be able to configure per cluster?

Nov 04 '19 19:11 stevesloka

What would the difference be? How could the default be for all clusters instead of per cluster?

Nov 05 '19 05:11 nwohlgemuth

It depends on where you might define this configuration. For example, we could define a default for all clusters in the config file, then allow each service to override the global default and define their own.

I'm trying to determine what configuration points might need to be exposed for your needs.
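
The "global default plus per-service override" idea above could be resolved along these lines. This is only a sketch: `resolvePanicThreshold` and its parameters are hypothetical names used for illustration, not part of Contour's configuration API.

```go
package main

import "fmt"

// resolvePanicThreshold is a hypothetical helper: it returns the per-service
// override when one is set, and falls back to the global default otherwise.
// A nil override means "the service did not configure a value".
func resolvePanicThreshold(globalDefault uint32, serviceOverride *uint32) uint32 {
	if serviceOverride != nil {
		return *serviceOverride
	}
	return globalDefault
}

func main() {
	override := uint32(50)

	// No override: the global default (0, today's behaviour) applies.
	fmt.Println(resolvePanicThreshold(0, nil))

	// Override set: this one service raises its threshold to 50%.
	fmt.Println(resolvePanicThreshold(0, &override))
}
```

Using a pointer for the override keeps "unset" distinct from an explicit 0, so a service could still opt out of a non-zero global default.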

Nov 05 '19 17:11 stevesloka

It certainly would be nice to allow each service to have its own setting but even just an overall default would be very helpful.

Nov 07 '19 04:11 nwohlgemuth

@nwohlgemuth

Currently, contour simply sets the panic threshold to 0 which means that if all backend pods report as unhealthy, incoming traffic will fail with a 503. For my use case of contour, this is undesirable.

Could you please expand on why this doesn't work for your application and what you would prefer to see instead? Thank you.

Nov 08 '19 08:11 davecheney

It is unacceptable for my application to return a 503. All I need here is the ability to override the envoy panic threshold setting. It could be a global override setting and that would be sufficient.

Nov 11 '19 04:11 nwohlgemuth

@nwohlgemuth thank you for your reply. I'm sorry if my question appeared insensitive.

As I understand it, iff you have active health checking configured and that health checking reports that 100% of the cluster members are unhealthy, Envoy will return a 50x. We configure the panic threshold to 0 because, for reasons which are known only to the Envoy developers, in the situation described above Envoy will by default ignore its own health checks and use backends marked as unhealthy. Setting the panic threshold to 0 disables this questionable behaviour and causes Envoy to return a 50x immediately, rather than forwarding the request to an unhealthy backend, which will likely fail with a 50x anyway.

With that background, if your health checks have placed all the backends into an unhealthy state, how can Envoy return a non-50x status code?
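
The host-selection behaviour described above can be sketched roughly as follows. This is a simplified model, not Envoy's actual implementation: the `Host` type and `eligibleHosts` function are hypothetical, and the only rule taken from Envoy is that panic mode engages when the healthy percentage drops below the panic threshold.

```go
package main

import "fmt"

// Host is a hypothetical stand-in for an upstream endpoint.
type Host struct {
	Name    string
	Healthy bool
}

// eligibleHosts returns the hosts a load balancer may pick from, and whether
// panic mode is active. In panic mode health status is ignored and every host
// is eligible. With a threshold of 0 panic mode can never engage, so an
// all-unhealthy cluster yields no eligible hosts (the 503 case).
func eligibleHosts(hosts []Host, panicThresholdPercent float64) ([]Host, bool) {
	if len(hosts) == 0 {
		return nil, false
	}
	var healthy []Host
	for _, h := range hosts {
		if h.Healthy {
			healthy = append(healthy, h)
		}
	}
	healthyPercent := 100 * float64(len(healthy)) / float64(len(hosts))
	if healthyPercent < panicThresholdPercent {
		return hosts, true // panic mode: ignore health checks
	}
	return healthy, false
}

func main() {
	// Every backend has failed its health check.
	hosts := []Host{{"a", false}, {"b", false}}
	for _, threshold := range []float64{0, 50} {
		picked, panicking := eligibleHosts(hosts, threshold)
		fmt.Printf("threshold=%v panic=%v eligible=%d\n", threshold, panicking, len(picked))
	}
}
```

With the threshold at 0 nothing is eligible, so the request fails; with it raised to 50 both unhealthy hosts become eligible again.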

Nov 11 '19 11:11 davecheney

@nwohlgemuth gentle ping

Nov 18 '19 00:11 davecheney

@nwohlgemuth gentle ping

Nov 19 '19 22:11 davecheney

A backend service can be configured to return unhealthy when probed but still serve new requests. The goal of this approach would be to take a backend out of the pool of available backends when it gets too hot but if extenuating circumstances require, it would still be able to take on more traffic.

As I said, it is critical that our service do absolutely everything it can to avoid ever returning a 503 to a client. My perspective as a developer here is that I should be able to configure contour (well, envoy) to work how I want it to. I totally get your perspective on why that is normally undesirable. All I'm asking for here is the ability to turn a knob that envoy already has.

Nov 21 '19 03:11 nwohlgemuth

Thank you for your reply. I think I understand what you are asking for. You would like the option to raise the panic threshold so that, if panic mode is triggered by Envoy's health checks, Envoy will ignore the health check results and forward requests to a backend which has failed a health check, in the expectation that it will still be able to service a request.

Is that correct?

Nov 21 '19 03:11 davecheney

Yes, I would like the option to raise the panic threshold.

Nov 21 '19 04:11 nwohlgemuth

Probably the best way to add this would be a new field on the health check, probably called (unimaginatively) panicThreshold. This should default to 0, which both disables panic mode and matches the current hard-coded value.
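
As a rough illustration of that proposal, the field and its default might look like the sketch below. Only the name `panicThreshold` comes from the suggestion above; the `HealthCheckPolicy` struct shown here, the 0 to 100 percentage range, and the `effectivePanicThreshold` helper are assumptions made for the example, not Contour's actual types.

```go
package main

import "fmt"

// HealthCheckPolicy is a hypothetical, trimmed-down health check config.
// Only the proposed field is shown.
type HealthCheckPolicy struct {
	// PanicThreshold is assumed to be a percentage in [0, 100].
	// nil means "not set", which falls back to 0 (panic mode disabled).
	PanicThreshold *uint32
}

// effectivePanicThreshold applies the default of 0 and rejects values that
// cannot be a percentage.
func effectivePanicThreshold(p HealthCheckPolicy) (uint32, error) {
	if p.PanicThreshold == nil {
		return 0, nil
	}
	if *p.PanicThreshold > 100 {
		return 0, fmt.Errorf("panicThreshold must be between 0 and 100, got %d", *p.PanicThreshold)
	}
	return *p.PanicThreshold, nil
}

func main() {
	raised := uint32(30)
	for _, p := range []HealthCheckPolicy{{}, {PanicThreshold: &raised}} {
		v, err := effectivePanicThreshold(p)
		fmt.Println(v, err)
	}
}
```

Defaulting an unset field to 0 keeps existing configurations behaving exactly as they do with the hard-coded value.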

Nov 21 '19 09:11 davecheney

Sounds like this could also be done via degraded endpoint load balancing: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/degraded

May 24 '21 21:05 sunjayBhatia

Yeah, that does seem like it would help here as well. Probably, having the option to raise the panic threshold is okay, as long as we call out that you need to be sure it's doing what you think it is.

May 26 '21 04:05 youngnick

Just wanted to put another use case in the ring here for this feature: we would like to mark data plane nodes as unhealthy if their last successful update against our control plane gets too far out of date. In the case that there is a full outage of the control plane though, this puts us in a situation where all of our nodes will be marked as unhealthy and taken out of rotation. Using the panic threshold here would allow us to cleanly handle this case.

Jun 17 '21 09:06 rsyvarth
