contour
Provide configuration option for envoy panic threshold
Currently, contour simply sets the panic threshold to 0, which means that if all backend pods report as unhealthy, incoming traffic will fail with a 503. For my use case of contour, this is undesirable. Keeping the default panic threshold at 0 and providing an override option would enable my use case, and that of any other contour user who does not want to return 503s in a panic scenario.
See related issue: https://github.com/projectcontour/contour/issues/579
Hey @nwohlgemuth, would you want to set a default for ALL clusters, or be able to configure it per cluster?
What would the difference be? How could the default be for all clusters instead of per cluster?
It depends on where you might define this configuration. For example, we could define a default for all clusters in the config file, then allow each service to override the global default and define their own.
I'm trying to determine what configuration points might need to be exposed for your needs.
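To make the "global default with per-service override" idea concrete, here is a minimal Go sketch. The function name and the use of a nil pointer to mean "no override" are assumptions for illustration only; nothing here is an existing Contour API.

```go
package main

import "fmt"

// effectivePanicThreshold is a hypothetical helper: it returns the
// per-service override when one is set, and the global default otherwise.
// A nil override means the service did not configure its own value.
func effectivePanicThreshold(globalDefault uint32, override *uint32) uint32 {
	if override != nil {
		return *override
	}
	return globalDefault
}

func main() {
	global := uint32(0)      // the value contour sets today
	perService := uint32(50) // an illustrative per-service override

	fmt.Println(effectivePanicThreshold(global, nil))         // falls back to the global default
	fmt.Println(effectivePanicThreshold(global, &perService)) // the override wins
}
```

Using a pointer rather than a plain integer keeps "not set" distinct from an explicit 0, which matters if a service wants to force 0 while the global default is higher.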
It certainly would be nice to allow each service to have its own setting but even just an overall default would be very helpful.
@nwohlgemuth
Currently, contour simply sets the panic threshold to 0 which means that if all backend pods report as unhealthy, incoming traffic will fail with a 503. For my use case of contour, this is undesirable.
Could you please expand on why this doesn't work for your application and what you would prefer to see instead? Thank you.
It is unacceptable for my application to return a 503. All I need here is the ability to override the envoy panic threshold setting. It could be a global override setting and that would be sufficient.
@nwohlgemuth thank you for your reply. I'm sorry if my question appeared insensitive.
As I understand it, iff you have active health checking configured and that health checking reports that 100% of the cluster members are unhealthy, Envoy will return a 50x. We configure the panic threshold to 0 because, for reasons which are known only to the Envoy developers, in the situation described above Envoy will ignore its own health checks and use backends marked as unhealthy. Setting the panic threshold to 0 disables this questionable behaviour and causes Envoy to return a 50x immediately, rather than forwarding the request to an unhealthy backend, which will likely fail with a 50x anyway.
With that background, if your health checks have placed all the backends into an unhealthy state, how can Envoy return a non-50x status code?
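For readers following along, the behaviour under discussion can be sketched in Go roughly as below. This is a simplified model, not Envoy's actual implementation: it ignores priorities, degraded hosts, and locality, and the `host` type and `candidates` function are made up for the example.

```go
package main

import "fmt"

// host is a hypothetical, minimal view of an upstream endpoint.
type host struct {
	addr    string
	healthy bool
}

// candidates returns the hosts a load balancer may pick from.
// If the share of healthy hosts drops below thresholdPercent, health
// status is disregarded and every host is eligible ("panic mode").
// A threshold of 0 can never be undercut, so panic mode is disabled.
func candidates(hosts []host, thresholdPercent float64) []host {
	if len(hosts) == 0 {
		return nil
	}
	var healthy []host
	for _, h := range hosts {
		if h.healthy {
			healthy = append(healthy, h)
		}
	}
	healthyPercent := 100 * float64(len(healthy)) / float64(len(hosts))
	if healthyPercent < thresholdPercent {
		return hosts // panic mode: ignore health checks
	}
	return healthy
}

func main() {
	allDown := []host{{"10.0.0.1:8080", false}, {"10.0.0.2:8080", false}}

	// Threshold 0: no candidates, so the proxy would answer with a 503.
	fmt.Println(len(candidates(allDown, 0)))
	// Threshold 50: panic mode, both unhealthy hosts remain candidates.
	fmt.Println(len(candidates(allDown, 50)))
}
```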
@nwohlgemuth gentle ping
@nwohlgemuth gentle ping
A backend service can be configured to return unhealthy when probed but still serve new requests. The goal of this approach would be to take a backend out of the pool of available backends when it gets too hot but if extenuating circumstances require, it would still be able to take on more traffic.
As I said, it is critical that our service do absolutely everything it can to avoid ever returning a 503 to a client. My perspective as a developer here is that I should be able to configure contour (well, envoy) to work how I want it to. I totally get your perspective on why that is normally undesirable. All I'm asking for here is the ability to turn a knob that envoy already has.
Thank you for your reply. I think I understand what you are asking for. You would like the option to raise the panic threshold so that, if panic mode is triggered by Envoy's health checks, Envoy will forward requests to a backend which has failed a health check, in the expectation that it will still be able to service a request.
Is that correct?
(Sent by email in reply to Nathan Wohlgemuth's comment of 21 Nov 2019, quoted above.)
Yes, I would like the option to raise the panic threshold.
Probably the best way to add this would be a new field on the health check, called (unimaginatively) panicThreshold. It should default to 0, which both disables panic mode and matches the current hard-coded value.
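A rough Go sketch of what such a field might look like follows. The `HealthCheck` type here is a trimmed-down stand-in invented for illustration, not contour's actual type, and the 0 to 100 percent range is an assumption based on the threshold being a percentage.

```go
package main

import (
	"errors"
	"fmt"
)

// HealthCheck is a hypothetical, simplified health check policy.
type HealthCheck struct {
	// Path is the HTTP path probed on each backend.
	Path string
	// PanicThreshold is a percentage from 0 to 100. The Go zero value, 0,
	// disables panic mode, so omitting the field keeps today's behaviour.
	PanicThreshold uint32
}

// Validate rejects thresholds outside the assumed 0 to 100 range.
func (hc HealthCheck) Validate() error {
	if hc.PanicThreshold > 100 {
		return errors.New("panicThreshold must be between 0 and 100")
	}
	return nil
}

func main() {
	unset := HealthCheck{Path: "/healthz"}
	fmt.Println(unset.PanicThreshold, unset.Validate())

	tooHigh := HealthCheck{Path: "/healthz", PanicThreshold: 150}
	fmt.Println(tooHigh.Validate())
}
```

Because the zero value already means "disabled", existing configurations would not change behaviour when the field is introduced.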
Sounds like this could also be done via degraded endpoint load balancing: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/degraded
Yeah, that does seem like it would help here as well. Probably, having the option to raise the panic threshold is okay, as long as we call out that you need to be sure it's doing what you think it is.
Just wanted to put another use case in the ring here for this feature: we would like to mark data plane nodes as unhealthy if their last successful update against our control plane gets too far out of date. In the case that there is a full outage of the control plane though, this puts us in a situation where all of our nodes will be marked as unhealthy and taken out of rotation. Using the panic threshold here would allow us to cleanly handle this case.