
Fail Open when all backends fail healthchecks

Open thrawn01 opened this issue 1 year ago • 4 comments

Welcome!

  • [X] Yes, I've searched similar issues on GitHub and didn't find any.
  • [X] Yes, I've searched similar issues on the Traefik community forum and didn't find any.

What did you expect to see?

When all backends in the load balancer fail their health checks, Traefik should provide an option to "Fail Open", such that in this scenario Traefik routes traffic to all targets regardless of health check status.
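
To make the expected behaviour concrete, here is a minimal Go sketch of the decision rule being requested; the function name, server addresses, and health map are hypothetical illustrations, not Traefik's actual API:

```go
package main

import "fmt"

// healthyOrAll returns the servers to keep in rotation. If at least one
// server passes its health check, only the healthy ones are returned;
// if every server fails, the full list is returned ("fail open") so the
// site stays reachable instead of returning errors for every request.
func healthyOrAll(servers []string, healthy map[string]bool) []string {
	var up []string
	for _, s := range servers {
		if healthy[s] {
			up = append(up, s)
		}
	}
	if len(up) == 0 {
		return servers // fail open: no healthy servers, route to everything
	}
	return up
}

func main() {
	servers := []string{"http://10.0.0.1:8080", "http://10.0.0.2:8080"}
	// Both backends report unhealthy, e.g. after a bad deploy broke /health.
	health := map[string]bool{"http://10.0.0.1:8080": false, "http://10.0.0.2:8080": false}
	fmt.Println(healthyOrAll(servers, health)) // prints both servers
}
```

If at least one backend is healthy, behaviour is unchanged; fail open only changes what happens when the healthy set would otherwise be empty.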

Purpose

Incorrect assumptions on the backend's part about what constitutes a failed /health endpoint may result in a false report of its status for all backends. When this happens, the site will become unavailable until the condition clears or the health status is fixed.

This could be caused by:

  • A bad deploy that causes the code behind the /health check condition to return a non-200 every time.
  • The metrics behind the health check condition changing in an unexpected way (thus causing the /health endpoint to return a non-200).
  • A configuration mistake that makes the /health endpoint unreachable.
  • My cat removed the /health endpoint from the back end service because they didn't think it was necessary.
  • Various other imagined mistakes that could be made by kitty cats, in their pursuit of world domination.

This feature is already available on many cloud-based load balancers:

  • https://www.wellarchitectedlabs.com/reliability/300_labs/300_health_checks_and_dependencies/4_fail_open/
  • https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html
  • https://cloud.google.com/load-balancing/docs/internal#traffic_distribution

Proposal

~We modify the code in pkg/healthcheck/healthcheck.go:HealthCheck.checkServersLB() to iterate through all the health check URLs to determine backend health. If the FailOpen health check config option is true, we compare the number of proposed disabledURLs to the total number of enabledURLs; if they are equal, we call UpsertServer() on all disabledURLs.~ See comment below

We at https://github.com/mailgun/ are offering to contribute this work. This issue is to solicit feedback on the usefulness of this feature to the community, correct any assumptions we have about the implementation, and understand if there is something we missed that might also be useful.

Thank you for a wonderful product!

If need be, I can be contacted directly on gophers.slack.com as @Derrick Wippler or via email derrick at mailgun.com

thrawn01 avatar Aug 19 '23 16:08 thrawn01

After examining the code again, I have re-evaluated the implementation strategy.

The current code, while not incorrect, has a suboptimal limitation: all the checks are performed synchronously. With a large pool of backends, health checks on all backends might not complete until after the next check interval is scheduled. While this is not exactly a problem, it might not be what the user expects. It can be exacerbated if the health check endpoints on the target servers are non-trivial, take some time to complete, and the defined timeout of the health checks is quite long.

Also, the current code disables or removes a server from the server pool as soon as its health check fails, which is generally good! However, it greatly complicates the algorithm needed to support the fail open feature. It would be much simpler to evaluate the status of all the health checks as a whole, and then decide which servers to remove from the backend pool.

If we kept the existing algorithm of removing each server from the pool as it fails, and there were some error or issue causing the health check to return a non-200, then in the case where all backend servers fail their health checks the result would be a gradual draining of backends from the pool until it is empty and the fail open scenario is triggered, at which point the pool would refill with all the previously disabled backend servers. This outcome is suboptimal for sites that wish to maintain a consistent quality of service throughout.

If, however, we switched to a "health check all the backend servers first" approach, we could then evaluate whether fail open has been triggered. If it has not, we can disable the servers that failed their health check and re-enable those that are no longer failing.

The issue with the latter is that we end up waiting until all the checks are complete before disabling backend servers (which is suboptimal).

A compromise would be to perform all the health checks for a backend asynchronously (run each check in its own goroutine). In this way, we can quickly perform all the health checks simultaneously and arrive at a complete picture of the health of all the backend servers. At that point we can evaluate which servers should be disabled or enabled, and whether the fail open scenario should be triggered. This proposed implementation would not evaluate the status of the backend servers until all health checks have completed successfully or timed out within their respective goroutines.
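
As a rough sketch of that shape (illustrative only, not Traefik's healthcheck code; the checkAll helper, URLs, and timeout handling are assumptions for the example), the checks could be fanned out into goroutines and evaluated only once every check has returned or timed out:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// checkAll probes every backend's health endpoint concurrently and only
// evaluates the results once all checks have returned or timed out.
// The fail-open rule is applied to the complete picture: if every check
// failed, all backends are kept in rotation.
func checkAll(backends []string, timeout time.Duration) (healthy []string, failedOpen bool) {
	results := make([]bool, len(backends))
	var wg sync.WaitGroup

	for i, url := range backends {
		wg.Add(1)
		go func(i int, url string) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			defer cancel()
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
			if err != nil {
				return
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return // network error or timeout: leave results[i] false
			}
			resp.Body.Close()
			results[i] = resp.StatusCode == http.StatusOK
		}(i, url)
	}
	wg.Wait() // decide nothing until every check has completed or timed out

	for i, ok := range results {
		if ok {
			healthy = append(healthy, backends[i])
		}
	}
	if len(healthy) == 0 {
		return backends, true // fail open: keep the whole pool
	}
	return healthy, false
}

func main() {
	healthy, failedOpen := checkAll([]string{"http://10.0.0.1/health", "http://10.0.0.2/health"}, 2*time.Second)
	fmt.Println(healthy, failedOpen)
}
```

With the complete result set in hand, the fail open decision becomes a single length check instead of bookkeeping around incremental removals.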

This solution greatly simplifies the algorithm used to evaluate health checks and the fail open scenario, while improving the situation when Traefik must perform health checks against a large number of backend servers.

I realize I've mostly been thinking of the use case where Traefik is used as an API gateway and not as a service mesh. However, as health checks are an opt-in feature, I think this still makes sense to pursue.

Thoughts?

thrawn01 avatar Aug 22 '23 18:08 thrawn01

Hello @thrawn01,

Thank you for the suggestion. We need more time to check it and investigate.

We'll keep you updated as soon as we have moved forward on it.

nmengin avatar Aug 28 '23 09:08 nmengin

Running into a similar problem. It'd be a nice addition to have a flag that tells Traefik to ignore a specific container's health state and route to that container anyway (if no alternative routes exist). However, this may not solve everyone's use cases.

RoboMagus avatar Dec 19 '23 13:12 RoboMagus

Hello @thrawn01 , and thanks for your suggestion.

We are interested in this issue and the traction it will receive. We will leave the status as kind/proposal to give the community time to let us know if they would like this idea. We will reevaluate as people answer.

The conversation is time-boxed to six months.

lbenguigui avatar Feb 19 '24 15:02 lbenguigui