Health Check from LB to Fabio

Open deuch opened this issue 7 years ago • 3 comments

Hello,

The health check in Fabio is used by Consul and always returns OK.

We have a load balancer in front of Fabio (like a lot of people, I suppose) with a health check against Fabio. We cannot rely on the Fabio health check because it always returns OK as long as Fabio is up. But imagine that the Consul agent is down, or the entire Consul cluster is not in good shape. The health check provided by Fabio is OK, but you cannot reach your services (and/or you are not sure what is in your routing table).

Is it possible to add a check of the Consul agent (for example, querying /v1/status/leader) to ensure that everything is OK on the Consul side? If it is not, the Fabio health page should return an error status code (anything other than 2xx) so that the LB blacklists that instance and avoids returning errors to clients. With this, you would have a full chain of testing.

The same thing could be done with Vault as an optional check, like for Consul (but I think it is less problematic: the older services will keep working, only the new ones will not).

deuch avatar May 05 '17 07:05 deuch

What should fabio conclude when consul is down? Does that mean your services are gone as well? How often do you have issues with your consul cluster? fabio will continue to serve the last known good routing table. In the two years I've had this in prod, we didn't have issues with it. Just as with fabio, consul was one of the tools we simply forgot was there.

If you don't want to rely on fabio then check consul instead.

magiconair avatar May 05 '17 09:05 magiconair

OK, the Consul cluster doesn't fail easily (we have had issues, but fine). The Consul agent (client) can fail, though, and we only expose it on localhost on the HTTPS port (8543). So a LB cannot reach it to check its state ...

If we change that, we need to run an HTTPS health check against the Consul agent, and our load balancer does not handle HTTPS health checks (it would need our root CA and/or certificates in it, and that is a nightmare to manage).

And yes, I prefer to disable a Fabio that cannot access Consul. If a route changes during the outage, the client will get errors from that Fabio, which is bad because another Fabio instance could have responded correctly ... If the LB discards the Fabio instance without Consul, traffic will be redirected to the good one.

I could add a route or something like that to check against, but a user could delete it and make that Fabio fail its check ... (and it would be HTTPS only). I don't want to open both HTTP and HTTPS (only HTTPS).

deuch avatar May 05 '17 10:05 deuch

I have an example use case here as well. We have several thousand services and about 7k routes. When a fabio restarts (or starts up, in the case of a new VM), that fabio goes healthy in consul because /health begins responding immediately. Then that fabio starts receiving production traffic and responds with 404s because the route table isn't populated yet. It seems to take about 60 seconds for the route table to populate. It would be nice if /health responded OK only once the full startup had finished. @leprechau, wanted to see if you had some thoughts on this one. Thanks!

codyja avatar Feb 27 '20 19:02 codyja