
Best practices for sticky hot-spare configuration

sgrimm-sg opened this issue 8 years ago • 9 comments

We're using Fabio in front of a microservice that runs on a single node but has a hot spare on another node for failover. The goal is for Fabio to normally route 100% of requests to the primary node and switch over to the spare when the primary fails its health check. Once a failover has happened it should be sticky: the node that used to be the primary should be treated as the hot spare when it comes back online.

It seems like there are a few different ways to handle this in Fabio, and it would be great to have some guidance on the best approach.

Our current solution is to have the service register itself in Consul without a urlprefix- tag at startup. When it detects that it's the primary node (either because it's the first one running or because the primary has gone down) it reregisters with the urlprefix- tag.
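
Concretely, the reregistration is just another service registration with the same ID; a rough sketch with the Consul Go client (service name, port, route and check endpoint are made up):

```go
package failover

import "github.com/hashicorp/consul/api"

// registerService (re-)registers the service with the local Consul agent.
// Registering again with the same ID overwrites the previous registration,
// so toggling the urlprefix- tag is a single call. Service name, port and
// route are illustrative.
func registerService(agent *api.Agent, primary bool) error {
	tags := []string{}
	if primary {
		tags = append(tags, "urlprefix-/myservice")
	}
	return agent.ServiceRegister(&api.AgentServiceRegistration{
		ID:   "myservice-1",
		Name: "myservice",
		Port: 8080,
		Tags: tags,
		Check: &api.AgentServiceCheck{
			HTTP:     "http://127.0.0.1:8080/health",
			Interval: "10s",
			Timeout:  "1s",
		},
	})
}
```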

https://github.com/hashicorp/consul/issues/1048 would be a clean solution here but in the meantime perhaps there's a better way of doing this than the one we settled on. It would be nice to avoid having to reregister the service.

sgrimm-sg avatar Feb 14 '17 15:02 sgrimm-sg

The quickest thing I can think of is to register a second health check with both services, one which is green only for the active instance. Then fabio should behave the way you want.
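
One way to wire that up is a TTL check attached to the existing service registration, which only the active instance keeps refreshing; a rough sketch with the Go client (check IDs and intervals are made up):

```go
package failover

import "github.com/hashicorp/consul/api"

// addActiveCheck attaches a second, TTL-style check to an already
// registered service. Only the active instance keeps refreshing it,
// so it stays critical on the hot spare. IDs are illustrative.
func addActiveCheck(agent *api.Agent, serviceID string) error {
	return agent.CheckRegister(&api.AgentCheckRegistration{
		ID:        serviceID + ":active",
		Name:      "active instance",
		ServiceID: serviceID,
		AgentServiceCheck: api.AgentServiceCheck{
			TTL: "10s",
		},
	})
}

// markActive is called periodically by the current primary to keep the
// check green; the hot spare simply never calls it.
func markActive(agent *api.Agent, serviceID string) error {
	return agent.PassTTL(serviceID+":active", "leader")
}
```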

magiconair avatar Feb 14 '17 16:02 magiconair

Ah, that's an interesting idea. I think a variant of it might be better: rather than a second health check for the same service, instead register a second service with its own health check and tag that one with urlprefix-. That way the service for the hot spare node still shows up as healthy in Consul (meaning it can do stuff like create Consul sessions).
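
Roughly, I'd expect that second registration to look something like this (names, port, route and the health endpoint are made up, untested):

```go
package failover

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

// registerRoutedService registers a second service whose only job is to
// carry the urlprefix- tag for fabio. Its health check passes only while
// this instance is the primary, while the "real" service registration
// stays healthy on both nodes. Names, port, route and the health endpoint
// are illustrative.
func registerRoutedService(agent *api.Agent, port int) error {
	return agent.ServiceRegister(&api.AgentServiceRegistration{
		ID:   "myservice-route",
		Name: "myservice-route",
		Port: port,
		Tags: []string{"urlprefix-/myservice"},
		Check: &api.AgentServiceCheck{
			// Hypothetical endpoint that returns 200 only when this
			// instance currently holds the primary role.
			HTTP:     fmt.Sprintf("http://127.0.0.1:%d/primary-health", port),
			Interval: "1s",
			Timeout:  "500ms",
		},
	})
}
```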

Thanks -- will give that a try.

sgrimm-sg avatar Feb 14 '17 18:02 sgrimm-sg

For that approach you can guard the active service with a Consul lock. In essence, you're performing a leader election and only the leader is active.
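
The Go client ships a lock helper that wraps the session/acquire dance; a rough sketch (the key name is arbitrary):

```go
package failover

import "github.com/hashicorp/consul/api"

// becomePrimary blocks until this instance holds the lock, i.e. wins the
// leader election. The key name is arbitrary; stopCh lets the caller
// abandon the attempt on shutdown.
func becomePrimary(client *api.Client, addr string, stopCh <-chan struct{}) (*api.Lock, error) {
	lock, err := client.LockOpts(&api.LockOptions{
		Key:   "service/myservice/primaryAddress",
		Value: []byte(addr),
	})
	if err != nil {
		return nil, err
	}
	// Lock blocks until acquired; the returned channel is closed if the
	// lock is later lost (session invalidated, key deleted, ...).
	lostCh, err := lock.Lock(stopCh)
	if err != nil {
		return nil, err
	}
	_ = lostCh // in a real setup you'd watch this and step down
	return lock, nil
}
```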

magiconair avatar Feb 14 '17 19:02 magiconair

Yes, that's exactly what we're doing, so our application knows internally when it has become the primary. But you can only start a session if you have a passing health check; without a healthy service to use for session establishment, the hot spare can't even attempt to acquire the Consul lock. Kind of a chicken-and-egg problem.

sgrimm-sg avatar Feb 14 '17 19:02 sgrimm-sg

Would you mind posting your experiences here?

magiconair avatar Feb 16 '17 10:02 magiconair

Yesterday I tried switching to this setup (adding a second service with the urlprefix- tag and a health check that passes only when the instance is the primary). It worked fine, but after finishing it I realized it ended up being more code than my original approach. It was also slightly slower: it couldn't switch over to the new primary until a health check ran, which might mean waiting a full health-check interval, whereas with the "reregister when you get the lock" approach the tags can be updated immediately when the lock is acquired.

What I'm currently doing looks roughly like this (a sketch of the key-watching piece follows the list). It is a little complex because I want to ensure that, when I'm doing a clean shutdown of the current primary (software upgrades, etc.), there's never a period when Fabio has nowhere to route a client request.

  • At startup:
    • Deregister health check (in case it was in critical state due to an earlier unclean shutdown)
    • Register service without routing tag
    • Create Consul session
    • Poll the value of the primaryAddress Consul key and store it locally
  • While running:
    • Attempt to acquire lock on primaryAddress with our address as the value
    • If the lock is not held:
      • If local copy of primaryAddress is set:
        • Forward incoming client requests to the primary
      • If local copy of primaryAddress is not set:
        • Queue incoming client requests for later processing
    • When the lock is initially acquired:
      • Update local copy of primaryAddress with our address
      • Reregister service with routing tag and health check
      • Process any queued-up client requests
  • When shutting down:
    • If we were the primary instance:
      • Clear our local copy of primaryAddress
      • Wait for in-progress requests to complete
      • Release the lock on primaryAddress
      • If there are other instances registered in Consul:
        • While local primaryAddress copy is not set:
          • Send an HTTP request via Fabio's proxy port to hit an API endpoint that returns the value of the local primaryAddress copy of whichever instance answers the request
          • If it is not our address, set the local primaryAddress copy to its value
    • Deregister service
    • Stop listening for requests
    • If any requests were queued, forward them to the new primary
    • Wait for forwarded requests to finish
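
For the "poll the value of the primaryAddress key" step, a Consul blocking query is enough to keep the local copy current; a rough sketch (key name as in the lock example above, error handling simplified):

```go
package failover

import (
	"sync/atomic"
	"time"

	"github.com/hashicorp/consul/api"
)

// watchPrimaryAddress keeps a local copy of the primaryAddress key up to
// date using Consul blocking queries. The key name matches the one used
// for the lock; error handling is simplified for the sketch.
func watchPrimaryAddress(kv *api.KV, local *atomic.Value, stopCh <-chan struct{}) {
	var index uint64
	for {
		select {
		case <-stopCh:
			return
		default:
		}
		pair, meta, err := kv.Get("service/myservice/primaryAddress", &api.QueryOptions{
			WaitIndex: index,
			WaitTime:  30 * time.Second,
		})
		if err != nil {
			time.Sleep(time.Second) // back off and retry
			continue
		}
		index = meta.LastIndex
		if pair == nil || pair.Session == "" {
			local.Store("") // no instance currently holds the lock
		} else {
			local.Store(string(pair.Value))
		}
	}
}
```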

The "send a request through Fabio" sequence at the end means that during shutdown, there will be a brief window when both hosts' services are tagged with urlprefix-, but there should only be one host that actually does work at any given time.

The shutdown sequence queries the primary address through Fabio rather than relying on the state of the Consul key because there is a nonzero delay between my service updating Consul and Fabio updating its routing table, and I want to make sure the old primary doesn't stop accepting work until after I've confirmed that Fabio has started sending requests to the new one; otherwise there'd be a brief period of unavailability. It's only tens of milliseconds, but my service's clients seem to excel at sending requests at the exact moment it goes offline briefly! The alternative would have been to query Fabio's routing config from Consul, but then I'd have to parse Fabio's routing table format in my application, which seemed unnecessarily complex.
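
The "confirm through Fabio" check is just a plain HTTP request against Fabio's proxy port; a rough sketch, assuming a hypothetical /primary-address endpoint on the service and a made-up Fabio address:

```go
package failover

import (
	"io"
	"net/http"
	"strings"
	"time"
)

// fabioSeesNewPrimary asks whichever instance fabio currently routes to
// for its local copy of primaryAddress. The /primary-address endpoint and
// the fabio address are assumptions for this sketch.
func fabioSeesNewPrimary(myAddr string) (bool, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://fabio.internal:9999/myservice/primary-address")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	// Routing has switched once the answering instance reports an address
	// other than our own.
	return strings.TrimSpace(string(body)) != myAddr, nil
}
```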

Feedback welcome, of course. Perhaps there's a simpler approach that would provide the same availability.

sgrimm-sg avatar Feb 16 '17 14:02 sgrimm-sg

Is there a window where both primary and secondary can handle requests? I'd guess so since you want the old primary to complete existing requests while the new primary starts handling requests.

Also, what kind of throughput are you looking at and what latency do you expect for the failover?

magiconair avatar Feb 17 '17 21:02 magiconair

I still think that adding a second health check for the service, with a short check interval (1s or 500ms), which is green only for the leader, is the simplest option. No re-registration is necessary, and since you already have the leader election code you only need to expose its status via a health endpoint.
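
That leader-status endpoint can be tiny; a rough sketch, with the handler path and port made up:

```go
package failover

import (
	"net/http"
	"sync/atomic"
)

// leaderFlag is set to 1 by the leader election code when this instance
// holds the lock and reset to 0 when it gives the lock up.
var leaderFlag atomic.Int32

// serveLeaderCheck exposes the leader status for a short-interval Consul
// HTTP check: 200 means "route to me", 503 means "I'm the hot spare".
// Path and port are illustrative.
func serveLeaderCheck() error {
	http.HandleFunc("/primary-health", func(w http.ResponseWriter, r *http.Request) {
		if leaderFlag.Load() == 1 {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	return http.ListenAndServe(":8080", nil)
}
```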

Also, I'd optimize for the normal case which is orderly failovers.

  1. Primary and secondary start up and both register in consul with the urlprefix tag and a leader check (short interval). Both checks start out failing.
  2. Leader election turns one of the health checks green, and within one check interval fabio starts routing to that instance.
  3. On shutdown the primary gives up the lock, waits N * check interval and then fails the health check. Then it waits whatever grace period you need to complete existing requests. This provides a window long enough for the routing table to be updated. If you're really concerned, you could fail the health check on the old master only after you've confirmed that the new master is accepting requests through fabio, though that might be tricky depending on your network (a sketch of this shutdown timing follows the list).
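
A rough sketch of that shutdown timing, reusing the leader flag from the sketch above (the multiplier and grace period are arbitrary):

```go
package failover

import "time"

const checkInterval = time.Second // must match the interval of the registered check

// stepDown implements the orderly failover on shutdown: give up the lock,
// keep the health check green long enough for the spare to take over and
// for fabio to reload its routing table, then go red and drain.
func stepDown(releaseLock func() error, drain func()) error {
	if err := releaseLock(); err != nil {
		return err
	}
	// Give the spare time to acquire the lock and turn its check green.
	time.Sleep(3 * checkInterval)

	// Fail our own check, then wait for the failing check to propagate so
	// fabio stops routing new requests to us.
	leaderFlag.Store(0)
	time.Sleep(2 * checkInterval)

	// Finally, complete whatever requests are still in flight.
	drain()
	return nil
}
```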

fabio will complete existing requests after switching the routing table.

magiconair avatar Feb 17 '17 21:02 magiconair

While trying to raise a ticket, I found this one. It is still open years later, and I'm not sure what the best approach is in 2020.

My question: nginx upstream supports a "backup" option. From their spec: "marks the server as a backup server. It will be passed requests when the primary servers are unavailable."

I want a similar routing policy in fabio; is that possible? I only found the weight option, but it doesn't seem like a perfect fit for this purpose.

My use case: I have a service deployed on two physical servers. One is a higher-spec machine (server A); the other is a lower-spec machine used only as a backup (server B). I want all traffic to go to server A unless server A is down.

GeniusWiki avatar Mar 19 '20 11:03 GeniusWiki