headscale
Reduce failover time for subnet routers in HA setup
Feature request
I have been testing the subnet failover feature of the HA router as described in the Tailscale documentation: https://tailscale.com/kb/1115/subnet-failover/. I noticed that when there are two routers in the subnet advertising the same routes, and one of the routers goes down, it takes approximately 1 minute and ~10-15 seconds for traffic to start flowing through the backup router. As far as I can tell, 60 seconds of this delay is due to the keepAliveInterval, which is hardcoded to 60 seconds.
https://github.com/juanfont/headscale/blob/bab4e14828e36f3bf86f3d2a8ae55b84b996a672/protocol_common_poll.go#L14
I propose making this parameter configurable, and possibly reducing the default value to 5 or 10 seconds to minimize failover time. What are your thoughts on this suggestion?
I can make a PR if you agree to move it to the config.
@vsychov sounds reasonable.
@kradalby and I are in a refactoring hackathon today, which includes a major restructuring of the repo.
I will do a PR to make keepAliveInterval configurable once we finish the code moves :)
@juanfont, I'm not sure how good of an idea this is, but it might work as well. When a connection with a client is lost, here:
https://github.com/juanfont/headscale/blob/9478c288f62b428348f57e8525126baef9955525/protocol_common_poll.go#L573-L581
We can check if we have other online nodes that announce the same route as the node with the broken connection, mark the current node as offline if other nodes are available, and switch the routes to an online node. This might lead to false positives in case of a short-term connection loss, but considering that we check the availability of other nodes, at least one node will be available.
This will help speed up the failover switch.
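Roughly what I have in mind, as a sketch only. The `Node` type, the `primaries` map, and `failoverOnDisconnect` are made-up names for illustration and do not correspond to headscale's real data model or poll handler; the point is just the check "only switch if another online node advertises the same route".

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical, simplified stand-in for a node record.
type Node struct {
	ID     int
	Online bool
	Routes []string // advertised subnet prefixes, e.g. "10.0.0.0/24"
}

// advertises reports whether n announces the given prefix.
func advertises(n Node, prefix string) bool {
	for _, r := range n.Routes {
		if r == prefix {
			return true
		}
	}
	return false
}

// failoverOnDisconnect is meant to run when the connection to failedID drops.
// primaries maps each prefix to the ID of the node currently serving it.
// A prefix is moved only if some other online node advertises it; otherwise
// it stays with the failed node. The moved prefixes are returned sorted.
func failoverOnDisconnect(nodes []Node, primaries map[string]int, failedID int) []string {
	var moved []string
	for prefix, owner := range primaries {
		if owner != failedID {
			continue
		}
		for _, n := range nodes {
			if n.ID != failedID && n.Online && advertises(n, prefix) {
				primaries[prefix] = n.ID
				moved = append(moved, prefix)
				break
			}
		}
	}
	sort.Strings(moved)
	return moved
}

func main() {
	nodes := []Node{
		{ID: 1, Online: true, Routes: []string{"10.0.0.0/24"}},
		{ID: 2, Online: true, Routes: []string{"10.0.0.0/24"}},
	}
	primaries := map[string]int{"10.0.0.0/24": 1}

	moved := failoverOnDisconnect(nodes, primaries, 1)
	fmt.Println("moved:", moved, "primaries:", primaries)
}
```

If no other online node advertises the prefix, nothing changes, so a brief connection blip on a lone router would not cause the route to be dropped.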
Any progress or update on this feature request?
I believe this will be addressed and improved when 0.23.0 lands, as part of #1564
This issue is stale because it has been open for 90 days with no activity.
Could you check this in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4?
Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?
Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?
Thanks @kradalby, I'll run tests today or tomorrow.
I believe the fixes in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha12 should resolve this issue. Let me know if not and we will reopen it.