
Reduce failover time for subnet routers in HA setup

Open vsychov opened this issue 1 year ago • 9 comments

Feature request

I have been testing the high-availability (HA) subnet router failover feature described in the Tailscale documentation: https://tailscale.com/kb/1115/subnet-failover/. I noticed that when two subnet routers advertise the same routes and one of them goes down, it takes roughly 70 to 75 seconds (1 minute plus about 10-15 seconds) for traffic to start flowing through the backup router. As far as I can tell, 60 seconds of this delay comes from keepAliveInterval, which is hardcoded to 60 seconds.

https://github.com/juanfont/headscale/blob/bab4e14828e36f3bf86f3d2a8ae55b84b996a672/protocol_common_poll.go#L14

I propose making this parameter configurable and considering a lower default value of 5 or 10 seconds to minimize failover time. What are your thoughts on this suggestion?

I can open a PR if you agree to move it to the config.
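To make the proposal concrete, here is a minimal sketch of how a configurable interval could be resolved, falling back to the current 60-second value when nothing is set. `PollConfig`, `ResolveKeepAlive`, and the one-second lower bound are hypothetical names and choices for illustration, not headscale's actual config API.

```go
package main

import (
	"fmt"
	"time"
)

// defaultKeepAliveInterval mirrors the hardcoded 60-second value mentioned in this issue.
const defaultKeepAliveInterval = 60 * time.Second

// minKeepAliveInterval is an arbitrary lower bound chosen for this sketch (assumption).
const minKeepAliveInterval = time.Second

// PollConfig is a hypothetical config struct, not headscale's real type.
type PollConfig struct {
	// KeepAliveInterval is a duration string such as "10s"; empty means "use the default".
	KeepAliveInterval string
}

// ResolveKeepAlive returns the interval to use, or an error if the configured
// value cannot be parsed or is shorter than the lower bound.
func ResolveKeepAlive(cfg PollConfig) (time.Duration, error) {
	if cfg.KeepAliveInterval == "" {
		return defaultKeepAliveInterval, nil
	}
	d, err := time.ParseDuration(cfg.KeepAliveInterval)
	if err != nil {
		return 0, fmt.Errorf("invalid keep-alive interval %q: %w", cfg.KeepAliveInterval, err)
	}
	if d < minKeepAliveInterval {
		return 0, fmt.Errorf("keep-alive interval %s is below the minimum %s", d, minKeepAliveInterval)
	}
	return d, nil
}
```

With this shape, `ResolveKeepAlive(PollConfig{KeepAliveInterval: "10s"})` yields a 10-second interval, while an empty config keeps today's behavior. Rejecting very small values is one way to avoid flooding clients with keep-alives by accident.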

vsychov avatar May 09 '23 14:05 vsychov

@vsychov sounds reasonable.

@kradalby and I are in a refactoring hackathon today, which includes a major restructuring of the repo.

I will open a PR to make keepAliveInterval configurable once we finish the code moves :)

juanfont avatar May 10 '23 08:05 juanfont

@juanfont, I'm not sure how good of an idea this is, but it might work as well. When a connection with a client is lost, here:

https://github.com/juanfont/headscale/blob/9478c288f62b428348f57e8525126baef9955525/protocol_common_poll.go#L573-L581

We can check whether other online nodes announce the same route as the node with the broken connection. If such nodes are available, we mark the current node as offline and switch the routes to an online node. This might lead to false positives after a short-term connection loss, but since we only switch when another node is online, at least one node will always be serving the route.

This will help speed up the failover switch.
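A rough sketch of the check described above, under simplifying assumptions: `Node` and `FailoverPlan` are hypothetical stand-ins rather than headscale's real types, and routes are compared as plain prefix strings.

```go
package main

// Node is a simplified, hypothetical stand-in for a headscale node.
type Node struct {
	ID     int
	Online bool
	Routes []string // advertised subnet prefixes, e.g. "10.0.0.0/24"
}

// advertises reports whether n advertises the given prefix.
func advertises(n Node, prefix string) bool {
	for _, r := range n.Routes {
		if r == prefix {
			return true
		}
	}
	return false
}

// FailoverPlan maps each route of the disconnected node to the ID of the first
// online node that advertises the same route. Routes with no online backup are
// left out of the plan, so the disconnected node would keep them.
func FailoverPlan(lost Node, all []Node) map[string]int {
	plan := make(map[string]int)
	for _, route := range lost.Routes {
		for _, candidate := range all {
			if candidate.ID == lost.ID || !candidate.Online {
				continue
			}
			if advertises(candidate, route) {
				plan[route] = candidate.ID
				break
			}
		}
	}
	return plan
}
```

Because a route is only moved when an online backup exists, a short-lived disconnect can at worst cause an unnecessary switch; it never leaves a route without a serving node.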

vsychov avatar May 10 '23 09:05 vsychov

Any progress or update on this feature request?

RaheelJameel avatar Oct 31 '23 22:10 RaheelJameel

I believe this will be addressed and improved when 0.23.0 lands, as part of #1564

kradalby avatar Oct 31 '23 22:10 kradalby

This issue is stale because it has been open for 90 days with no activity.

github-actions[bot] avatar Jan 30 '24 01:01 github-actions[bot]

Could you check this in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4?

kradalby avatar Feb 15 '24 10:02 kradalby

Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5?

kradalby avatar Feb 19 '24 14:02 kradalby

Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?

kradalby avatar Apr 17 '24 06:04 kradalby

Thanks @kradalby, I'll run tests today or tomorrow.

vsychov avatar Apr 18 '24 07:04 vsychov

I believe the fixes in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha12 should resolve this issue. Let me know if not and we will reopen it.

kradalby avatar May 24 '24 09:05 kradalby