Frequent "offline" status causing subnet router re-election and connection disruptions

Open vsychov opened this issue 1 year ago • 10 comments

Hello,

I have noticed a recurring issue where I often see console messages from headscale indicating that a machine has gone "offline", even though the machine is actually online and has no issues with its internet connection. As I am using tailscale as a subnet router, this results in re-election of the "primary route" if such a machine was being used as the "primary route", leading to connection disruptions.

It appears that the problem lies in how a machine is marked "offline", which relies on the last_seen field in the database. A machine is considered offline once its last_seen value is more than 60 seconds old (keepAliveInterval). Therefore, a delay of even one extra second can mark the machine offline and cause a new subnet router to be elected.

It looks like the last_seen field is updated in keepAliveTicker and a few other places, which happens only every 40-60 seconds in my setup. That is not frequent enough.
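
A minimal sketch of the threshold check described above, to show how little margin it leaves. The helper `isOnline` and its shape are assumptions for illustration and are not taken from the headscale source; only the 60-second `keepAliveInterval` value comes from the description.

```go
package main

import (
	"fmt"
	"time"
)

// keepAliveInterval mirrors the 60-second threshold described in the issue.
const keepAliveInterval = 60 * time.Second

// isOnline is a hypothetical helper (not headscale's actual code): a node
// counts as online only while its last_seen age does not exceed the threshold.
func isOnline(sinceLastSeen time.Duration) bool {
	return sinceLastSeen <= keepAliveInterval
}

func main() {
	// last_seen refreshed 59 seconds ago: still within the threshold.
	fmt.Println(isOnline(59 * time.Second))
	// last_seen refreshed 61 seconds ago: one late keepalive flips the status.
	fmt.Println(isOnline(61 * time.Second))
}
```

With a refresh period of 40-60 seconds and a threshold of 60 seconds, any scheduling or network delay is enough to cross the boundary.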

From what I can see, this problem could be solved by updating the last_seen field in the updateCheckerTicker (which by default occurs every 10 seconds - NodeUpdateCheckInterval), simply by adding:

machine.LastSeen = &now

right after: https://github.com/juanfont/headscale/blob/fe75b716201a2d31bd8fe2531100e93ff7bfb4f1/hscontrol/poll.go#L561
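
To illustrate why refreshing on the shorter ticker would help, here is a small simulation under stated assumptions. It does not use headscale code: `countOfflineEvents` is a hypothetical function that steps a simulated clock one second at a time and counts how often a node would be seen as offline for a given refresh period. The two interval values are the ones quoted above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	keepAliveInterval       = 60 * time.Second // offline threshold quoted in the issue
	nodeUpdateCheckInterval = 10 * time.Second // default check interval quoted in the issue
)

// countOfflineEvents is a hypothetical simulation, not headscale logic.
// It advances a simulated clock in one-second steps and counts the seconds
// at which last_seen is older than keepAliveInterval, given that last_seen
// is refreshed every refreshEvery.
func countOfflineEvents(refreshEvery, total time.Duration) int {
	var lastSeen time.Duration // simulated time of the last refresh
	offline := 0
	for t := time.Second; t <= total; t += time.Second {
		// The status check runs before the refresh that is due at this tick.
		if t-lastSeen > keepAliveInterval {
			offline++
		}
		if t%refreshEvery == 0 {
			lastSeen = t
		}
	}
	return offline
}

func main() {
	total := 10 * time.Minute
	// A refresh that arrives one second late on every cycle.
	fmt.Println("refresh every 61s:", countOfflineEvents(61*time.Second, total))
	// A refresh tied to the shorter check interval.
	fmt.Println("refresh every 10s:", countOfflineEvents(nodeUpdateCheckInterval, total))
}
```

The simulation only models the timing argument; it says nothing about whether updating last_seen in that ticker has other side effects.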

I hope this suggestion is helpful and look forward to any feedback.

Thank you

vsychov avatar Jun 30 '23 21:06 vsychov

This might be fixed, or we might have the basis for fixing it once #1492 lands. That change starts looking at the Online field and sends updates in a different way. It might not address this directly, but it should make it easier to fix.

kradalby avatar Jul 07 '23 14:07 kradalby

This issue is stale because it has been open for 90 days with no activity.

github-actions[bot] avatar Dec 24 '23 01:12 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Dec 31 '23 01:12 github-actions[bot]

This is still an active issue in the latest stable version. Is this fixed in the latest alpha and is the latest alpha ready for use in a prod-like environment?

andreyrd avatar Jan 17 '24 13:01 andreyrd

@andreyrd we follow common software release practices, and alpha software is not recommended for production use. We need help testing it, so we release it under an alpha/beta label to signal that you need to be cautious when using it.

I believe the issue has been solved, but we need people who encounter the problem to test it. If you have the opportunity, that would be great.

kradalby avatar Jan 19 '24 08:01 kradalby

Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?

kradalby avatar Feb 19 '24 14:02 kradalby

@kradalby with the latest 0.23.5-alpha5 there is some odd behavior. I constantly see my Android clients go offline while they continue to work fine with the tailnet, but Headscale seems to stop sending updates to them. To come back online, they need to send an update themselves, for example by moving to a different network, or I have to manually restart the Tailscale connection on them.

I have seen the same thing on my Raspberry Pi, but only once, and I'm not sure what the cause was. Other Linux clients stay online without issue.

Update: Going offline is not instant. The Android nodes stay online for some time, on the order of hours. More interestingly, "offline" nodes may have a fairly fresh last seen value, like one minute ago.

Update 2: I believe it can be reproduced by switching networks, for example:

  1. Activate Tailscale on Android while being on the home Wi-Fi. Node stays online
  2. Turn off Wi-Fi, forcing the phone to switch to the mobile connection. Node stays online
  3. Turn on Wi-Fi. Node goes offline, last seen value continues to update

eNdiD avatar Feb 27 '24 08:02 eNdiD

I also ran into this issue with the 0.23.5+ version. After some investigation, I think it may be caused by the existing connection to the controller being reset (for example, switching router or Wi-Fi may change the external NAT address) while a new connection is established quickly. In that case, the old connection's deferred cleanup in poll.go may run after the new connection has been added, because the online status is now a map indexed by node key.
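
The suspected race can be sketched as follows. This is not headscale's code or its eventual fix: `onlineTracker`, `connect`, and `disconnect` are hypothetical names, and the connection-ID guard is just one possible way to keep a late cleanup from clearing the state of a newer connection.

```go
package main

import (
	"fmt"
	"sync"
)

// onlineTracker is a hypothetical stand-in for an online-status map that is
// indexed by node key, as described above.
type onlineTracker struct {
	mu     sync.Mutex
	nextID uint64
	active map[string]uint64 // node key -> ID of the connection that owns the online state
}

func newOnlineTracker() *onlineTracker {
	return &onlineTracker{active: make(map[string]uint64)}
}

// connect registers a new poll session for the node and returns its ID.
// IDs start at 1, so the map's zero value never matches a real connection.
func (t *onlineTracker) connect(nodeKey string) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.nextID++
	t.active[nodeKey] = t.nextID
	return t.nextID
}

// disconnect is what a session's deferred cleanup would call. It marks the
// node offline only if this session still owns the entry, so a stale session
// cannot undo the registration made by a newer one.
func (t *onlineTracker) disconnect(nodeKey string, connID uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.active[nodeKey] == connID {
		delete(t.active, nodeKey)
	}
}

func (t *onlineTracker) isOnline(nodeKey string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.active[nodeKey]
	return ok
}

func main() {
	t := newOnlineTracker()
	oldConn := t.connect("node-a") // session before the network switch
	newConn := t.connect("node-a") // session after the network switch
	t.disconnect("node-a", oldConn) // late cleanup of the old session is ignored
	fmt.Println(t.isOnline("node-a"))
	t.disconnect("node-a", newConn) // cleanup of the current session takes effect
	fmt.Println(t.isOnline("node-a"))
}
```

Without the ID comparison in `disconnect`, the late cleanup would delete the entry keyed by node key and the node would appear offline while its new session keeps running, which matches the reported symptom.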

fortitudepub avatar Mar 19 '24 02:03 fortitudepub

Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?

kradalby avatar Apr 17 '24 06:04 kradalby

Thanks @kradalby, I'll run tests today or tomorrow.

vsychov avatar Apr 18 '24 07:04 vsychov

I believe the fixes in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha12 should resolve this issue. Let me know if not and we will reopen it.

kradalby avatar May 24 '24 09:05 kradalby