Headscale removes node after it was re-added
Bug description
There seems to be an issue with nodes that hop between different networks. The most common scenario is a mobile device switching between a mobile network and WiFi. When the node switches networks, it sends an update and Headscale adds a new update channel for it. The problem is that the new update channel overwrites the old one in the notifier instance before the cleanup of the old one takes place, so when it is time to close and delete the old channel, Headscale in fact removes the new one. I believe this happens because there is no room for multiple update channels per node: if the node is fast enough to send a new update before the cleanup, the old channel gets overwritten and the deferred remove procedure takes down the new one instead. The node then ends up offline and receives no further updates.
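To make the overwrite-then-delete race concrete, here is a minimal, self-contained sketch with a simplified stand-in for the notifier (type and function names are illustrative, not Headscale's actual ones):

	package main

	import (
		"fmt"
		"sync"
	)

	// simplified stand-in for the notifier: a single channel per machine key
	type fakeNotifier struct {
		mu    sync.Mutex
		nodes map[string]chan struct{}
	}

	func (n *fakeNotifier) add(key string, c chan struct{}) {
		n.mu.Lock()
		defer n.mu.Unlock()
		n.nodes[key] = c // a second add silently overwrites the previous channel
	}

	func (n *fakeNotifier) remove(key string) {
		n.mu.Lock()
		defer n.mu.Unlock()
		delete(n.nodes, key) // deletes whatever channel happens to be stored now
	}

	func main() {
		n := &fakeNotifier{nodes: make(map[string]chan struct{})}

		oldChan := make(chan struct{}) // long-poll opened over WiFi
		newChan := make(chan struct{}) // long-poll opened over the mobile network

		n.add("node-1", oldChan) // old handler registers its channel
		n.add("node-1", newChan) // node reconnects quickly: new handler overwrites the entry
		n.remove("node-1")       // old handler's deferred cleanup fires and deletes the NEW channel

		_, ok := n.nodes["node-1"]
		fmt.Println("node still has an update channel:", ok) // prints: false -> node looks offline
	}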
Environment
- OS: Ubuntu 22.04 5.15.0-97-generic
- Headscale version: v0.23.0-alpha5
- Tailscale version: v1.60.0, but could be any actually
- [ ] Headscale is behind a (reverse) proxy
- [ ] Headscale runs in a container
To Reproduce
Swap any node between networks and make it send new updates. Observe how the update channels change.
Logs and attachments
I've come up with a somewhat silly and naive quick-fix, and I don't think it's a good one, because it may fix the outcome but not the original problem. Anyway, I haven't tested it yet; it's just an idea.
First, we add a channel counter for each machine:
type Notifier struct {
	l         sync.RWMutex
	nodes     map[string]chan<- types.StateUpdate
	connected map[key.MachinePublic]bool
	hits      map[key.MachinePublic]int // channel counter goes here
}

func NewNotifier() *Notifier {
	return &Notifier{
		nodes:     make(map[string]chan<- types.StateUpdate),
		connected: make(map[key.MachinePublic]bool),
		hits:      make(map[key.MachinePublic]int),
	}
}
Then we increment this counter in the AddNode function:
func (n *Notifier) AddNode(machineKey key.MachinePublic, c chan<- types.StateUpdate) {
	log.Trace().Caller().Str("key", machineKey.ShortString()).Msg("acquiring lock to add node")
	defer log.Trace().
		Caller().
		Str("key", machineKey.ShortString()).
		Msg("releasing lock to add node")

	n.l.Lock()
	defer n.l.Unlock()

	n.nodes[machineKey.String()] = c
	n.connected[machineKey] = true
	n.hits[machineKey]++ // increment counter

	log.Trace().
		Str("machine_key", machineKey.ShortString()).
		Int("open_chans", len(n.nodes)).
		Int("chan_hits", n.hits[machineKey]).
		Msg("Added new channel")
}
And finally, we decrement the counter in RemoveNode and remove the node completely only if this was its last channel:
func (n *Notifier) RemoveNode(machineKey key.MachinePublic) {
	log.Trace().Caller().Str("key", machineKey.ShortString()).Msg("acquiring lock to remove node")
	defer log.Trace().
		Caller().
		Str("key", machineKey.ShortString()).
		Msg("releasing lock to remove node")

	n.l.Lock()
	defer n.l.Unlock()

	if len(n.nodes) == 0 {
		return
	}

	// only decrement the counter if there is more than one open channel
	if n.hits[machineKey] > 1 {
		n.hits[machineKey]--
		log.Trace().
			Str("machine_key", machineKey.ShortString()).
			Int("open_chans", len(n.nodes)).
			Int("chan_hits", n.hits[machineKey]).
			Msg("Held channel")
		return
	}

	// remove the channel if it's the last one indeed
	delete(n.nodes, machineKey.String())
	n.connected[machineKey] = false
	n.hits[machineKey] = 0

	log.Trace().
		Str("machine_key", machineKey.ShortString()).
		Int("open_chans", len(n.nodes)).
		Msg("Removed channel")
}
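For what it's worth, here is how I'd expect the reconnect sequence from the bug description to play out with the patched Notifier (machineKey, oldChan and newChan are placeholder names):

	n.AddNode(machineKey, oldChan) // hits = 1, channel for the old connection stored
	n.AddNode(machineKey, newChan) // hits = 2, entry overwritten by the new channel
	n.RemoveNode(machineKey)       // hits > 1, so only hits-- -> "Held channel", node stays online
	// ... later, when the new long-poll actually ends ...
	n.RemoveNode(machineKey)       // hits = 1, so the channel is really removed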
Okay, I've tested this "workaround" and it seems to work. No undesired offline nodes so far.
2024-02-29T19:12:51Z TRC Added new channel chan_hits=1 machine_key=[Yi5x7] open_chans=1
2024-02-29T19:12:51Z TRC Added new channel chan_hits=1 machine_key=[qOHEd] open_chans=2
2024-02-29T19:12:51Z TRC Added new channel chan_hits=1 machine_key=[OwOgH] open_chans=3
2024-02-29T19:12:51Z TRC Added new channel chan_hits=1 machine_key=[B0TdZ] open_chans=4
2024-02-29T19:12:51Z TRC Added new channel chan_hits=1 machine_key=[F0v/+] open_chans=5
2024-02-29T19:12:51Z TRC Added new channel chan_hits=1 machine_key=[WHy0m] open_chans=6
2024-02-29T19:13:04Z TRC Removed channel machine_key=[F0v/+] open_chans=5
2024-02-29T19:13:10Z TRC Added new channel chan_hits=1 machine_key=[TlXS+] open_chans=6
2024-02-29T19:13:19Z TRC Added new channel chan_hits=1 machine_key=[F0v/+] open_chans=7
2024-02-29T19:13:19Z TRC Removed channel machine_key=[F0v/+] open_chans=6
2024-02-29T19:13:19Z TRC Added new channel chan_hits=1 machine_key=[F0v/+] open_chans=7
2024-02-29T19:14:07Z TRC Added new channel chan_hits=2 machine_key=[F0v/+] open_chans=7
2024-02-29T19:16:26Z TRC Held channel chan_hits=1 machine_key=[F0v/+] open_chans=7
2024-02-29T19:16:26Z TRC Removed channel machine_key=[F0v/+] open_chans=6
2024-02-29T19:16:32Z TRC Added new channel chan_hits=1 machine_key=[F0v/+] open_chans=7
2024-02-29T19:18:32Z TRC Added new channel chan_hits=2 machine_key=[F0v/+] open_chans=7
2024-02-29T19:22:03Z TRC Held channel chan_hits=1 machine_key=[F0v/+] open_chans=7
"Held channel" here means that the channel was prevented from being deleted and the node from going offline.
2024-02-29T19:13:19Z TRC PeerChange received endpoints=["<censored-isp-ip>:54102","10.45.61.70:54102","192.168.88.15:54102"] hostname="Google Pixel 7 Pro" last_seen=1709233999 node_id=nodeid:2 online=true
2024-02-29T19:13:19Z TRC PeerChange received endpoints=["<censored-isp-ip>:54102","10.45.61.70:54102","192.168.88.15:54102"] hostinfo_changed=true hostname="Google Pixel 7 Pro" last_seen=1709233999 node_id=nodeid:2 online=true
2024-02-29T19:14:07Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102"] hostinfo_changed=true hostname="Google Pixel 7 Pro" last_seen=1709234047 node_id=nodeid:2 online=true
2024-02-29T19:16:32Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102","192.168.88.15:54102"] hostname="Google Pixel 7 Pro" last_seen=1709234192 node_id=nodeid:2 online=false
2024-02-29T19:16:47Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102","192.168.88.15:54102"] hostname="Google Pixel 7 Pro" last_seen=1709234207 node_id=nodeid:2 online=true
2024-02-29T19:17:08Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102","192.168.88.15:54102"] hostname="Google Pixel 7 Pro" last_seen=1709234228 node_id=nodeid:2 online=true
2024-02-29T19:17:55Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102","192.168.88.15:54102"] hostname="Google Pixel 7 Pro" last_seen=1709234275 node_id=nodeid:2 online=true
2024-02-29T19:18:32Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102"] hostinfo_changed=true hostname="Google Pixel 7 Pro" last_seen=1709234312 node_id=nodeid:2 online=true
2024-02-29T19:18:32Z TRC PeerChange received endpoints=["<censored-mobile-ip>:7872","<censored-isp-ip>:54102","10.45.61.70:54102"] hostname="Google Pixel 7 Pro" last_seen=1709234312 node_id=nodeid:2 online=true
But the thing is, it's rather difficult for me to propose anything better than this.
Could you please test the latest alpha?
I believe the fixes in https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha12 should resolve this issue; let me know if not and we will reopen it.