[Bug] Nodes fail to remove from notifier - ERR update not sent
Is this a support request?
- [X] This is not a support request
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
There is a two-part bug related to node notification updates. When a node loses its connection or its system goes to sleep, the logic expects that Notifier.RemoveNode will be called before the next sendAll loop, so that the affected node is no longer in the list.
I have systems coming in and out of my network during the day, and when some of them go offline they are not correctly removed from the notifier. This results in the notifier.sendAll function emitting the following logs until a disconnect event is finally received, or Headscale is restarted.
2024-10-11T13:14:50Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:15Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:20Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:39Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:44Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:49Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:15:59Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:16:04Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:16:15Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:16:24Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:16:31Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:16:32Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:16:40Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:16:52Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:17:12Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:17:14Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:17:27Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:17:32Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:17:35Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:18:08Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:31:12Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:31:15Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:31:17Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:31:32Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:31:34Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
2024-10-11T13:31:37Z ERR update not sent, context cancelled error="context deadline exceeded" node.id=66
The first problem is that Notifier.RemoveNode is not called in every case where a node goes offline. The second is that, once that happens, the logic in sendAll does not work as intended: because it returns early on a failed send, any node that comes after the failed node in the n.nodes array will not receive updates.
I have yet to pinpoint why the first issue occurs. My suspicion is that the desktop Tailscale clients do not correctly signal a disconnect when a machine goes to sleep, though they eventually do at some point, since the errors above stop after a while once the sendAll function is updated to continue looping through nodes rather than returning early (see the sketch below).
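To make the second issue concrete, here is a minimal, self-contained Go sketch of the pattern described above. This is not the actual Headscale source; node, sendAllEarlyReturn, and sendAllContinue are illustrative names and the 3-second timeout is arbitrary. It only shows why an early return starves every node listed after the one that timed out, while continuing the loop does not.

package main

import (
	"context"
	"fmt"
	"time"
)

type node struct {
	id int
	ch chan string
}

// sendAllEarlyReturn mimics the problematic pattern: the first node whose
// send times out aborts the whole loop, so every node after it in the slice
// never receives the update.
func sendAllEarlyReturn(nodes []node, update string) error {
	for _, n := range nodes {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		select {
		case n.ch <- update:
		case <-ctx.Done():
			cancel()
			return fmt.Errorf("update not sent, context cancelled: %w (node.id=%d)", ctx.Err(), n.id)
		}
		cancel()
	}
	return nil
}

// sendAllContinue logs the failure and keeps looping, so healthy nodes still
// receive the update even while a stale node lingers in the list.
func sendAllContinue(nodes []node, update string) {
	for _, n := range nodes {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		select {
		case n.ch <- update:
		case <-ctx.Done():
			fmt.Printf("ERR update not sent, context cancelled error=%q node.id=%d\n", ctx.Err(), n.id)
		}
		cancel()
	}
}

func main() {
	// Two nodes: the first has no reader attached (simulating a node that went
	// to sleep), the second has a buffered channel and could accept the update.
	stale := node{id: 66, ch: make(chan string)}
	healthy := node{id: 67, ch: make(chan string, 1)}

	// Early-return variant: node 67 never gets the update once node 66 times out.
	if err := sendAllEarlyReturn([]node{stale, healthy}, "mapresponse"); err != nil {
		fmt.Println("early-return loop aborted:", err)
	}

	// Continue variant: the error is logged, but node 67 still receives the update.
	sendAllContinue([]node{stale, healthy}, "mapresponse")
	fmt.Println("healthy node received:", <-healthy.ch)
}

Running this, the early-return variant aborts at node.id=66 and node 67 never sees the update; the continue variant logs the same error but node 67 still receives it, which matches the behavior I observed after patching the loop.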
Expected Behavior
- When a node goes offline, it is correctly removed from the notifier's list of available nodes.
- Even if a node is unreachable, the remaining nodes should continue to receive updates.
Steps To Reproduce
I've personally found this tricky to reproduce. I'm deployed using the deb installation package plus systemd and have anywhere between 20 and 40 nodes on the network at a given time. Not every sleep event on a machine, or removal of an ephemeral node, causes the error to occur.
Environment
- OS: Debian Bookworm
- Headscale version: 0.23.0
- Tailscale version: 1.74.0
Runtime environment
- [ ] Headscale is behind a (reverse) proxy
- [ ] Headscale runs in a container
Anything else?
No response
@kradalby If we remove the return, is this problem solved?
I encountered the same problem; here is my log: serve.log
Seeing this as well, I have to restart Headscale every few days when this is triggered to get my network working again.
FWIW, as a temporary fix I set up a script (inspired by a Tailscale bug that several of us did the same for) that just watches the headscale logs and restarts it if it detects this error (or the panic bug mentioned above, which seems to lead to this state anyway).
This is for headscale running in a podman container, and is designed to be run by a systemd user unit, but here it is as an example:
#!/bin/bash
logger -t headscale-monitor -p user.info "Monitoring headscale logs..."
while true; do
    # Follow logs for headscale, ignoring any old lines
    podman logs --follow --tail 0 headscale 2>&1 | while read line; do
        if echo "$line" | grep -q 'update not sent, context cancelled error="context deadline exceeded"'; then
            logger -t headscale-monitor -p user.info "$line"
            logger -t headscale-monitor -p user.warning "Context cancelled bug detected. Restarting..."
            podman restart headscale
            # Give it a chance to restart
            sleep 10
            # Break out of the inner loop to discard any additional logs and start listening fresh
            break
        elif echo "$line" | grep -q "invalid memory address or nil pointer dereference"; then
            logger -t headscale-monitor -p user.info "$line"
            logger -t headscale-monitor -p user.warning "Invalid memory address bug detected. Restarting..."
            podman restart headscale
            # Give it a chance to restart
            sleep 10
            # Break out of the inner loop to discard any additional logs and start listening fresh
            break
        fi
    done
done
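In case it's useful, here is roughly what an accompanying systemd user unit could look like. The unit name and script path are placeholders; adjust them to wherever you keep the script.

[Unit]
Description=Watch headscale logs and restart the container on known bad states

[Service]
# Path to the monitor script above; adjust to your setup.
ExecStart=%h/bin/headscale-monitor.sh
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target

Enable it as the same user that runs the podman container with systemctl --user enable --now headscale-monitor.service, and run loginctl enable-linger for that user if it should keep running while you are logged out.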
As part of solving an update issue with tags and some longstanding architecture issues, this part has been changed so we should no longer drop updates on the floor in this case.
This is part of the work in #2670, particularly https://github.com/juanfont/headscale/commit/b6d5788231c865e3d7dfa81008349a0142fdde9d, and is now in main. If you have the ability to test it, that would be greatly appreciated, either through main or through an alpha/beta that we hopefully can release once we close out more of the milestone issues.
@dustinblackman I'll close this, as the architecture has changed so much for the next release that the error by itself isn't going to be the same. Let's try that out and open a new issue if we still see problems.
This sounds great, thank you very much! I'll spin it up in a test cluster this weekend.
Getting this issue also. Is there any plan for a release?