
[Bug] Headscale serving subnet from offline node.

Open pupaxxo opened this issue 4 months ago • 6 comments

Is this a support request?

  • [x] This is not a support request

Is there an existing issue for this?

  • [x] I have searched the existing issues

Current Behavior

Headscale is serving routes from an offline node:

The nodes list command output is:

ID | Hostname                    | Name                        | MachineKey | NodeKey | User     | IP addresses                    | Ephemeral | Last seen           | Expiration          | Connected | Expired
3  | <redacted>                         | <redacted>                         | [0eCEd]    | [/eEPF] | internal | 100.65.0.4, fd7a:115c:a1e1::4   | false     | 2025-08-22 10:04:40 | N/A                 | online    | no
4  | <redacted> | <redacted> | [1T/pb]    | [n9Srg] | cassa    | 100.65.0.5, fd7a:115c:a1e1::5   | false     | 2025-08-22 09:45:02 | N/A                 | online    | no
6  | <redacted>        | <redacted>           | [lrXNR]    | [yhjhi] | cassa    | 100.65.0.7, fd7a:115c:a1e1::7   | false     | 2025-07-04 08:17:20 | 1970-01-01 00:02:03 | offline   | yes
7  | <redacted>                      | <redacted>                      | [Cpdpz]    | [f6fqC] | cassa    | 100.65.0.1, fd7a:115c:a1e1::1   | false     | 2025-08-22 10:05:40 | N/A                 | offline   | no
8  | <redacted>                      | <redacted>                      | [vqbGX]    | [N+zcw] | cassa    | 100.65.0.6, fd7a:115c:a1e1::6   | false     | 2025-08-22 10:03:27 | N/A                 | offline   | no
9  | <redacted>                      | <redacted>                      | [MNbbI]    | [88f3J] | cassa    | 100.65.0.8, fd7a:115c:a1e1::8   | false     | 2025-08-22 10:12:26 | N/A                 | online    | no
10 | <redacted>                      | <redacted>                      | [WwX6Q]    | [WEaUd] | cassa    | 100.65.0.9, fd7a:115c:a1e1::9   | false     | 2025-08-22 10:08:11 | N/A                 | online    | no
11 | <redacted>                     | <redacted>                     | [xkeLX]    | [5XTVi] | cassa    | 100.65.0.11, fd7a:115c:a1e1::b  | false     | 2025-08-22 10:10:01 | N/A                 | online    | no
12 | <redacted>         | <redacted>         | [Cv2I7]    | [g5cTw] | cassa    | 100.65.0.12, fd7a:115c:a1e1::c  | false     | 2025-08-22 10:13:49 | N/A                 | offline   | no
13 | <redacted>              | <redacted>              | [sbrlf]    | [YQdPd] | cassa    | 100.65.0.13, fd7a:115c:a1e1::d  | false     | 2025-08-22 10:13:05 | N/A                 | offline   | no
14 | <redacted>                | <redacted>                | [elQmK]    | [2sMcI] | cassa    | 100.65.0.16, fd7a:115c:a1e1::10 | false     | 2025-08-22 09:24:10 | N/A                 | offline   | no

The routes list response is:

ID | Hostname                    | Approved                                                           | Available                                                          | Serving (Primary)
4  | <redacted> | 192.168.200.80/32                                                  | 192.168.200.80/32                                                  | 192.168.200.80/32
6  | <redacted>           | 192.168.2.82/32                                                    | 192.168.2.82/32                                                    |
7  | <redacted>                      | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 |
8  | <redacted>                      | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 |
9  | <redacted>                      | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 |
10 | <redacted>                      | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 |
11 | <redacted>                     | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 |
12 | <redacted>         | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 |
13 | <redacted>              | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 |
14 | <redacted>                | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32 | 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32, 192.168.3.68/32

As you can see, node 14 is serving the primary routes for 192.168.3.44/32, 192.168.3.60/32, 192.168.3.63/32 and 192.168.3.68/32 even though it is marked as "offline" in the nodes list.
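A rough way to spot this condition automatically is to cross-check the two tables. The sketch below is not part of Headscale. It assumes the pipe-separated layout shown above has been captured as text, and the hostnames (`node-4`, `node-9`, `node-14`) are placeholders, since the real ones are redacted.

```python
def parse_table(text):
    """Parse a pipe-separated table into a list of dicts keyed by header."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = [h.strip() for h in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def offline_primaries(nodes_text, routes_text):
    """Return (node_id, serving_routes) for nodes that serve while not online."""
    connected = {r["ID"]: r["Connected"] for r in parse_table(nodes_text)}
    flagged = []
    for row in parse_table(routes_text):
        serving = row.get("Serving (Primary)", "")
        if serving and connected.get(row["ID"]) != "online":
            flagged.append((int(row["ID"]), serving))
    return flagged


# Trimmed-down sample in the same layout as the output above (hypothetical hostnames).
NODES = """
ID | Hostname | Connected
4  | node-4   | online
9  | node-9   | online
14 | node-14  | offline
"""

ROUTES = """
ID | Hostname | Approved          | Available         | Serving (Primary)
4  | node-4   | 192.168.200.80/32 | 192.168.200.80/32 | 192.168.200.80/32
9  | node-9   | 192.168.3.44/32   | 192.168.3.44/32   |
14 | node-14  | 192.168.3.44/32   | 192.168.3.44/32   | 192.168.3.44/32
"""

if __name__ == "__main__":
    for node_id, routes in offline_primaries(NODES, ROUTES):
        print(f"node {node_id} is offline but still serving: {routes}")
```

With the sample data only node 14 is flagged, which matches the situation described in this report.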

Expected Behavior

The serving (primary) routes switch to an online node that has the same routes approved.
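To make the expectation concrete, here is a minimal sketch of the failover I would expect. It is only an illustration and not Headscale's actual election logic. The function name `elect_primaries` and the "lowest online node ID wins" tie-break are assumptions made for the example.

```python
def elect_primaries(current, advertisers, online):
    """
    current:     dict route -> node ID that is primary right now
    advertisers: dict route -> node IDs with the route approved and available
    online:      set of node IDs that are currently connected
    Returns a dict route -> node ID that should serve it (None if nobody can).
    """
    result = {}
    for route, candidates in advertisers.items():
        incumbent = current.get(route)
        if incumbent in online and incumbent in candidates:
            # Keep a healthy primary in place to avoid needless flapping.
            result[route] = incumbent
            continue
        healthy = sorted(n for n in candidates if n in online)
        # Assumed tie-break: the lowest online node ID takes over.
        result[route] = healthy[0] if healthy else None
    return result


# Data taken from the tables in this report.
ONLINE = {3, 4, 9, 10, 11}
ADVERTISERS = {
    "192.168.200.80/32": [4],
    "192.168.2.82/32": [6],
    "192.168.3.44/32": [7, 8, 9, 10, 11, 12, 13, 14],
}
CURRENT = {"192.168.200.80/32": 4, "192.168.3.44/32": 14}

if __name__ == "__main__":
    for route, node in elect_primaries(CURRENT, ADVERTISERS, ONLINE).items():
        print(route, "->", node)
```

Under these assumptions 192.168.3.44/32 would move from node 14 to node 9, 192.168.200.80/32 would stay on node 4, and 192.168.2.82/32 would have no primary because its only advertiser (node 6) is offline.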

Steps To Reproduce

  1. Approve the same route on multiple nodes.
  2. Take a node offline.

Environment

- OS: Debian
- Headscale version: 0.26.1
- Tailscale version: Latest

Runtime environment

  • [x] Headscale is behind a (reverse) proxy
  • [ ] Headscale runs in a container

Debug information

The policy allows all traffic. Headscale has the default configuration.

pupaxxo avatar Aug 22 '25 10:08 pupaxxo

> The serving routes switches to an online node.

How did node 14 go offline (OS shutdown, stopping tailscaled, power-off, …)? Do you see anything related in the logs for node 14? Does the switch to another node work if you "properly" stop tailscaled on node 14?

You might be affected by this: https://headscale.net/development/ref/routes/#high-availability

nblock avatar Aug 22 '25 11:08 nblock

Hi,

The node was not properly shut down, but judging from the headscale nodes list command output, headscale does seem to detect the node as offline. The other node that was not properly shut down took a few minutes before it was detected as offline.

pupaxxo avatar Aug 22 '25 12:08 pupaxxo

> the node was not properly shut down, but, from the headscale nodes list command output, headscale seems to detect the node as offline. The other node that was not properly shut down required a few minutes to start being detected as offline.

Can you check whether node switching is fast when the primary node is properly shut down? It should be fairly quick.

It seems to be related to: #2129

nblock avatar Aug 23 '25 15:08 nblock

Do you have the opportunity to test main? The logic has changed a bit, and it might have improved with the latest changes.

kradalby avatar Sep 09 '25 09:09 kradalby

@pupaxxo It'd be great if you could test with 0.27.0-beta.1.

nblock avatar Oct 19 '25 15:10 nblock

Hi! Sorry for the delayed response. The system is currently in use, so it's not easy to test. I'll try to schedule a live test with the customer in the next few days.

pupaxxo avatar Oct 20 '25 19:10 pupaxxo

It should work in 0.27.1. Let us know if you still find this issue in your environment.

nblock avatar Nov 12 '25 05:11 nblock