Multi server support
Feature request: Multiple server support
A way to have multiple server nodes so that we can have a mesh and each user can connect to the nearest server.
I have servers in several different regions, so it would be better if I could have one node per region: latency would be low and the load would also be spread out.
Thanks!!
I have not tried it, but I suppose that if you have a central database, sharing it across control servers may be possible. The downside is that doing so bypasses any application-level locks, which may introduce race conditions around machine registration and IP address allocation.
But then again, since the control server should only provide information about a network, and not forward any traffic itself, I am not sure it is worth optimizing it for latency. If my interpretation is correct, "latency" in this sense would be "how quickly existing clients observe a new node joining the network".
In principle, the amount of traffic going between the control server (headscale) and the tailscale clients is not really affected by latency (unless it's so bad that requests time out, but then I suspect you have other issues).
For DERP relays, lower latency could make sense, and you can host those separately from headscale.
There are scenarios where running multiple servers, and allowing them to connect to each other, would make sense:
- Multiple headscale servers, allowing two companies/owners to share nodes between them
- Redundancy
- What consequences would nodes being shared across control servers have on ACLs and security in general? I think it can be done safely as long as the servers can keep track whence each tag/rule originates, but it sure sounds easier to screw up than a single central ACL.
- As for redundancy, I think moving all state (including any locks around db insertions) to the database server (how does gin handle transactions?) would allow the setup mentioned above, and it should be simpler than anything that would depend on control servers directly communicating.
Though even if the database could be shared, one also needs to ensure the ACLs and the DERP map remain consistent across control servers.
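To make the shared-database idea a bit more concrete, here is a minimal Go sketch (not headscale's actual code) of pushing the allocation lock into the database itself, assuming a Postgres backend shared by several control servers. The ip_pool table, its columns, and the allocateIP function are invented purely for illustration.

```go
package ipalloc

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // any Postgres driver works; lib/pq used here as an example
)

// allocateIP reserves the next free address inside a single database
// transaction, so two control servers sharing the same database cannot
// hand out the same IP. Table and column names are illustrative only.
func allocateIP(ctx context.Context, db *sql.DB, machineKey string) (string, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return "", err
	}
	defer tx.Rollback() // no-op if the transaction was already committed

	// Row-level lock: concurrent allocations on other servers either wait
	// or skip to the next free row instead of racing for the same one.
	var next string
	err = tx.QueryRowContext(ctx,
		`SELECT ip FROM ip_pool WHERE machine_key IS NULL
		 ORDER BY ip LIMIT 1 FOR UPDATE SKIP LOCKED`).Scan(&next)
	if err != nil {
		return "", fmt.Errorf("no free addresses: %w", err)
	}

	if _, err := tx.ExecContext(ctx,
		`UPDATE ip_pool SET machine_key = $1 WHERE ip = $2`,
		machineKey, next); err != nil {
		return "", err
	}

	return next, tx.Commit()
}
```

With the lock held by the database rather than by one headscale process, any number of control servers could call this concurrently without duplicate allocations, which is the crux of the redundancy setup described above.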
This may sound crazy, but how about removing hard dependencies on exact config files and datastores/schemas, and letting users write their own behaviour in some glue/scripting language? As long as APIs are provided to them, they can decide what ACLs exist (for updates, just ask their script again); they'll know whether the rules they wish to hand out are handcrafted, come from a config file or a database, or are generated on the fly. Same for nodes: instead of hardwiring the address allocation/node listing logic, call into their machine_register or machine_enumerate functions - this way they can share nodes, set up their own machine registration logic (which would allow any authentication machinery to be used, including external OIDC/SAML/Basic Auth/mTLS/Kerberos solutions, without the control server needing to care), and share users in any manner they wish.
The upside is, the control server becomes a lot simpler and a lot more flexible. The downside is, scripting one's own DB access and the like is easier to screw up than relying on something shipped with the control server, and now the control server really needs to get the admin-facing API right. I think the former can be balanced out to a high degree by providing high quality samples and docs, but it is still more work for the user.
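To illustrate what such hooks could look like, here is a rough Go sketch of the kind of interface an operator's glue code might implement. This is purely hypothetical, nothing headscale exposes today; every type and method name is invented.

```go
package hooks

import (
	"context"
	"net/netip"
)

// Machine is a minimal view of a node as the control server would see it.
type Machine struct {
	Hostname string
	NodeKey  string
	IP       netip.Addr
	Tags     []string
}

// RegistrationHook lets the operator decide whether and how a node joins:
// the implementation could call out to OIDC, SAML, Kerberos, a shell
// script, or another control server entirely.
type RegistrationHook interface {
	// MachineRegister is asked to admit a node and assign it an address.
	MachineRegister(ctx context.Context, nodeKey, hostname string) (Machine, error)
	// MachineEnumerate returns the nodes this control server should know
	// about, possibly merged from several sources.
	MachineEnumerate(ctx context.Context) ([]Machine, error)
}

// ACLHook regenerates the policy whenever the control server needs it,
// so rules can come from a file, a database, or be computed on the fly.
type ACLHook interface {
	CurrentPolicy(ctx context.Context) ([]byte, error)
}
```

The control server would only ever talk to these interfaces; whether the data behind them lives in one database, several, or another organisation's server would be the operator's concern.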
I definitely see the need for multisite/multiregion deployments for redundancy purposes (like nebula lighthouses).
One site dying shouldn't take down the communication for the rest of them.
I would welcome it as well. Our admin stack is struggling with outages and the headscale VM also keeps crashing, so high availability would be extremely desirable. We could have one headscale server in Canada and a second one in Switzerland, and if one of them goes down, for whatever reason, we can still continue to work. So far our master admin always has to fix the whole thing and work with a proxy that is not secured, which we would like to prevent. While headscale is down, clients that have restarted can't connect and work comes to a standstill.
Hi @0n1cOn3 :)
The main objective of Headscale is to provide a correct implementation of the Tailscale protocol & control server - for hobbyists and self-hosters. We might work on supporting HA setups in the future, but that's not a short-term goal.
For those kinds of requests I would recommend the official Tailscale.com SaaS + Tailnet Lock.
Or send us a PR :) PRs are always welcome!
Hi @juanfont
Thanks for your answer. Yes, we (n64.cc) do self-hosting and don't want to rely on other people's "computers".
Maybe I'm just asking too much 😂 Unfortunately I'm not able to program, otherwise I would very much like to implement this somehow and open a PR. But as a hobby system/cloud administrator, I'm mostly dependent on those who can program.
While we appreciate the suggestion, it is out of scope for this project and not something we will work on for now.
Thanks for the answer @kradalby
Too bad, because HA for Tailscale would certainly be a groundbreaking possibility. Our community unfortunately has the problem that the main server running Tailscale randomly goes down again and again, which is how this idea arose. I would welcome it if this idea were implemented at some later time.
Thank you very much.
Maybe there would be the possibility, if Tailscale is down, for a client to act as a standby and take over the authentication task for a temporary period, at least for the already logged-in clients. I do see some challenges in addressing this, though.
@0n1cOn3 What about using two VMs in different availability zones/datacenters, a floating/virtual IP for headscale, and a local Postgres master/slave setup? Use keepalived to control the failover.
Floating IPs only work within the same network. Anyway, since clients use an FQDN to connect, all you really need is:
- a health checker endpoint for headscale
- a cron job to sync the database and configuration over to your slave server
- the headscale domain hosted with a DNS provider that offers API access
- a script on both headscale servers to monitor the health of the other and change the A record of your headscale domain accordingly (a rough sketch of such a script follows below)
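A rough Go sketch of that last point, the mutual health monitor, could look like the following. The peer URL, the /health path, the check interval, and the updateDNS stub are all placeholders: adjust the health endpoint to whatever your setup exposes, and fill in updateDNS with your DNS provider's actual API, since those differ for every hoster.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// peerURL is a placeholder: point it at the other headscale server's
// health endpoint, whatever path your setup provides.
const peerURL = "https://headscale-primary.example.com/health"

// updateDNS would call your DNS hoster's API to point the headscale
// domain's A record at this standby server. Left unimplemented on purpose,
// since every provider's API is different.
func updateDNS(ctx context.Context) error {
	// e.g. PATCH the A record via the provider's REST API
	return nil
}

func peerHealthy(ctx context.Context) bool {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, peerURL, nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	failures := 0
	for range time.Tick(30 * time.Second) {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		if peerHealthy(ctx) {
			failures = 0
		} else {
			failures++
		}
		cancel()

		// Only fail over after several consecutive misses, to avoid
		// flapping the A record on a single dropped check.
		if failures >= 3 {
			if err := updateDNS(context.Background()); err != nil {
				log.Printf("failed to update DNS: %v", err)
				continue
			}
			log.Printf("peer unhealthy, switched A record to standby")
			failures = 0
		}
	}
}
```

Keep in mind that DNS TTLs and client caching mean the failover is not instant, so a low TTL on the headscale record helps.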
Has anyone implemented what @rallisf1 has outlined? What's your experience? Any pitfalls?
Not yet. I have to speak with our main admin to test that.
@0n1cOn3 Were you able to try this idea?
Not yet. Our main admin has to perform this setup.