terraform-hcloud-kube-hetzner

Add Tailscale support

Open MarekPikula opened this issue 6 months ago • 18 comments

This PR introduces support for enabling Tailscale on cluster nodes. It has the following benefits:

  1. Enables a secure, private connection to all nodes without the need for a jump host.
  2. Makes it possible to route the private IP of the control plane load balancer using the Tailscale subnet router, allowing for high-availability (HA) access to the K3s socket. This makes it possible to disable the public interface on the load balancer (LB) without the need for a jump host.
  3. Constrains the firewall to disable SSH and K3s socket access on public interfaces.

This is a compromise between a public and a fully private cluster (as proposed in #1681), which would otherwise require a bastion host – adding to the cost and complexity of the deployment while reducing the scope of HA.

The support can be dynamically enabled or disabled on an existing deployment. More details about the setup are available in the README.

MarekPikula avatar Jun 05 '25 11:06 MarekPikula

/gemini review

MarekPikula avatar Jun 05 '25 11:06 MarekPikula

That's very good stuff @MarekPikula, thanks! If you can fix the new merge conflicts, that would be great. Then I will properly review along with @valkenburg-prevue-ch.

mysticaltech avatar Jun 09 '25 03:06 mysticaltech

@mysticaltech Thanks for merging the other PRs. This one is ready for review after a rebase.

MarekPikula avatar Jun 09 '25 12:06 MarekPikula

@MarekPikula Thanks a lot for all those good contributions, glad to see that this one is ready too.

If all is good, it will be merged by the end of the week along with some other PRs, either as part of v2.18 or v3.0.

mysticaltech avatar Jun 09 '25 12:06 mysticaltech

Could this be updated to make the change less specific to Tailscale? At some point in the future I'd like to run ZeroTier on each node. Doing so would be a similar process: creating a key pair, passing it into the node, and running a script to join the network. I would assume there's a similar process for CF Warp as well. A more generalised mechanism, such as passing in a cloud-init script to run on each node, would mean the module doesn't need to be updated to support more providers.

alexandradeas avatar Jun 27 '25 10:06 alexandradeas

Good stuff! I noticed you're checking whether var.enable_tailscale.auth_key starts with tskey-. Could this check be removed or updated with a comment noting that this prefix is specific to Tailscale? Headscale, the open-source alternative to the Tailscale control server, doesn't use the tskey- prefix for its preauth keys, so this might cause issues or confusion for users running Headscale.
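A minimal sketch of what a relaxed check could look like, written in Python purely for illustration (the module itself is Terraform, and this is not the PR's actual validation). The function name `validate_auth_key` and the `require_tailscale_prefix` flag are hypothetical; only the `tskey-` prefix comes from the comment above.

```python
def validate_auth_key(auth_key: str, require_tailscale_prefix: bool = True) -> str:
    """Return the stripped key if it passes the check, else raise ValueError.

    Hypothetical helper: `require_tailscale_prefix` is an assumed opt-out
    for control servers (such as Headscale) whose preauth keys do not
    start with "tskey-".
    """
    key = auth_key.strip()
    if not key:
        raise ValueError("auth key must not be empty")
    # The "tskey-" prefix is specific to Tailscale-issued keys, so the
    # check is only enforced when the caller asks for it.
    if require_tailscale_prefix and not key.startswith("tskey-"):
        raise ValueError(
            "auth key does not start with 'tskey-'; "
            "disable the prefix check when using a non-Tailscale control server"
        )
    return key
```

With the flag left at its default, a Tailscale key passes and any other key is rejected; with the flag disabled, any non-empty key is accepted.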

Ducktatorrr avatar Jul 23 '25 09:07 Ducktatorrr

@MarekPikula I like the approach that @alexandradeas is proposing. It would be good if this PR could be more generic, paving the way for Tailscale to work with this module while being added externally, so example code that enables Tailscale support would go in the examples folder. What do you think?

mysticaltech avatar Jul 28 '25 03:07 mysticaltech

Would it be an option to support generic "network-setup" scripts?

Then users could use anything they like for networking.

kube-hetzner then just consumes some expected output from these scripts (e.g. node-ip, cidr, ..) and uses the info to create config.yaml and installs k3s.
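One way to picture such a contract, as a rough sketch only: the script prints a small JSON object, and the consumer validates it before writing the k3s config. Everything here is an assumption made for illustration, including the JSON shape, the key names `node_ip` and `cidr`, and both function names. Only the `node-ip` key of the k3s config file is an existing k3s option.

```python
import ipaddress
import json

# Hypothetical contract: the network-setup script prints one JSON object
# containing at least these keys on stdout.
REQUIRED_KEYS = ("node_ip", "cidr")


def parse_network_setup_output(stdout: str) -> dict:
    """Parse and validate the (assumed) JSON output of a network-setup script."""
    data = json.loads(stdout)
    if not isinstance(data, dict):
        raise ValueError("network-setup output must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError("network-setup output is missing: " + ", ".join(missing))
    # ipaddress raises ValueError on malformed addresses or networks.
    node_ip = ipaddress.ip_address(data["node_ip"])
    cidr = ipaddress.ip_network(data["cidr"], strict=False)
    return {"node_ip": str(node_ip), "cidr": str(cidr)}


def render_k3s_config(net: dict) -> str:
    """Render the node-ip line of a k3s config.yaml from the parsed output."""
    # The cidr value is validated above but not written here; how it would
    # be used (firewall rules, routes, ...) is left open in this sketch.
    return f'node-ip: "{net["node_ip"]}"\n'
```

For example, output of `{"node_ip": "100.64.0.5", "cidr": "100.64.0.0/10"}` would yield a config containing `node-ip: "100.64.0.5"`, independent of which overlay network produced the address.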

maaft avatar Aug 04 '25 14:08 maaft

That's the way @maaft! If anyone is inspired, please do shoot a PR, or you @MarekPikula please do consider something more generic that could work in your case with Tailscale too.

@MarekPikula I really appreciate this PR, but for all the reasons above, I will close it for now. Genericity is important; otherwise there are just too many external systems to support.

mysticaltech avatar Aug 04 '25 15:08 mysticaltech

@mysticaltech FYI, I'll try to come up with a solution in the next few days.

Anything that I need to be aware of, off the top of your head?

maaft avatar Aug 05 '25 10:08 maaft

Just wanted to also let you know that I'm currently implementing something similar, achieving basically the same goals, by using Cloudflare(d) with Cloudflare's Zero Trust approach. The client in this case would be Cloudflare WARP.

I'm happy to contribute this as well. Maybe we can introduce a dedicated section for it, since I think we would then already have three remote access solutions: jump/bastion hosts with no public IPs on control/worker nodes, Tailscale, and Cloudflare.

mikeywuu avatar Aug 05 '25 10:08 mikeywuu

@maaft @mikeywuu I'll let you folks discuss who wants to attempt it. Ideally it's as generic as possible and makes maximum use of the circuitry we already have, like the private-IP-only setup or even the NAT router functionality if needed. But up to you guys.

The more minimal and generic, the better: make this module work with an outer module, and the outer modules (Tailscale, Cloudflare, etc.) can live in the examples folder, for instance.

mysticaltech avatar Aug 05 '25 11:08 mysticaltech

Hi @mysticaltech, why has this PR been reopened? I understand the need for a more generic solution, but I don't have time to dedicate to this at this point.

MarekPikula avatar Oct 09 '25 07:10 MarekPikula

@MarekPikula Again, thank you so much for the work you did here. I re-opened it so as not to forget. I will see if I can continue the PR myself and morph it into something more generic, while still adding your Tailscale support.

mysticaltech avatar Oct 09 '25 07:10 mysticaltech

@mysticaltech Thank you, sounds great!

MarekPikula avatar Oct 09 '25 08:10 MarekPikula

This is exactly what I was looking for! Can someone summarise what the required tasks are before this PR can get merged?

Also quick question:

This is a compromise between a public and a fully private cluster (as proposed in https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/pull/1681), which would otherwise require a bastion host – adding to the cost and complexity of the deployment while reducing the scope of HA

Why did you describe this as a compromise? Isn't this a fundamentally more secure and cheaper alternative to the NAT router, whilst also avoiding the single-point-of-failure issue for egress?

dustinmoris avatar Oct 21 '25 12:10 dustinmoris

I will have a reason to deploy a cluster using ZeroTier in ~2 weeks. If no one else is working on it by then, I can incorporate an approach that is known to work for ZT and TS (and would hopefully work with minimal pain for other vendors).

alexandradeas avatar Oct 21 '25 14:10 alexandradeas

@alexandradeas Please do, I'm way over my head on time these days.

mysticaltech avatar Oct 23 '25 11:10 mysticaltech