Split DNS on Talos machine config

Open btrepp opened this issue 2 years ago • 16 comments

Feature Request

Allow configuring certain domains to be forwarded to other DNS resolvers.

Description

I've been developing a Tailscale extension to allow Talos nodes to have Tailscale IPs (the long-term goal is to talk to backend services, such as storage, over a Tailscale network).

https://github.com/siderolabs/extensions/pull/154

One of the issues is that it would be great to use Tailscale's MagicDNS, so you can write things like 'nas' in your config files and DNS will point you to the correct Tailscale machine.

Tailscale includes this, however it tries to write over /etc/resolv.conf. This works great if I bind mount it, but when things go wrong, they go really wrong.

  • Ideally, this would be configurable in the machine config files, so that Talos stays in control of DNS.
  • A workaround might be running a DNS server as an extension and configuring the machine config to forward to it, much like how Tailscale runs. However, if this container is stopped, DNS stops too, and stopping all services (then pulling images) is currently the upgrade path, so this wouldn't work.
  • The other option might be being able to mark some service extensions as critical for networking, so they get rebooted/stopped at different times and the update can still be performed.
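
To make the request concrete, here is a minimal sketch of what "forwarding certain domains to other DNS resolvers" means. It is not Talos code and does not reflect any existing Talos option; `pick_resolvers`, `ROUTES`, and `DEFAULT` are hypothetical names, and the addresses are the example resolvers that appear elsewhere in this thread.

```python
from typing import Dict, List


def pick_resolvers(name: str, routes: Dict[str, List[str]], default: List[str]) -> List[str]:
    """Return the resolvers for the longest matching domain suffix, else the default."""
    labels = name.rstrip(".").lower().split(".")
    # Walk from the most specific suffix to the least specific one,
    # e.g. a.b.ts.net -> b.ts.net -> ts.net -> net.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in routes:
            return routes[suffix]
    return default


# Hypothetical split-DNS table: only *.ts.net goes to the tailnet resolver.
ROUTES = {"ts.net": ["100.100.100.100"]}
DEFAULT = ["192.168.0.1"]

print(pick_resolvers("my-machine.my-network.ts.net", ROUTES, DEFAULT))
print(pick_resolvers("example.com", ROUTES, DEFAULT))
```

Matching on whole labels (rather than a plain string suffix) keeps a name such as `nots.net` from being routed to the `ts.net` resolver.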

Current workaround

At the moment you can run a DNS server externally and configure it however you wish, but that becomes more external infrastructure to maintain. Alternatively, you can use your Tailscale IPs directly, but then you have to make sure the IPs stay aligned (and if Talos wipes a disk, you get a new IP from Tailscale).

btrepp avatar May 28 '23 01:05 btrepp

Long-term I feel we should have system extensions which are critical and run always, and probably have a way to override/inject values into resolv.conf, but many pieces are missing at the moment.

For the registry endpoint, you can use registry mirror config to resolve it to a Tailscale IP, as these are assigned in a static way.

smira avatar May 29 '23 10:05 smira

@btrepp Maybe you can clear up my confusion.. I appear to be able to use Split DNS with the extension. However, I'm running Talos in a VM on a host machine that is itself part of the tailnet. Could this be the reason Split DNS works, because DNS queries are forwarded outside of the VM to the host's DNS, which is configured with Split DNS?

Search Domains is the feature that fails, presumably because it requires edits to /etc/resolv.conf, even if it's running in said VM.

I create CP nodes named cp-0 with the tailscale extension and set the Kubernetes endpoint to be cp.ts. I've got CoreDNS running outside of Talos configured to answer with a CNAME pointing to cp-0.my-tailnet.ts.net when queried for cp.ts. This CoreDNS is configured for .ts using Split DNS. Everything seems to work... Is it going to go horribly wrong at some point, assuming I keep the VM on a host in the tailnet?

It's when I configure Search Domains for ts and use cp as the Kubernetes endpoint that something seems wrong, namely that although everything seems Healthy and the node is Ready, the node can't reach the API server at cp. Perhaps I could even configure libvirt's dnsmasq to include the search domain...
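
For readers unfamiliar with search domains, the following is a rough sketch of how a stub resolver typically expands a short name such as `cp` using a search list. It follows the usual `search`/`ndots` behaviour of resolv.conf-style resolvers as an assumption; it is not how Talos or Tailscale implement it, and `candidates` is a hypothetical helper.

```python
from typing import List


def candidates(name: str, search: List[str], ndots: int = 1) -> List[str]:
    """List the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search expansion.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        # "Enough" dots: try the name as given first, then the search list.
        return [absolute] + expanded
    # Short name: try the search list first, then the name as given.
    return expanded + [absolute]


print(candidates("cp", ["ts"]))
print(candidates("cp.ts", ["ts"]))
```

With a search domain of `ts`, the short name `cp` is only resolvable if whatever performs the lookup actually applies the search list, which is why the setting has to reach the resolver configuration used on the node.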

michaelbeaumont avatar Aug 21 '23 00:08 michaelbeaumont

Yep. I think DNS basically goes up the stack.

For me, it's metal Talos -> router. For you, it would be Talos -> VM host.

As the extension runs in a container, it doesn't change the Talos configs. I did experiment with modifying resolv.conf but ended up having a bad time with it.

btrepp avatar Aug 21 '23 09:08 btrepp

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jun 29 '24 01:06 github-actions[bot]

This would definitely still be a great feature!

michaelbeaumont avatar Jun 29 '24 17:06 michaelbeaumont

now that host-dns exists, maybe this is now possible to implement?

rgl avatar Aug 15 '24 17:08 rgl

It should work in main now with the Tailscale DNS endpoint being the first entry in nameservers and your recursive DNS resolver being the second.

smira avatar Aug 15 '24 17:08 smira

Does that mean that "Allow configuring certain domains to be forwarded to other DNS resolvers" is in main already (and not tied to Tailscale)?

rgl avatar Aug 15 '24 17:08 rgl

I don't know what you're talking about, sorry. I have no idea about Tailscale, all I said is that split DNS should work in main now.

smira avatar Aug 15 '24 17:08 smira

I do not know about Tailscale either; since you were the one mentioning it, I wanted to clarify whether this feature was tied to Tailscale. From your answer, I will assume it's not tied to Tailscale. :-)

How do I configure this? The 1.8 docs at https://www.talos.dev/v1.8/talos-guides/network/host-dns/ do not seem to mention how to configure this feature.

rgl avatar Aug 15 '24 17:08 rgl

There is no feature at all; it will just correctly iterate over the configured nameservers if one returns NXDOMAIN/SERVFAIL.

smira avatar Aug 15 '24 19:08 smira

@smira AFAICT this doesn't happen with NXDOMAIN https://github.com/siderolabs/talos/blob/7edcbbb833fc56b054ce9ecebc3416f676a51851/internal/pkg/dns/dns.go#L147 assuming we're talking about https://github.com/siderolabs/talos/pull/9179

Is there anything standing in the way of just switching to coredns for node DNS as a separate service?

It's not possible to work around this either, because the order of resolvers doesn't appear to be totally under the user's control:

https://github.com/siderolabs/talos/blob/7edcbbb833fc56b054ce9ecebc3416f676a51851/internal/app/machined/pkg/controllers/network/dns_resolve_cache.go#L158-L172

My router DNS seems to always show up first in the list, probably because it comes from DHCP before the machine config is applied.

michaelbeaumont avatar Sep 04 '24 17:09 michaelbeaumont

I believe a DNS server shouldn't return NXDOMAIN if it doesn't know about the domain, so the DNS server is wrong (if I'm wrong, it's easy to fix).

The DNS servers on initial boot before machine config is applied can be controlled via kernel cmdline, but the machine config overwrites any DNS servers configured by other means.

smira avatar Sep 04 '24 17:09 smira

> I believe a DNS server shouldn't return NXDOMAIN if it doesn't know about the domain, so the DNS server is wrong (if I'm wrong, it's easy to fix).

I do agree, just wanted to make it clear it doesn't work with NXDOMAIN, only SERVFAIL.

I think the issue is that Tailscale uses <machine-name>.<network-name>.ts.net as FQDNs but only returns records on its network-internal resolver. Since .ts.net is a real domain, Cloudflare, for example, will return NXDOMAIN. But the network-internal resolver returns the machine IP on the TS overlay network.

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1
;; AUTHORITY SECTION:
ts.net.			300	IN	SOA	ns1.dnsimple.com. admin.dnsimple.com.

;; Query time: 20 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2
;; ANSWER SECTION:
my-machine.my-network.ts.net. 600	IN	A	100.90.80.70

;; Query time: 0 msec
;; SERVER: 100.100.100.100#53(100.100.100.100) (UDP)
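
The effect described above can be illustrated with a small simulation. This is a sketch only, not the Talos host DNS implementation: `resolve`, `public`, and `tailnet` are hypothetical stand-ins, and the response codes and address are taken from the dig output above.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

Answer = Tuple[str, Optional[str]]  # (response code, address or None)


def resolve(name: str, resolvers: List[Callable[[str], Answer]], retry_on: FrozenSet[str]) -> Answer:
    """Ask each resolver in order; move on only if the response code is in retry_on."""
    last: Answer = ("SERVFAIL", None)
    for query in resolvers:
        last = query(name)
        if last[0] not in retry_on:
            return last
    return last


def public(name: str) -> Answer:
    # Stand-in for a public recursive resolver: ts.net exists, the host record does not.
    return ("NXDOMAIN", None)


def tailnet(name: str) -> Answer:
    # Stand-in for the network-internal resolver, which knows the machine.
    records = {"my-machine.my-network.ts.net": "100.90.80.70"}
    if name in records:
        return ("NOERROR", records[name])
    return ("NXDOMAIN", None)


NAME = "my-machine.my-network.ts.net"
# Public resolver first, fallback on SERVFAIL only: the NXDOMAIN answer wins.
print(resolve(NAME, [public, tailnet], frozenset({"SERVFAIL"})))
# Tailnet resolver first: the record is found regardless of the fallback rule.
print(resolve(NAME, [tailnet, public], frozenset({"SERVFAIL"})))
```

Under these assumptions, resolver order decides the outcome whenever NXDOMAIN is treated as a final answer, which is why control over the order matters in this thread.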

> The DNS servers on initial boot before machine config is applied can be controlled via kernel cmdline, but the machine config overwrites any DNS servers configured by other means.

It doesn't, from my testing.

EDIT: removed irrelevant code refs

What I see:

❯ talosctl get resolverspec -o yaml
metadata:
    namespace: network
    type: ResolverSpecs.net.talos.dev
    id: resolvers
spec:
    dnsServers:
        - fd7a:115c:a1e0::53
        - 192.168.0.1
    layer: configuration
$ dig @fd7a:115c:a1e0::53 my-machine.my-network.ts.net
my-machine.my-network.ts.net. 600	IN	A	100.90.80.70
$ dig @169.254.116.108 my-machine.my-network.ts.net
ts.net.			10	IN	SOA	ns1.dnsimple.com. admin.dnsimple.com.
$ dig @192.168.0.1 my-machine.my-network.ts.net
ts.net.			10	IN	SOA	ns1.dnsimple.com. admin.dnsimple.com.

michaelbeaumont avatar Sep 04 '24 18:09 michaelbeaumont

Probably it makes sense to create issues with full description for both, as I don't quite understand your case.

Your tailnet resolver should come before the Cloudflare one.

DNS servers should be completely changeable with the machine config.

smira avatar Sep 04 '24 18:09 smira

Just a heads up: since #9310,

> order of resolvers doesn't appear to be totally under the user's control

is no longer true. So this should be fixed now? By that I mean that, with the recent PRs, the second workaround from the original issue should probably work.

DmitriyMV avatar Oct 13 '24 00:10 DmitriyMV

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Apr 11 '25 02:04 github-actions[bot]

> So this should be fixed now?

I'm not quite sure. This is (originally) about the way that Tailscale manages DNS itself:

> One of the issues is that it would be great to use Tailscale's MagicDNS, so you can write things like 'nas' in your config files and DNS will point you to the correct Tailscale machine.
>
> Tailscale includes this, however it tries to write over /etc/resolv.conf.

It's been a while since I looked at this but

> It should work in main now with the Tailscale DNS endpoint being the first entry in nameservers and your recursive DNS resolver being the second.

IIRC this caused problems with Tailscale trying to use its own DNS endpoint? Since those are the resolvers Tailscale uses as well. There's probably overlap with https://github.com/tailscale/tailscale/issues/12760 here too.

michaelbeaumont avatar Apr 11 '25 08:04 michaelbeaumont

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Oct 10 '25 02:10 github-actions[bot]

This issue was closed because it has been stalled for 7 days with no activity.

github-actions[bot] avatar Oct 15 '25 02:10 github-actions[bot]