
MagicDNS interferes with non-tailscale traffic on cellular WAN (DNS64)

noah-built opened this issue 3 years ago • 13 comments

Describe the bug When tailscale is running with MagicDNS enabled, and the client laptop is tethered to a phone's cellular hotspot, non-tailscale traffic is interrupted (presumably because of DNS resolution issues?).

To Reproduce Steps to reproduce the behavior:

  1. Hotspot your phone and tether your laptop to it.
  2. With tailscale running and MagicDNS enabled, you will intermittently be unable to access non-tailscale network resources.
  3. Running sudo tailscale down and sudo systemctl stop tailscaled.service restores access to other resources.
  4. Disabling MagicDNS (but keeping tailscale running) also restores other network access.

Expected behavior tailscale shouldn't interfere with access to other network resources. For now I've disabled MagicDNS, but it's super handy, so I'd love to be able to use it!

Screenshots Just let me know if a video would be helpful -- happy to share.

Version information:

  • Laptop: Ubuntu 20.04
  • tailscale: 1.4.4, commit 64a9656c01754b6652994cb3a8ef59bce1246cfc
  • Cell phone: Pixel 5
  • Provider: Google Fi

Additional context A similar behavior occasionally occurs when I switch WiFi networks, but generally this isn't a problem with wireline WAN.

noah-built • Feb 21 '21

When tethered, with tailscaled stopped or MagicDNS off, what DNS server does your Pixel hand out to your laptop? (What does /etc/resolv.conf say?)

I wonder if MagicDNS is covering up a DNS server on your Pixel that's doing 464XLAT.

I have an old Pixel + a spare Google Fi data SIM lying around somewhere so I should try.

/cc @danderson
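One quick check: query the phone's DNS server directly (bypassing the laptop's local stub resolver) and see whether it synthesizes AAAA answers for a name that only publishes A records. A minimal Go sketch; the resolver address is a placeholder for whatever the Pixel actually hands out:

// dns64check.go: a diagnostic sketch, not Tailscale code. Any AAAA
// answer for an IPv4-only name must have been synthesized by DNS64.
package main

import (
	"context"
	"fmt"
	"net"
)

func main() {
	const phoneDNS = "172.20.10.1:53" // placeholder: use the server the phone hands out
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			var d net.Dialer
			// Ignore the OS-configured resolver; query the phone directly.
			return d.DialContext(ctx, network, phoneDNS)
		},
	}
	// bradfitz.com publishes only A records, so an AAAA answer here
	// means the phone's resolver is doing DNS64.
	ips, err := r.LookupIP(context.Background(), "ip6", "bradfitz.com")
	if err != nil {
		fmt.Println("no AAAA answer; no DNS64 synthesis apparent:", err)
		return
	}
	fmt.Println("synthesized AAAA answers (DNS64 in play):", ips)
}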

bradfitz • Feb 21 '21

This also happens with macOS and an iOS hotspot.

I tried to pin down which combination of the following conditions triggers the bug, but after enough toggling and retrying it started working even with all of them true at the same time, although it had definitely failed with all of them true before:

  1. macOS is tethering through iOS
  2. iOS has Tailscale enabled
  3. macOS has "Use corporate DNS" enabled
  4. MagicDNS is enabled
  5. the upstream DNS server is a Tailscale IP

When it's not working, DNS requests to 100.100.100.100 time out, except for .tailscale.net requests, which work correctly.
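(For context: 100.100.100.100 is the well-known local address that the MagicDNS resolver listens on.)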

When I disable corporate DNS, my phone hands out 172.20.10.1 as the DNS server.

FiloSottile • Feb 21 '21

@FiloSottile thank you for reproducing! Yep, I also see somewhat intermittent behavior here. Anecdotally, I think it might be worse when I'm on a lower quality cellular connection (H+ vs. LTE or 5G) -- though that could be coincidental, too.

@bradfitz, here's what I get when tethered to Pixel 5 using Google Fi (5G):

nrc@nrc-aero:~$ cat /etc/resolv.conf
# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0 trust-ad

Also worth adding that the Nameservers configured in tailscale's DNS settings may be a factor too, not MagicDNS per se -- I turned both off as part of my workaround.

EDIT: Thinking it over more, it seems like my local tailscale client also wasn't able to initialize correctly? Before I set up the workaround, here's what I got when running tailscale status; tailscale netcheck looked normal though. [screenshot: tailscale status output]

noah-built • Feb 21 '21

It says:

# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.

Can you run that too?

And then can you do a DNS lookup for an IPv4-only domain (such as bradfitz.com) and see what it returns? Is it v6-ified?

bradfitz • Feb 21 '21

Yep:

nrc@nrc-aero:~$ resolvectl status
Global
       LLMNR setting: no                  
MulticastDNS setting: no                  
  DNSOverTLS setting: no                  
      DNSSEC setting: no                  
    DNSSEC supported: no                  
          DNSSEC NTA: 10.in-addr.arpa     
                      16.172.in-addr.arpa 
                      168.192.in-addr.arpa
                      17.172.in-addr.arpa 
                      18.172.in-addr.arpa 
                      19.172.in-addr.arpa 
                      20.172.in-addr.arpa 
                      21.172.in-addr.arpa 
                      22.172.in-addr.arpa 
                      23.172.in-addr.arpa 
                      24.172.in-addr.arpa 
                      25.172.in-addr.arpa 
                      26.172.in-addr.arpa 
                      27.172.in-addr.arpa 
                      28.172.in-addr.arpa 
                      29.172.in-addr.arpa 
                      30.172.in-addr.arpa 
                      31.172.in-addr.arpa 
                      corp                
                      d.f.ip6.arpa        
                      home                
                      internal            
                      intranet            
                      lan                 
                      local               
                      private             
                      test                

Link 6 (tailscale0)
      Current Scopes: none
DefaultRoute setting: no  
       LLMNR setting: yes 
MulticastDNS setting: no  
  DNSOverTLS setting: no  
      DNSSEC setting: no  
    DNSSEC supported: no  

Link 5 (docker0)
      Current Scopes: none
DefaultRoute setting: no  
       LLMNR setting: yes 
MulticastDNS setting: no  
  DNSOverTLS setting: no  
      DNSSEC setting: no  
    DNSSEC supported: no  

Link 3 (wlo1)
      Current Scopes: DNS                    
DefaultRoute setting: yes                    
       LLMNR setting: yes                    
MulticastDNS setting: no                     
  DNSOverTLS setting: no                     
      DNSSEC setting: no                     
    DNSSEC supported: no                     
  Current DNS Server: 192.168.36.16          
         DNS Servers: 192.168.36.16          
                      2607:fb90:806b:12ca::8e
          DNS Domain: ~.                     

Link 2 (enp2s0)
      Current Scopes: none
DefaultRoute setting: no  
       LLMNR setting: yes 
MulticastDNS setting: no  
  DNSOverTLS setting: no  
      DNSSEC setting: no  
    DNSSEC supported: no  

And yes, seems like it is v6-ified:

nrc@nrc-aero:~$ ping bradfitz.com
PING bradfitz.com(2607:7700:0:1c:0:1:23f7:3475 (2607:7700:0:1c:0:1:23f7:3475)) 56 data bytes
64 bytes from 2607:7700:0:1c:0:1:23f7:3475 (2607:7700:0:1c:0:1:23f7:3475): icmp_seq=1 ttl=52 time=191 ms
64 bytes from 2607:7700:0:1c:0:1:23f7:3475 (2607:7700:0:1c:0:1:23f7:3475): icmp_seq=2 ttl=52 time=163 ms
^C
--- bradfitz.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 163.163/176.832/190.501/13.669 ms
nrc@nrc-aero:~$ host bradfitz.com
bradfitz.com has address 35.247.52.117
bradfitz.com has IPv6 address 64:ff9b::23f7:3475
bradfitz.com mail is handled by 50 aspmx3.googlemail.com.
bradfitz.com mail is handled by 30 alt2.aspmx.l.google.com.
bradfitz.com mail is handled by 10 aspmx.l.google.com.
bradfitz.com mail is handled by 40 aspmx2.googlemail.com.
bradfitz.com mail is handled by 20 alt1.aspmx.l.google.com.
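(Decoding those answers: 64:ff9b::/96 is the RFC 6052 well-known NAT64 prefix, and the low 32 bits of both the ping target and the synthesized AAAA record, 23f7:3475, embed the IPv4 address: 0x23 = 35, 0xf7 = 247, 0x34 = 52, 0x75 = 117, i.e. 35.247.52.117, matching the A record. So Fi's resolver is clearly doing DNS64 synthesis, and the 2607:7700:... answer suggests the carrier also uses a NAT64 prefix of its own.)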

However, it's worth highlighting that I'm now in SF with pretty good cellular WAN (this originally came to my team's attention while working in more remote areas of CA, and possibly in Australia as well). After flipping the various toggles here a bunch of times, I'm still seeing the behavior, but much less often.

I do have a screencast I recorded in a more remote area yesterday though, where the behavior was deterministic, if that would be helpful.

noah-built • Feb 21 '21

Thanks! That's enough info. It's pretty clear what's happening now.

bradfitz • Feb 21 '21

Great, thanks for taking a look so fast!

noah-built • Feb 21 '21

Just following up from Twitter:

I'm experiencing the same symptoms as @FiloSottile from a network that is a T-Mobile LTE modem -> Google WiFi router -> {macOS, iPhone}, without doing any kind of tethering.

jzelinskie • Feb 21 '21

Some links:

  • https://sites.google.com/site/tmoipv6/464xlat
  • https://dan.drown.org/android/clat/
  • https://en.wikipedia.org/wiki/IPv6_transition_mechanism#DNS64 / https://developers.google.com/speed/public-dns/docs/dns64

bradfitz • Feb 22 '21

I think what I'm seeing might not be the same issue (in which case, happy to open a separate one).

Today I was connected through my iOS hotspot. MagicDNS was active and the upstream was 100.74.42.19.

I switched "Use corporate DNS" on.

  • dig example.com (as well as general connectivity) started timing out
  • dig foo.filippo.io.beta.tailscale.net still worked
  • dig @100.74.42.19 example.com timed out at first
  • ping 100.74.42.19 worked
  • dig @100.74.42.19 example.com worked after the ping
  • dig example.com was still broken
  • dig damogran.filippo.io.beta.tailscale.net still worked

Then I turned "Use corporate DNS" off, turned MagicDNS off, and turned "Use corporate DNS" back on.

As expected, dig example.com was now working, contacting 100.74.42.19.

It looks like the problem was specifically with the MagicDNS daemon contacting 100.74.42.19, but I can't explain why a direct query also timed out at first.

My hotspot DNS was NOT doing DNS64.

FiloSottile • Feb 27 '21

@FiloSottile I think the issue you were running into is https://github.com/tailscale/tailscale/issues/2224: at the time, MagicDNS wasn't able to send DNS queries to 100.x.y.z addresses due to echo-killing rules on several platforms. That has since been fixed; using Tailscale addresses as DNS servers works on all platforms as of 1.10.x.

Leaving the issue open, as some of the earlier comments appear to describe a different problem, possibly involving 464XLAT.

DentonGentry • Aug 02 '21

https://github.com/tailscale/tailscale/issues/1634 and https://github.com/tailscale/tailscale/issues/1377 are somewhat similar in that they also need a way for MagicDNS to detect that it's in an environment where the set of upstream DNS servers it has been configured to use cannot possibly work. We might consider solving this similarly to what browsers do (like Chrome's generate_204 connectivity probe):

  • periodically, when there is some other DNS lookup to be done anyway, also look up a name whose answer we know with certainty
  • if the answer comes back different, use that to figure out whether we're in a captive portal or DNS64 environment (see the sketch below)
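A minimal sketch of that probe idea; the probe name and its expected answer below are hypothetical placeholders, not real Tailscale infrastructure:

// knownanswer.go: sketch of a known-answer DNS probe. If the answer
// matches, DNS is healthy; if it comes back inside the well-known
// NAT64 prefix, we're behind DNS64; anything else smells like a
// captive portal.
package main

import (
	"context"
	"fmt"
	"net"
	"net/netip"
)

var (
	probeName    = "dns-probe.example.com"               // hypothetical probe record
	expectedAddr = netip.MustParseAddr("203.0.113.7")    // its known, published answer
	nat64Prefix  = netip.MustParsePrefix("64:ff9b::/96") // RFC 6052 well-known prefix
)

func classifyDNS(ctx context.Context) string {
	ips, err := net.DefaultResolver.LookupIP(ctx, "ip", probeName)
	if err != nil {
		return "probe lookup failed: " + err.Error()
	}
	healthy, dns64 := false, false
	for _, ip := range ips {
		a, ok := netip.AddrFromSlice(ip)
		if !ok {
			continue
		}
		a = a.Unmap()
		switch {
		case a == expectedAddr:
			healthy = true
		case a.Is6() && nat64Prefix.Contains(a):
			dns64 = true
		}
	}
	switch {
	case dns64:
		return "DNS64 environment: answers are being synthesized"
	case healthy:
		return "upstream DNS looks healthy"
	default:
		return "unexpected answers: possible captive portal"
	}
}

func main() {
	fmt.Println(classifyDNS(context.Background()))
}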

DentonGentry • Feb 01 '22

We faced this problem too, and settled on the RFC 7050 well-known name, ipv4only.arpa, to figure out whether the underlying network relies on DNS64 to NAT v4 traffic over v6 (ref).
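For the mechanics: ipv4only.arpa is guaranteed to carry only the A records 192.0.0.170 and 192.0.0.171, so any AAAA answer for it must have been synthesized, and the NAT64 prefix can be read out of the synthesized address. A minimal Go sketch, assuming the common /96 prefix layout (RFC 7050 also allows other prefix lengths, which this skips):

// rfc7050.go: a minimal sketch of RFC 7050 DNS64/NAT64 discovery.
package main

import (
	"context"
	"fmt"
	"net"
	"net/netip"
)

func main() {
	// ipv4only.arpa has only A records (192.0.0.170 / 192.0.0.171),
	// so a AAAA answer can only come from a DNS64 resolver.
	ips, err := net.DefaultResolver.LookupIP(context.Background(), "ip6", "ipv4only.arpa")
	if err != nil || len(ips) == 0 {
		fmt.Println("no synthesized AAAA: network does not appear to use DNS64")
		return
	}
	for _, ip := range ips {
		a, ok := netip.AddrFromSlice(ip)
		if !ok {
			continue
		}
		// With a /96 prefix the IPv4 address sits in the last 32 bits;
		// zero them out to recover the NAT64 prefix itself.
		b := a.As16()
		b[12], b[13], b[14], b[15] = 0, 0, 0, 0
		fmt.Printf("DNS64 detected; NAT64 prefix: %v\n",
			netip.PrefixFrom(netip.AddrFrom16(b), 96))
	}
}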

ignoramous • Sep 16 '22