Don't use IPv6 DNS upstreams when there's no IPv6 connectivity
If an admin specifies an IPv6 upstream, clients without IPv6 connectivity will fail to reach it and their DNS queries will time out. The OS (macOS in this case) may never mark that server as unresponsive, causing DNS resolution to fail for the user.
Source: https://firezonehq.slack.com/archives/C069H865MHP/p1724093936014149
Assigning @thomaseizinger to confirm whether the recent forwarding change could affect this.
I have IPv4-only due to my ISP. I just tested setting only an IPv6 server in the portal and that breaks the connection. But as soon as an IPv4 server is added, everything works normally.
Can it be the case that only an IPv6 DNS server was set?
Assigning @thomaseizinger to confirm whether the recent forwarding change could affect this.
No, we already assigned them 1-to-1 previously, creating a sentinel IPv4 DNS server for every configured IPv4 one and likewise for IPv6.
But we do know that missing upstream IPv4 or IPv6 DNS servers will break things, and we agreed to solve that on the portal side: https://github.com/firezone/firezone/issues/5115
We can't control which DNS server is picked by the operating system. We create a 1-to-1 mapping for each configured DNS server.
We could start doing NAT64 or NAT46 for queries to upstream DNS servers, but that seems unnecessary. I'd recommend that admins configure at least an IPv4 DNS server and ideally also an IPv6 DNS server for the case where a client is IPv6-only.
@jameswinegar - if you wouldn't mind, could you confirm whether you had both IPv4 and IPv6 DNS upstream servers defined when this issue occurred? Or only IPv6?
If the former, this can be fixed with better docs/help text; if the latter, the fix will be more involved.
Feedback from customer is that he had both IPv4 and IPv6 upstreams defined when this issue occurred.
I have IPv4-only due to my ISP. I just tested setting only an IPv6 server in the portal and that breaks the connection. But as soon as an IPv4 server is added, everything works normally.
Can it be the case that only an IPv6 DNS server was set?
The behavior here seems platform-dependent. macOS appears to try both DNS upstreams, and it is not deterministic which one it settles on.
A first attempt at solving this could be to condition on the quinn-udp error. What I'd like to avoid is building state that tracks which upstreams are problematic and avoids using them, similar to what libc does.
I believe what may have happened here is that hickory was handling this for us. That would explain why this only recently became an issue even though the customer's environment has not changed.
Related: #6371
If we don't have a valid IPv4 or IPv6 socket we shouldn't advertise these as sentinels.
If we don't have a valid IPv4 or IPv6 socket we shouldn't advertise these as sentinels.
Unfortunately, that isn't as easy to determine as we thought. Whether or not a socket is "valid" essentially depends on whether we can route a packet to a particular host. So for DNS queries, it depends on whether we can reach that particular DNS server.
I am thinking that #6428 might be the better solution here. We blackhole the ICMP error right now when, in reality, we should map errors 1-to-1 as best we can:
- Map each DNS response (we do that today)
- Don't receive any kind of error => Don't send a response
- Receive an ICMP error => Send an ICMP error to the app
That way, the behaviour with Firezone enabled should be as close as possible to the behaviour without Firezone.
The other thing we can build is some kind of circuit-breaker. If we notice that queries to a certain DNS server keep failing or it doesn't respond, we can disable it.
Receive an ICMP error => Send an ICMP error to the app
I think in theory this makes sense, but in all my years of app development, I don't think I've ever handled an ICMP error that resulted from opening a socket. I suppose the kernel may read these and mark the DNS server unusable for us, but we should test (on macOS especially).
Could we do this maybe?
Upon receiving set_dns (from FFI or after init), we query all DNS servers for A and AAAA records of api.firezone.dev one-by-one. Then we act upon the replies in the following way:
- If A or AAAA times out, log an error and continue, marking that server as unhealthy
- If both are available and return valid IPs, mark it as healthy, log debug, and continue
- If they return different IPs than the previous server did, log a warning, save the new response IPs, and continue
We perform the above each time set_dns is called. This has the benefit that we can use these new IPs in PhoenixChannel if needed to update the cached IPs we resolved there at session start.
I tried this path and it gets quite complicated. I think it was about 500 lines of extra code (without handling of (2) below).
- It means that set_dns -> TunInterfaceUpdated is now an async operation, requiring us to build a (mini) DNS client state machine, with timeouts etc.
- The DNS servers may be a CIDR resource, so we need to make sure we send our DNS queries through the tunnel and not directly from the user's device. It is tricky to unify these code paths because we currently only handle this when we receive the DNS query as an IP packet from the TUN interface. Whilst it is possible to re-arrange the code to perform the same logic when we generate such a query ourselves, it will require some changes.
Responding with an error is much simpler. If that doesn't fix the problem, we can additionally collect stats about the number of failures associated with a certain upstream DNS server and disable it after a certain threshold. Initially, we can do that for the reported ICMP errors; later, we can add other sources, like timeouts of forwarded queries (which we already track because we need to remember the original source socket that sent the query).
@thomaseizinger Ah, I see. Ok, sounds good.
Maybe the combination of:
- Don't use a stack if we can't bind a socket to it
- Report ICMP Destination Unreachable

will be enough.
Noting this here just in case it's helpful: https://en.wikipedia.org/wiki/Happy_Eyeballs
Moved this to backlog because we couldn't reproduce it.
With https://github.com/firezone/firezone/pull/6999 in place, we should be able to do this quite easily. We only need to fix https://github.com/quinn-rs/quinn/issues/1971 in order to immediately fail the query so we can respond with a SERVFAIL.
So just an update here (at least for macOS), I think we did confirm that this was the case with the particular customer:
Can it be the case that only an IPv6 DNS server was set?
I think the bulk of this issue here was solved with #6407, so I'll bump the prio down a bit.
I.e. the OS will choose not to use DNS servers that are unresponsive.
Turns out this is still an issue with Windows.
See https://firezonehq.slack.com/archives/C06L41XN05T/p1729515364242069
I recently submitted https://github.com/quinn-rs/quinn/pull/2017 which will allow us to fail the DNS query instantly in that case and report back SERVFAIL. Wondering if that will be enough! If not, we can always try the ICMP approach too.