firezone Support DoH resolvers

[ ] Portal changes to support well-known DoH upstreams
[ ] Connlib changes

refs #4667

Apr 17 '24 21:04 jamilbk

refs #4249

Jul 04 '24 17:07 jamilbk

@thomaseizinger No major urgency on this, just getting it on your radar since you're already in the DNS upstream headspace. Customers have requested this a couple of times, but the more I think about it, the more I realize how essential it is for our architecture: signing into Firezone effectively means you only get UDP/53 for your whole system, regardless of what level of DNS security you were enjoying before.

Aug 27 '24 05:08 jamilbk

Currently, connlib only supports DNS over UDP/53, which means we are bound to a certain maximum size of DNS responses that we can send to the OS / application. If we start doing DoH, the upstream DNS server may return a response that is bigger than what we can handle.

This is a common problem for DNS servers, which is why DNS has the TC bit: a server sets it to signal that it had to truncate the response. Generally, it is advised that responses with the TC bit set should be discarded and the client should retry over TCP instead: https://www.rfc-editor.org/rfc/rfc5966#section-3. The measurements in https://blog.apnic.net/2024/07/15/revisiting-dns-and-udp-truncation/ suggest that most systems are following this recommendation:

Some 97.33% of systems will re-query over TCP if they receive a UDP response with the TC bit.

This is kind of a problem because we currently don't support DNS over TCP and it requires us to implement a TCP state machine.
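For reference, a minimal sketch of how the TC bit can be read from a raw DNS message. The function name `is_truncated` is hypothetical and not taken from connlib; the bit position follows the header layout in RFC 1035, section 4.1.1.

```rust
/// Length of the fixed DNS message header (RFC 1035, section 4.1.1).
const DNS_HEADER_LEN: usize = 12;

/// Mask of the TC (truncation) flag within the third header byte.
/// That byte is laid out as: QR | Opcode (4 bits) | AA | TC | RD.
const TC_MASK: u8 = 0b0000_0010;

/// Returns `true` if `message` carries a full DNS header with the TC bit set.
/// Anything shorter than a header is treated as "not truncated" here;
/// real code would likely reject such input earlier.
pub fn is_truncated(message: &[u8]) -> bool {
    message.len() >= DNS_HEADER_LEN && message[2] & TC_MASK != 0
}
```

A stub resolver that sees `is_truncated` return `true` is the one that would normally retry over TCP, which is the behaviour the measurements above describe.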

Most DNS queries however will fit in the UDP response payload. To get some implementation of DoH out the door, I think we can do the following:

When DoH is active, forward each DNS query via DoH.
If the response fits, just write it back.
If the response does not fit, log a warning / error and reply with a server-failure (this would be awesome to record in Sentry, so we get an idea of how many customers run into this issue)

At a later point, we can implement a TCP state machine and also answer DNS queries over TCP.

An alternative implementation could be to truncate the response but not set the TC bit, to avoid re-querying over TCP. However, I am kind of worried that this will break applications in other subtle ways. I'd rather fail the entire DNS query instead.
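The proposed fallback (write the DoH response back if it fits, otherwise answer with a server failure) could look roughly like the sketch below. All names are hypothetical, and the 512-byte limit is the classic RFC 1035 UDP payload size without EDNS(0); the limit connlib actually enforces may differ. The sketch assumes the query carries exactly one uncompressed question.

```rust
/// Classic maximum UDP DNS payload without EDNS(0) (RFC 1035). Assumed limit.
pub const MAX_UDP_PAYLOAD: usize = 512;

const HEADER_LEN: usize = 12;
const RCODE_SERVFAIL: u8 = 2;

/// Returns the offset just past the single question, or `None` if malformed.
fn question_end(query: &[u8]) -> Option<usize> {
    if query.len() < HEADER_LEN || u16::from_be_bytes([query[4], query[5]]) != 1 {
        return None; // too short, or QDCOUNT is not 1
    }
    let mut pos = HEADER_LEN;
    loop {
        let len = *query.get(pos)? as usize;
        pos += 1;
        if len == 0 {
            break; // root label terminates the name
        }
        if len > 63 {
            return None; // compression pointers are not expected in a query
        }
        pos += len;
    }
    let end = pos + 4; // QTYPE + QCLASS
    (end <= query.len()).then_some(end)
}

/// Builds a SERVFAIL response echoing the ID and question of `query`.
pub fn servfail(query: &[u8]) -> Option<Vec<u8>> {
    let end = question_end(query)?;
    let mut resp = query[..end].to_vec();
    resp[2] = (resp[2] | 0x80) & !0x02; // set QR, clear TC, keep opcode and RD
    resp[3] = 0x80 | RCODE_SERVFAIL; // RA set (we act as a forwarder), RCODE = 2
    resp[6..12].fill(0); // no answer, authority or additional records
    Some(resp)
}

/// Picks the bytes to write back over UDP for a response received via DoH.
pub fn reply_for(query: &[u8], doh_response: Vec<u8>, max_payload: usize) -> Option<Vec<u8>> {
    if doh_response.len() <= max_payload {
        return Some(doh_response);
    }
    // This is the spot where a warning would be logged / reported to Sentry.
    servfail(query)
}
```

Returning `None` for a malformed query leaves the choice of dropping or otherwise handling it to the caller.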

Sep 09 '24 17:09 thomaseizinger

For DNS servers that are IP/CIDR resources, I'd suggest that DoH should not have any effect for now. To implement that, we'd have to write a client-side TCP state machine too (answering TCP queries only requires the server half). My suggestion would be to detect that in the portal: If you configure your upstream DNS servers as resources, enabling DoH should be greyed out in the portal.

Sep 09 '24 17:09 thomaseizinger

Just wanted to say nice work finding real-world data on how this behaves in the wild.

If you configure your upstream DNS servers as resources, enabling DoH should be greyed out in the portal.

Related: #4249

Sep 11 '24 04:09 jamilbk

Related: #4249

The API between the portal & connlib could be something like:

{
    "dns_transport": "udp / https" // udp would be default value
}

This would be part of the "interface", i.e. next to where we receive the dns servers. It would apply to all upstream DNS servers (never locally set ones).
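On the connlib side, the field could map onto a small enum. This is only a sketch: `DnsTransport` and `transport_from_field` are hypothetical names, and the real message types would presumably be deserialized with serde rather than parsed by hand.

```rust
use std::str::FromStr;

/// Transport used for all portal-provided upstream DNS servers.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum DnsTransport {
    /// Plain DNS over UDP/53, the default when the field is absent.
    #[default]
    Udp,
    /// DNS over HTTPS.
    Https,
}

impl FromStr for DnsTransport {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "udp" => Ok(Self::Udp),
            "https" => Ok(Self::Https),
            other => Err(format!("unknown dns_transport: {other}")),
        }
    }
}

/// Interprets the optional `dns_transport` field of the interface config.
pub fn transport_from_field(field: Option<&str>) -> Result<DnsTransport, String> {
    match field {
        Some(value) => value.parse(),
        None => Ok(DnsTransport::default()),
    }
}
```

Whether an unknown value should be an error or fall back to UDP is an open design choice; the sketch rejects it.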

For the portal, I think it would make sense to start with a fixed set of servers that we know support DoH. We can always let customers specify more later.

The scope for now would be:

No support for DNS over TCP in connlib's resolver
Check size of response, map back into UDP packet if possible
Respond with server error if not possible

Out-of-scope:

DNS responses bigger than what UDP DNS allows
DoH as a resource

If we agree that this would be useful to build, I can start on that.

Sep 12 '24 04:09 thomaseizinger

Yes, I think this all makes sense. So, then:

Custom upstream resolvers are limited to plain UDP. It's important we continue to support these as Resources because that's the only path to a customer having protected DNS queries for their own upstreams.
Support the well-known, public DoH resolvers, allowing the customer to select them as upstreams.

How do we resolve the DoH hostname initially? One option would be to use that service's UDP resolver to do it. So, ask 1.1.1.1 for the A record for cloudflare-dns.com (https://cloudflare-dns.com/dns-query) for example.

Sep 13 '24 18:09 jamilbk

How do we resolve the DoH hostname initially? One option would be to use that service's UDP resolver to do it. So, ask 1.1.1.1 for the A record for cloudflare-dns.com (https://cloudflare-dns.com/dns-query) for example.

Surely there must be a standardised approach? Can the portal resolve it for us? Or are there maybe well-known IPs?

Sep 13 '24 20:09 thomaseizinger

How do we resolve the DoH hostname initially? One option would be to use that service's UDP resolver to do it. So, ask 1.1.1.1 for the A record for cloudflare-dns.com (https://cloudflare-dns.com/dns-query) for example.

Surely there must be a standardised approach? Can the portal resolve it for us? Or are there maybe well-known IPs?

I suppose you could argue the service's advertised IPs serve as well-knowns?

Sep 13 '24 20:09 jamilbk

I've done some research and it appears the recommended approach is to use the system resolver to initially resolve the DNS server hostname.

The issue here is that by the time we receive the DNS config from the portal, connlib is already up and so any attempt to resolve DNS using the system APIs will just route back to us.

We could intercept that but then we'd have to build our own UDP DNS client with retries and stuff.

In connlib, we already have a concept of "known hosts", i.e. a static mapping of domain names to IPs that is always consulted first before attempting to resolve a name.

We could tap into that abstraction and have the portal send us "fixed DNS records" on every init. This would include the DNS records of all DoH servers. When DoH is active, connlib would then consult this map for the target IP and establish an HTTPS connection to it.
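A rough sketch of that lookup, assuming the fixed records arrive as a plain name-to-IPs map. `KnownHosts`, `doh_host` and `doh_targets` are hypothetical names, not connlib's actual types, and the URL handling is deliberately naive (no IPv6 literals, no userinfo).

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Static name -> IP mapping, consulted before any real resolution.
pub type KnownHosts = HashMap<String, Vec<IpAddr>>;

/// Extracts the host from an `https://host[:port]/path` URL.
pub fn doh_host(url: &str) -> Option<&str> {
    let rest = url.strip_prefix("https://")?;
    let authority = rest.split('/').next()?;
    let host = authority.split(':').next()?;
    (!host.is_empty()).then_some(host)
}

/// Looks up the IPs to connect to for a DoH URL, using only the fixed records.
/// `None` means the portal did not send a record for this host.
pub fn doh_targets<'a>(known: &'a KnownHosts, url: &str) -> Option<&'a [IpAddr]> {
    let host = doh_host(url)?;
    known.get(host).map(|ips| ips.as_slice())
}
```

Because the lookup never touches the system resolver, it avoids the loop where resolving the DoH hostname routes back into connlib.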

Sep 14 '24 17:09 thomaseizinger

Support the well-known, public DoH resolvers, allowing the customer to select them as upstreams.

This will be a UI-only limitation, right? For the actual API between connlib and the portal, we will simply send domain-names, right? In fact, I think the DoH-flag should be separate from the list of DNS servers.

Sep 14 '24 17:09 thomaseizinger

Also, I am afraid our current representation of DNS servers in the API is not forwards-compatible with new "kinds" of DNS servers: https://github.com/firezone/firezone/blob/3e30bab965a2f297e087db60b3396eb124f4ae29/rust/connlib/shared/src/messages.rs#L253-L257

(Same issue we've had with the Internet Resource type; new variants aren't allowed by default)

Plus, we won't allow a mix of UDP DNS servers and DoH servers, right? It is either one or the other so it doesn't really make sense to differentiate this on an item-level of the upstream_dns field.

We should split this such that we have different fields for the different kinds of DNS configuration. For example:

{
    "upstream_dns": [], // legacy field
    "upstream_udp_dns": [
        {
            "ip": "1.1.1.1",
            "port": 53
        }
    ],
    "upstream_doh": [
        {
            "url": "https://cloudflare-dns.com/dns-query"
        }
    ]
}

We would parse this such that DoH takes priority over UDP DNS servers (meaning the above example wouldn't make much sense unless we also allow mixing DNS server configuration).
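The priority rule could be expressed like this. It is a sketch only: `DnsConfig`, `Upstream` and `select_upstream` are hypothetical names, the legacy `upstream_dns` field is left out, and falling back to the system default when both lists are empty is an assumption.

```rust
use std::net::SocketAddr;

/// The new-style DNS fields of the interface config, after deserialization.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct DnsConfig {
    pub upstream_udp_dns: Vec<SocketAddr>,
    pub upstream_doh: Vec<String>,
}

/// The single kind of upstream that is actually used.
#[derive(Debug, PartialEq, Eq)]
pub enum Upstream<'a> {
    Doh(&'a [String]),
    Udp(&'a [SocketAddr]),
    SystemDefault,
}

/// DoH takes priority over UDP servers; the two kinds are never mixed.
pub fn select_upstream(config: &DnsConfig) -> Upstream<'_> {
    if !config.upstream_doh.is_empty() {
        Upstream::Doh(&config.upstream_doh)
    } else if !config.upstream_udp_dns.is_empty() {
        Upstream::Udp(&config.upstream_udp_dns)
    } else {
        Upstream::SystemDefault
    }
}
```

With the JSON example above, `select_upstream` would return the DoH entry and ignore `1.1.1.1:53`.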

Sep 14 '24 18:09 thomaseizinger

I've done some research and it appears the recommended approach is to use the system resolver to initially resolve the DNS server hostname.

These don't really change though, right? Would we use the portal to resolve these on a periodic interval and send the mapping down to the client?

I suppose that could fail if there is a network partition and since the resolution is centralized it would affect all users.

Would it not be more resilient to resolve the initial DoH host from the client using the advertised IPs of the public dns services themselves?

Correct that selecting the upstream will have only the following exclusive user-facing choices:

custom-udp (list)
public-doh provider (single)

Sep 15 '24 02:09 jamilbk

I suppose that could fail if there is a network partition and since the resolution is centralized it would affect all users.

Would it not be more resilient to resolve the initial DoH host from the client using the advertised IPs of the public dns services themselves?

It creates a fair amount of work because we now have to build a reliable DNS resolver and integrate that state machine into connlib, making the entire process of updating DNS servers async. It is doable but not an easy way of managing state.

My assumption would be that a name resolution in the portal per-request would hit some local DNS cache and thus is not very expensive.

Sep 15 '24 07:09 thomaseizinger

Noting down design decisions from standup:

When Internet Resource is enabled ("Internet Security"), all queries should ultimately go to Internet Resource gateways, unless DoH is enabled - add a warning to the portal if they select this
If custom upstream resolvers are selected, they'll be sent through the tunnel if Internet Resource is enabled (or an IP or CIDR resource matches)
If nothing is enabled (client system default), then DNS queries get routed outside the tunnel even if the Internet Resource is enabled, because the client's system DNS is not reachable by the gateway
Block adding DNS Resources that match the addresses of public DoH resolvers
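One possible reading of the routing decisions above, written as a decision function. All names are hypothetical, and treating DoH queries as bypassing the Internet Resource gateways is my interpretation of "unless DoH is enabled", not something the notes state outright.

```rust
/// The user-facing upstream choice made in the portal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpstreamChoice {
    SystemDefault,
    CustomUdp,
    PublicDoh,
}

/// Where a DNS query ends up being sent.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    Tunnel,
    Direct,
}

/// Decides the route for upstream DNS queries.
/// `resource_matches` means an IP or CIDR resource covers the resolver address.
pub fn route_dns(choice: UpstreamChoice, internet_resource: bool, resource_matches: bool) -> Route {
    match choice {
        // Custom resolvers go through the tunnel when something routes them there.
        UpstreamChoice::CustomUdp if internet_resource || resource_matches => Route::Tunnel,
        UpstreamChoice::CustomUdp => Route::Direct,
        // Assumption: DoH queries do not go to the Internet Resource gateways.
        UpstreamChoice::PublicDoh => Route::Direct,
        // The client's system DNS is not reachable by the gateway.
        UpstreamChoice::SystemDefault => Route::Direct,
    }
}

/// The portal should warn when DoH is selected alongside the Internet Resource.
pub fn needs_portal_warning(choice: UpstreamChoice, internet_resource: bool) -> bool {
    choice == UpstreamChoice::PublicDoh && internet_resource
}
```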

portal changes

See https://github.com/firezone/firezone/pull/6905#issuecomment-2403605014

Oct 09 '24 22:10 jamilbk

My assumption would be that a name resolution in the portal per-request would hit some local DNS cache and thus is not very expensive.

Resolution for this: Hardcode the list of IPs in the code.

Oct 09 '24 23:10 thomaseizinger