Support DoH resolvers
- [ ] Portal changes to support well-known DoH upstreams
- [ ] Connlib changes
refs #4667
refs #4249
@thomaseizinger No major urgency on this, just getting it on your radar since you're already in the DNS upstream headspace. Customers have requested this a couple of times, but the more I think about it the more I realize how essential it is for our architecture, since signing into Firezone effectively means you only get UDP/53 for your whole system regardless of what level of DNS security you were enjoying prior.
Currently, connlib only supports DNS over UDP/53, which means we are bound to a certain maximum size of DNS responses that we can send to the OS / application. If we start doing DoH, the upstream DNS server may return a response to us that is bigger than what we can handle.
This is a common problem for DNS servers, which is why DNS has the TC bit: a server sets it to signal that it had to truncate the response. Generally, it is advised that responses with the TC bit set should be discarded and the client should retry over TCP instead: https://www.rfc-editor.org/rfc/rfc5966#section-3. The measurements in https://blog.apnic.net/2024/07/15/revisiting-dns-and-udp-truncation/ suggest that most systems are following this recommendation:
> Some 97.33% of systems will re-query over TCP if they receive a UDP response with the TC bit.

This is kind of a problem because we currently don't support DNS over TCP and it requires us to implement a TCP state machine.
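For reference, checking the TC bit on a raw DNS message is cheap. A minimal sketch, assuming we have the message as a byte slice; `is_truncated` is a hypothetical helper name, not something that exists in connlib:

```rust
/// Returns `true` if the raw DNS message has the TC (truncation) bit set.
///
/// Hypothetical helper for illustration only.
fn is_truncated(message: &[u8]) -> bool {
    // A DNS header is 12 bytes. The third byte holds QR, Opcode, AA, TC and RD;
    // TC is the second-lowest bit of that byte.
    const HEADER_LEN: usize = 12;
    const TC_MASK: u8 = 0b0000_0010;

    message.len() >= HEADER_LEN && message[2] & TC_MASK != 0
}
```

A message shorter than a full header is treated as "not truncated" here; a real implementation would more likely reject it as malformed.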
Most DNS queries however will fit in the UDP response payload. To get some implementation of DoH out the door, I think we can do the following:
- When DoH is active, forward each DNS query via DoH.
- If the response fits, just write it back.
- If the response does not fit, log a warning / error and reply with a server-failure (this would be awesome to record in Sentry, so we get an idea of how many customers run into this issue)
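The steps above could look roughly like the following. This is only a sketch under assumptions: `UdpReply`, `servfail_for` and `map_doh_response` are made-up names, the maximum payload size is passed in by the caller (512 bytes for plain DNS, or whatever the client advertised via EDNS), and the SERVFAIL reply is built by echoing the query with the QR bit and RCODE set, which is a simplification.

```rust
/// What we write back to the OS / application over UDP.
#[derive(Debug, PartialEq)]
enum UdpReply {
    /// The DoH response fits and is forwarded unchanged.
    Forward(Vec<u8>),
    /// The DoH response was too large; we answer with a server failure.
    ServFail(Vec<u8>),
}

const DNS_HEADER_LEN: usize = 12;
const RCODE_SERVFAIL: u8 = 2;

/// Builds a SERVFAIL reply by echoing the query (same ID and question),
/// setting the QR bit and replacing the RCODE.
fn servfail_for(query: &[u8]) -> Option<Vec<u8>> {
    if query.len() < DNS_HEADER_LEN {
        return None; // Not even a full DNS header; nothing sensible to reply with.
    }

    let mut reply = query.to_vec();
    reply[2] |= 0x80; // QR = 1: this is a response.
    reply[3] = (reply[3] & 0xF0) | RCODE_SERVFAIL; // Keep the upper bits, set RCODE.

    Some(reply)
}

/// Maps a DoH response back onto the UDP path, or fails the whole query.
fn map_doh_response(query: &[u8], doh_response: Vec<u8>, max_udp_payload: usize) -> Option<UdpReply> {
    if doh_response.len() <= max_udp_payload {
        return Some(UdpReply::Forward(doh_response));
    }

    // Stand-in for a proper warning / Sentry event.
    eprintln!(
        "DoH response of {} bytes exceeds UDP limit of {} bytes",
        doh_response.len(),
        max_udp_payload
    );

    servfail_for(query).map(UdpReply::ServFail)
}
```

Returning `None` for a malformed query leaves it to the caller to decide whether to drop the packet silently.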
At a later point, we can implement a TCP state machine and also answer DNS queries over TCP.
An alternative implementation could be to truncate the response but not set the TC bit, to avoid re-querying over TCP. However, I am kind of worried that this will break applications in other subtle ways. I'd rather fail the entire DNS query instead.
For DNS servers that are IP/CIDR resources, I'd suggest that DoH should not have any effect for now. To implement that, we'd have to write a client-side TCP state machine too (answering TCP queries only requires the server-half). My suggestion would be that we detect that in the portal: If you configure your upstream DNS servers as resources, enabling DoH should be greyed out in the portal.
Just wanted to say nice work finding real-world data on how this behaves in the wild.
> If you configure your upstream DNS servers as resources, enabling DoH should be greyed out in the portal.
Related: #4249
The API between the portal & connlib could be something like:
{
"dns_transport": "udp / https" // udp would be default value
}
This would be part of the "interface", i.e. next to where we receive the dns servers. It would apply to all upstream DNS servers (never locally set ones).
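On the connlib side, the field could map to something like the enum below. This is a hedged sketch, not the actual message type: `DnsTransport` and `transport_from_field` are hypothetical names, and falling back to UDP for unknown values is an assumption about how we'd want to stay forwards-compatible.

```rust
use std::str::FromStr;

/// How connlib talks to upstream DNS servers (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum DnsTransport {
    /// Plain DNS over UDP/53; the default when the field is absent.
    #[default]
    Udp,
    /// DNS over HTTPS.
    Https,
}

impl FromStr for DnsTransport {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "udp" => Ok(Self::Udp),
            "https" => Ok(Self::Https),
            other => Err(format!("unknown dns_transport: {other}")),
        }
    }
}

/// Interprets the optional `dns_transport` field of the interface config.
/// A missing or unrecognised value falls back to UDP (assumption).
fn transport_from_field(field: Option<&str>) -> DnsTransport {
    match field {
        None => DnsTransport::default(),
        Some(value) => value.parse::<DnsTransport>().unwrap_or_default(),
    }
}
```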
For the portal, I think it would make sense to start with a fixed set of servers that we know supports DoH. We can always let customers specify more later.
The scope for now would be:
- No support for DNS over TCP in connlib's resolver
- Check size of response, map back into UDP packet if possible
- Respond with server error if not possible
Out-of-scope:
- DNS responses bigger than what UDP DNS allows
- DoH as a resource
If we agreed that this would be useful to build, I can start on that.
Yes, I think this all makes sense. So, then:
- Custom upstream resolvers are limited to plain UDP. It's important we continue to support these as Resources because that's the only path to a customer having protected DNS queries for their own upstreams.
- Support the well-known, public DoH resolvers, allowing the customer to select them as upstreams.
How do we resolve the DoH hostname initially? One option would be to use that service's UDP resolver to do it. So, ask 1.1.1.1 for the A record for cloudflare-dns.com (https://cloudflare-dns.com/dns-query) for example.
> How do we resolve the DoH hostname initially? One option would be to use that service's UDP resolver to do it. So, ask 1.1.1.1 for the A record for cloudflare-dns.com (https://cloudflare-dns.com/dns-query) for example.
Surely there must be a standardised approach? Can the portal resolve it for us? Or are there maybe well-known IPs?
> > How do we resolve the DoH hostname initially? One option would be to use that service's UDP resolver to do it. So, ask 1.1.1.1 for the A record for cloudflare-dns.com (https://cloudflare-dns.com/dns-query) for example.
>
> Surely there must be a standardised approach? Can the portal resolve it for us? Or are there maybe well-known IPs?
I suppose you could argue the service's advertised IPs serve as well-knowns?
I've done some research and it appears the recommended approach is to use the system resolver to initially resolve the DNS server hostname.
The issue here is that by the time we receive the DNS config from the portal, connlib is already up and so any attempt to resolve DNS using the system APIs will just route back to us.
We could intercept that but then we'd have to build our own UDP DNS client with retries and stuff.
In connlib, we already have a concept of "known hosts", i.e. a static mapping of domain names to IPs that is always consulted first before attempting to resolve a name.
We could tap into that abstraction and have the portal send us "fixed DNS records" on every init. This would include the DNS records of all DoH servers. When DoH is active, connlib would then consult this map for the target IP and establish an HTTPS connection to it.
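To make the idea concrete, here is a rough sketch of such a lookup. None of this is connlib's actual "known hosts" code: `KnownHosts` and `doh_host` are illustrative names, the URL parsing is deliberately naive (no port or IPv6-literal handling), and the IPs in the usage note are documentation placeholders rather than real resolver addresses.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Static mapping of domain names to IPs, consulted before any real resolution.
struct KnownHosts {
    records: HashMap<String, Vec<IpAddr>>,
}

impl KnownHosts {
    fn new() -> Self {
        Self { records: HashMap::new() }
    }

    /// Stores the "fixed DNS records" for a name, e.g. as sent by the portal on init.
    fn insert(&mut self, name: &str, ips: Vec<IpAddr>) {
        self.records.insert(name.to_ascii_lowercase(), ips);
    }

    /// Looks up a name case-insensitively.
    fn lookup(&self, name: &str) -> Option<&[IpAddr]> {
        self.records.get(&name.to_ascii_lowercase()).map(Vec::as_slice)
    }
}

/// Extracts the host from a DoH URL such as `https://cloudflare-dns.com/dns-query`.
/// Naive on purpose: only `https://host/path` is handled.
fn doh_host(url: &str) -> Option<&str> {
    let rest = url.strip_prefix("https://")?;
    let host = rest.split('/').next()?;

    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}
```

With DoH active, connlib would call something like `known_hosts.lookup(doh_host(url)?)` and open the HTTPS connection to one of the returned IPs, e.g. after `known_hosts.insert("cloudflare-dns.com", vec!["192.0.2.1".parse().unwrap()])`.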
> - Support the well-known, public DoH resolvers, allowing the customer to select them as upstreams.
This will be a UI-only limitation, right? For the actual API between connlib and the portal, we will simply send domain-names, right? In fact, I think the DoH-flag should be separate from the list of DNS servers.
Also, I am afraid our current representation of DNS servers in the API is not forwards-compatible with new "kinds" of DNS servers: https://github.com/firezone/firezone/blob/3e30bab965a2f297e087db60b3396eb124f4ae29/rust/connlib/shared/src/messages.rs#L253-L257
(Same issue we've had with the Internet Resource type; new variants aren't allowed by default)
Plus, we won't allow a mix of UDP DNS servers and DoH servers, right? It is either one or the other so it doesn't really make sense to differentiate this on an item-level of the upstream_dns field.
We should split this such that we have different fields for the different kinds of DNS configuration. For example:
{
  "upstream_dns": [], // legacy field
  "upstream_udp_dns": [
    {
      "ip": "1.1.1.1",
      "port": 53
    }
  ],
  "upstream_doh": [
    {
      "url": "https://cloudflare-dns.com/dns-query"
    }
  ]
}
We would parse this such that DoH takes priority over UDP DNS servers (meaning the above example wouldn't make much sense unless we also allow mixing DNS server configuration).
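The priority rule could be expressed roughly as follows. This is a sketch only: `InterfaceDns`, `EffectiveUpstream` and `effective_upstream` are hypothetical names, the legacy `upstream_dns` field is left out, and deserialisation from JSON is not shown.

```rust
use std::net::SocketAddr;

/// The DNS part of the interface config, after deserialisation (hypothetical type).
#[derive(Debug, Default)]
struct InterfaceDns {
    upstream_udp_dns: Vec<SocketAddr>,
    upstream_doh: Vec<String>, // DoH URLs
}

/// Which upstream configuration connlib actually uses.
#[derive(Debug, PartialEq)]
enum EffectiveUpstream<'a> {
    Doh(&'a [String]),
    Udp(&'a [SocketAddr]),
    /// Nothing configured: fall back to the system's resolvers.
    SystemDefault,
}

/// DoH takes priority over UDP; the two are never mixed.
fn effective_upstream(config: &InterfaceDns) -> EffectiveUpstream<'_> {
    if !config.upstream_doh.is_empty() {
        EffectiveUpstream::Doh(&config.upstream_doh)
    } else if !config.upstream_udp_dns.is_empty() {
        EffectiveUpstream::Udp(&config.upstream_udp_dns)
    } else {
        EffectiveUpstream::SystemDefault
    }
}
```

Given the example config above, this would pick the DoH URL and ignore `1.1.1.1:53`.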
> I've done some research and it appears the recommended approach is to use the system resolver to initially resolve the DNS server hostname.
These don't really change though, right? Would we use the portal to resolve these on a periodic interval and send the mapping down to the client?
I suppose that could fail if there is a network partition and since the resolution is centralized it would affect all users.
Would it not be more resilient to resolve the initial DoH host from the client using the advertised IPs of the public dns services themselves?
Correct: selecting the upstream will offer only the following mutually exclusive user-facing choices:
- custom-udp (list)
- public-doh provider (single)
> I suppose that could fail if there is a network partition and since the resolution is centralized it would affect all users.
> Would it not be more resilient to resolve the initial DoH host from the client using the advertised IPs of the public dns services themselves?
It creates a fair amount of work because we now have to build a reliable DNS resolver and integrate that state machine into connlib, making the entire process of updating DNS servers async. It is doable but not an easy way of managing state.
My assumption would be that a name resolution in the portal per-request would hit some local DNS cache and thus is not very expensive.
Noting down the design decisions from standup:
- When Internet Resource is enabled ("Internet Security"), all queries should ultimately go to Internet Resource gateways, unless DoH is enabled - add a warning to the portal if they select this
- If custom upstream resolvers are selected, they'll be sent through the tunnel if Internet Resource is enabled (or an IP or CIDR resource matches)
- If nothing is enabled (client system default), then DNS queries get routed outside the tunnel even if the Internet Resource is enabled, because the client's system DNS is not reachable by the gateway
- Block adding the DNS Resources that match the address of public DoH resolvers
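As a sanity check on the decisions above, here is how I read them as a routing table. This is a sketch with hypothetical names (`Upstream`, `Route`, `route_dns`); in particular, sending DoH queries outside the tunnel is my interpretation of "unless DoH is enabled" and would need confirming.

```rust
/// Which kind of upstream the admin selected in the portal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Upstream {
    /// Nothing selected: the client's system resolvers are used.
    SystemDefault,
    /// Custom upstream resolvers (plain UDP).
    CustomUdp,
    /// One of the well-known public DoH providers.
    PublicDoh,
}

/// Where a DNS query ends up.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Route {
    Tunnel,
    Direct,
}

fn route_dns(upstream: Upstream, internet_resource_enabled: bool, matches_ip_or_cidr_resource: bool) -> Route {
    match upstream {
        // The client's system DNS is not reachable by the gateway,
        // so these stay outside the tunnel even with the Internet Resource.
        Upstream::SystemDefault => Route::Direct,
        // Assumption: DoH bypasses the Internet Resource gateways
        // (hence the warning in the portal).
        Upstream::PublicDoh => Route::Direct,
        // Custom resolvers go through the tunnel if anything routes them there.
        Upstream::CustomUdp => {
            if internet_resource_enabled || matches_ip_or_cidr_resource {
                Route::Tunnel
            } else {
                Route::Direct
            }
        }
    }
}
```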
portal changes
See https://github.com/firezone/firezone/pull/6905#issuecomment-2403605014
> My assumption would be that a name resolution in the portal per-request would hit some local DNS cache and thus is not very expensive.
Resolution for this: Hardcode the list of IPs in the code.