feat(gateway): resolve DNS in the background
When receiving a connection request for a DNS resource, the gateway will first resolve the domain name and only then respond with the ICE credentials needed by the client to establish the connection. This was necessary because prior to #4994, the response sent to the client included the resolved IPs for the domain name.
With #4994, the gateway only needs to resolve the domain name in order to correctly map the client's proxy IPs for incoming traffic. As a result, we can remove the DNS resolution from the hot path of connection setup and always accept a client's connection request directly.
Currently, if the DNS resolution fails, we never respond to the client, forcing it to time out before it can attempt a new connection. With this change, we always accept the connection and perform the DNS resolution in the background. If that fails, we return an empty list of IPs, in which case the gateway performs no translation.
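As a rough illustration of that flow, here is a minimal sketch using only the standard library. It is not the gateway's actual code: `Accepted`, `accept_connection`, and the injected `resolve` closure are hypothetical names, and a plain thread plus channel stands in for whatever async runtime the gateway really uses.

```rust
use std::io;
use std::net::IpAddr;
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Hypothetical reply to the client. Note that it carries no resolved IPs.
#[derive(Debug, PartialEq)]
pub struct Accepted {
    pub ice_credentials: String,
}

/// Accepts the request immediately and resolves `domain` on a background thread.
/// The resolved IPs (or an empty list on failure) arrive later on the returned channel.
pub fn accept_connection<R>(
    domain: &str,
    ice_credentials: &str,
    resolve: R,
) -> (Accepted, Receiver<Vec<IpAddr>>)
where
    R: FnOnce(&str) -> io::Result<Vec<IpAddr>> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let domain = domain.to_owned();

    thread::spawn(move || {
        // A failed lookup is mapped to an empty list, i.e. "no translation yet".
        let ips = resolve(&domain).unwrap_or_default();
        // The receiver may already be gone; that is not an error here.
        let _ = tx.send(ips);
    });

    let reply = Accepted {
        ice_credentials: ice_credentials.to_owned(),
    };
    (reply, rx)
}
```

The point of the sketch is that the reply no longer depends on the outcome of the lookup: the caller gets `Accepted` back right away, and a resolution error only shows up later as an empty list.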
The gateway already has functionality to refresh a DNS mapping if we have seen outgoing traffic for a proxy IP but no incoming traffic. We can extend this to also refresh DNS if there isn't even a mapping for the given proxy IP. This way, a failed DNS resolution as part of an allow or connection request is self-healing as soon as the client starts using the proxy IP.
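The extended refresh condition could look roughly like the sketch below. All names (`Mapping`, `Translations`, `needs_refresh`) and the boolean traffic flags are assumptions made for illustration; the real gateway may track this state differently, for example with timestamps.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Hypothetical per-proxy-IP translation state.
#[derive(Debug, Clone)]
pub struct Mapping {
    pub real_ip: IpAddr,
    /// We forwarded client traffic for this proxy IP to `real_ip`.
    pub seen_outgoing: bool,
    /// We saw traffic coming back from `real_ip`.
    pub seen_incoming: bool,
}

/// Hypothetical table of proxy IP -> translation state.
#[derive(Debug, Default)]
pub struct Translations {
    by_proxy_ip: HashMap<IpAddr, Mapping>,
}

impl Translations {
    pub fn insert(&mut self, proxy_ip: IpAddr, mapping: Mapping) {
        self.by_proxy_ip.insert(proxy_ip, mapping);
    }

    /// Decides whether a packet for `proxy_ip` should trigger a DNS refresh.
    pub fn needs_refresh(&self, proxy_ip: IpAddr) -> bool {
        match self.by_proxy_ip.get(&proxy_ip) {
            // New case: no mapping at all, e.g. the initial resolution failed.
            None => true,
            // Existing case: outgoing traffic but nothing coming back.
            Some(m) => m.seen_outgoing && !m.seen_incoming,
        }
    }
}
```

With a check along these lines, the first packet a client sends to an unmapped proxy IP is enough to trigger another resolution attempt, which is what makes a failed initial lookup self-healing.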
If DNS resolution on the gateway doesn't work reliably, almost nothing in Firezone works. The only difference with this PR is that a failed connection (i.e. the behaviour on main) is self-healing after ~20s, whereas with this change the user would have to sign out and back in to clear all connection state.
We can improve on this with some refactoring: The gateway already has functionality to refresh a DNS query if it sees packets for a proxy IP but no traffic for the translated IP. We can adapt this to also refresh the DNS query if we see traffic for a proxy IP and don't even have a real IP for this domain.
I just attempted to implement this, and I think it is actually reasonably clean.
Performance Test Results
TCP
| Test Name | Received/s | Sent/s | Retransmits |
|---|---|---|---|
| direct-tcp-client2server | 233.1 MiB (-0%) | 234.3 MiB (-1%) | 238 (-4%) |
| direct-tcp-server2client | 238.5 MiB (+3%) | 239.9 MiB (+2%) | 66 (-71%) |
| relayed-tcp-client2server | 233.4 MiB (-4%) | 234.4 MiB (-4%) | 298 (-14%) |
| relayed-tcp-server2client | 232.1 MiB (-3%) | 234.0 MiB (-2%) | 565 (-7%) |
UDP
| Test Name | Total/s | Jitter | Lost |
|---|---|---|---|
| direct-udp-client2server | 500.0 MiB (+0%) | 0.05ms (-3%) | 44.03% (-7%) |
| direct-udp-server2client | 500.0 MiB (+0%) | 0.01ms (-71%) | 23.73% (+8%) |
| relayed-udp-client2server | 500.0 MiB (+0%) | 0.03ms (-48%) | 54.96% (+2%) |
| relayed-udp-server2client | 500.0 MiB (-0%) | 0.01ms (-51%) | 37.44% (+1%) |
If we hold this PR until the new clients are released, we can make it substantially simpler by not worrying about backwards-compatibility!
This is blocked on being able to remove the backwards-compatibility layer for clients < 1.1. We could theoretically ship it earlier but it is a lot simpler to just delay this until we are happy to no longer support < 1.1 clients on newer gateways.
Will essentially be replaced by https://github.com/firezone/firezone/pull/6732.