Use different source ports for DNS queries
One possible issue related to #6265 is that we are using the same source port in a connlib session to talk to everything (STUN, Gateways, TURN, DNS).
On some NAT implementations, this increases the likelihood that we exhaust the NAT's connection tracking table if it is especially sensitive to source tuple reuse for different destinations. In most cases, applications on a host use different source ports to talk to different destinations, so it's conceivable that overloading a single source tuple with many destination tuples causes problems in certain NAT implementations.
This is the current working theory behind the VirtualBox-specific issues seen by @conectado in #6265.
Instead, it might be wise to consider picking a new source port for DNS and channel bindings at least.
Curious to hear @thomaseizinger's thoughts on this.
One experiment we can run is to packet-sweep a destination IP range using the same source port from within a VM guest and observe the NAT-mapped port on the host with Wireshark, to see at which point, if any, the source ports roll over.
A similar experiment could be executed using a Docker container.
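A minimal sketch of the sender side of such a sweep, using only the Rust standard library. The helper names `sweep_targets` and `sweep_from_single_port` are made up for illustration and are not part of connlib; the probe payload is arbitrary.

```rust
use std::io;
use std::net::{Ipv4Addr, SocketAddr, SocketAddrV4, UdpSocket};

/// Builds `count` consecutive IPv4 destinations starting at `base`, all on `port`.
/// (Hypothetical helper; wraps around at 255.255.255.255 instead of panicking.)
fn sweep_targets(base: Ipv4Addr, count: u32, port: u16) -> Vec<SocketAddr> {
    let start = u32::from(base);
    (0..count)
        .map(|i| {
            let ip = Ipv4Addr::from(start.wrapping_add(i));
            SocketAddr::V4(SocketAddrV4::new(ip, port))
        })
        .collect()
}

/// Sends one small probe to every target from a single socket, i.e. from one source port.
/// Returns the local port that was used so it can be matched against the capture.
fn sweep_from_single_port(socket: &UdpSocket, targets: &[SocketAddr]) -> io::Result<u16> {
    for (i, target) in targets.iter().enumerate() {
        // The payload carries the probe index so packets are easy to tell apart in a capture.
        socket.send_to(&(i as u32).to_be_bytes(), target)?;
    }
    Ok(socket.local_addr()?.port())
}
```

Inside the guest, one would bind a single socket (for example `UdpSocket::bind("0.0.0.0:0")`), pass it to `sweep_from_single_port` together with the output of `sweep_targets`, and then compare the returned port with the mapped ports seen in the host-side capture.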
> Curious to hear @thomaseizinger's thoughts on this.

I had a similar idea. Depending on how many different DNS servers we have, with #6181 we are now sending a lot more packets from the same source port.

> Instead, it might be wise to consider picking a new source port for DNS and channel bindings at least.

In theory, we can split packets across sockets in the following groups:
- STUN + p2p
- DNS queries
- TURN (allocations + channel bindings)
However, it would be much easier to split off only the DNS queries and have STUN, TURN and p2p keep sharing the same IP + port, as they always have.
Moving the DNS queries to a new socket would be pretty trivial; we just need to introduce different kinds of datagrams so we can tell them apart.
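A rough sketch of what tagging datagrams by kind could look like, using plain `std::net::UdpSocket`. The names `DatagramKind` and `Sockets` are assumptions for illustration only and do not reflect connlib's actual types or its async I/O layer.

```rust
use std::io;
use std::net::{IpAddr, SocketAddr, UdpSocket};

/// Hypothetical tag for outgoing datagrams; not connlib's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DatagramKind {
    /// STUN, TURN and p2p traffic keep sharing one socket.
    Connection,
    /// DNS queries to upstream resolvers go out through their own socket.
    Dns,
}

/// Two UDP sockets, each bound to its own OS-assigned ephemeral port.
struct Sockets {
    connection: UdpSocket,
    dns: UdpSocket,
}

impl Sockets {
    /// Binds both sockets on `ip`; port 0 lets the OS pick a free port for each.
    fn bind(ip: IpAddr) -> io::Result<Self> {
        Ok(Self {
            connection: UdpSocket::bind((ip, 0))?,
            dns: UdpSocket::bind((ip, 0))?,
        })
    }

    /// Routes the payload to the socket that matches its kind.
    fn send(&self, kind: DatagramKind, payload: &[u8], dst: SocketAddr) -> io::Result<usize> {
        let socket = match kind {
            DatagramKind::Connection => &self.connection,
            DatagramKind::Dns => &self.dns,
        };
        socket.send_to(payload, dst)
    }
}
```

With this shape, the caller only decides the kind of each datagram; which source port it leaves from follows from that, so DNS traffic no longer shares a NAT mapping with STUN, TURN and p2p.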
If this is the problem, it can also come up when a customer has many gateways, and it might be a problem for https://github.com/firezone/firezone/issues/6109, where we would talk to a lot more relays.
The DNS servers may not be the bottleneck here. Instead, we should maybe allocate a new socket for each relay (which will give us different server-reflexive addresses for each one). We already select a single relay per connection, so this would mean #6109 would scale well with that.
We would still talk to multiple endpoints using one socket (i.e. all gateways that happen to select the same relay for their connection). Currently, relay selection is random, so with the 2 relays we use, connections should be split roughly equally across them.
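To make the socket-per-relay idea concrete, here is a small sketch that lazily binds one UDP socket per relay address. `RelaySockets` and `socket_for` are hypothetical names, and keying by the relay's `SocketAddr` is an assumption; connlib may identify relays differently.

```rust
use std::collections::HashMap;
use std::io;
use std::net::{IpAddr, SocketAddr, UdpSocket};

/// One UDP socket per relay, created on first use (hypothetical type).
struct RelaySockets {
    bind_ip: IpAddr,
    by_relay: HashMap<SocketAddr, UdpSocket>,
}

impl RelaySockets {
    fn new(bind_ip: IpAddr) -> Self {
        Self {
            bind_ip,
            by_relay: HashMap::new(),
        }
    }

    /// Returns the socket dedicated to `relay`, binding a fresh ephemeral port
    /// the first time this relay is seen.
    fn socket_for(&mut self, relay: SocketAddr) -> io::Result<&UdpSocket> {
        if !self.by_relay.contains_key(&relay) {
            // Port 0 asks the OS for a free port, so each relay gets its own source port.
            let socket = UdpSocket::bind((self.bind_ip, 0))?;
            self.by_relay.insert(relay, socket);
        }
        Ok(&self.by_relay[&relay])
    }
}
```

All traffic for connections that selected the same relay would then go through `socket_for(relay)`, so the number of sockets grows with the number of relays rather than with the number of gateways.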