feat(connlib): discard sockets on `NetworkUnreachable` error
In connlib, we use quinn-udp to send UDP packets. In its current design, quinn-udp swallows all `io::Error`s during sending and logs them at `warn` (throttled to 1 / min). See https://github.com/quinn-rs/quinn/issues/1971 and linked issues for context.
In Firezone, this isn't really what we want. When connlib starts up, we try to bind two UDP sockets:
- Unspecified IPv4
- Unspecified IPv6
If either of those fails, that particular socket gets disabled. Binding to an address family only fails if we don't have a single interface that offers an address of that family. On many machines, there are IPv6 interfaces that only have a link-local address or are otherwise non-functional, yet we can still bind to them.
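For illustration, a minimal sketch of that binding step (not connlib's actual code; the function name and log messages are made up): bind each family to the unspecified address and treat a failed bind as "this family is disabled".

```rust
use std::net::{Ipv4Addr, Ipv6Addr, SocketAddr, UdpSocket};

/// Bind one socket per address family. A failed bind disables that family
/// instead of being treated as a fatal error. (Illustrative only.)
fn bind_sockets() -> (Option<UdpSocket>, Option<UdpSocket>) {
    let v4 = UdpSocket::bind(SocketAddr::from((Ipv4Addr::UNSPECIFIED, 0)))
        .inspect_err(|e| tracing::warn!("Failed to bind IPv4 socket: {e}"))
        .ok();
    // This succeeds as long as *any* interface has an IPv6 address, even if
    // that address is only link-local and therefore not routable.
    let v6 = UdpSocket::bind(SocketAddr::from((Ipv6Addr::UNSPECIFIED, 0)))
        .inspect_err(|e| tracing::warn!("Failed to bind IPv6 socket: {e}"))
        .ok();

    (v4, v6)
}
```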
The problem only surfaces when we try to send a packet to an IPv6 destination. If there are no IPv6 routes configured, the syscall fails with `NetworkUnreachable`. This error is currently swallowed by quinn-udp.
To expose this error, we fork quinn-udp and bubble up all `io::Error`s during sending. Effectively, this doesn't change anything in connlib because all errors returned from `Tunnel::poll_next_event` are already non-fatal: they are simply logged at `warn` and the event loop continues.
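The shape of that loop, sketched with stand-in types (connlib's real `Tunnel` and event types differ; only the "log and keep polling" pattern is the point):

```rust
use std::task::{Context, Poll};

// Stand-ins for connlib's types, for illustration only.
struct Tunnel;
struct Event;

impl Tunnel {
    fn poll_next_event(&mut self, _cx: &mut Context<'_>) -> Poll<std::io::Result<Event>> {
        Poll::Pending
    }
}

fn drive(tunnel: &mut Tunnel, cx: &mut Context<'_>) {
    loop {
        match tunnel.poll_next_event(cx) {
            // Errors bubbled up from the UDP sockets are non-fatal: log and keep polling.
            Poll::Ready(Err(e)) => tracing::warn!("Tunnel error: {e}"),
            Poll::Ready(Ok(_event)) => { /* dispatch the event */ }
            Poll::Pending => break,
        }
    }
}
```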
Now that we can intercept the error, we use it to proactively disable the offending socket, i.e. set it to `None`. Attempting to send on a disabled socket would then surface our own error, so we also refactor `send` to log a message at `trace` that we are dropping the packet. connlib's state machine already logs packet retries (like the STUN requests to the relays) at `debug`, hence we log at `trace` to avoid duplicating those logs.
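A minimal sketch of that behaviour (not the actual connlib code; names are illustrative, and matching on `ErrorKind::NetworkUnreachable` assumes a toolchain where that variant is stable):

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};

/// Send a packet on an optional socket, disabling the socket on `NetworkUnreachable`.
fn send(socket: &mut Option<UdpSocket>, dst: SocketAddr, payload: &[u8]) {
    let result = match socket.as_ref() {
        Some(s) => s.send_to(payload, dst),
        None => {
            // The state machine already logs packet retries at `debug`,
            // so dropping packets on a disabled socket is only `trace`.
            tracing::trace!(%dst, "Dropping packet: socket is disabled");
            return;
        }
    };

    match result {
        Ok(_) => {}
        Err(e) if e.kind() == io::ErrorKind::NetworkUnreachable => {
            tracing::warn!("Network unreachable; disabling socket");
            *socket = None;
        }
        Err(e) => tracing::warn!("Failed to send packet: {e}"),
    }
}
```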
In case we disable both sockets as a result of this, connlib's network stack will completely suspend until we receive a reset instruction from the client. If this happens on the gateway, things are somewhat irreparably broken. However, roaming on gateways is already something we don't support very well: we don't reset any network state, for example, so whatever behaviour we have there is not intentional. So far, there don't seem to be any complaints about this from users. When we choose to support roaming on gateways, we can build on top of this functionality and it will self-heal, similarly to what the clients do, assuming we get a notification from whatever component monitors the network interfaces.
I was concerned this would sign out the Client if you lose Internet access, but I tested it and it seems okay.
So this is just for eliminating bogus interfaces; it won't sign us out if we have a real interface that just has no Internet connection?
Found #6389 while testing this, but it replicates on main so it's unrelated
Well, this can sign us out, not when roaming, but if a client disconnects one network adapter and connects another, right?
Good thinking! Let me test that.
Maybe not having interfaces is fine? Maybe we should rely on the roaming trigger to reset our interfaces again afterwards?
@ReactorScram @jamilbk @conectado I updated this to now build on top of #6398 which should make this a lot smoother. Let me know if you are happy with the reasoning about the gateways functionality.
Perhaps discarding the socket right away is a bit too brutal? There could be network topologies in which some hosts are unreachable but others are fine.
Our gateways and relays should always be fine, but the admin could, for example, enter a bogus IPv6 DNS server address, in which case we would erroneously discard the socket when we first try to talk to it.
Yeah, this is a good catch. 100%.
I think one thing that might be handy is for Gateways to report their IPv4/IPv6 connectivity to the portal. It would only need to check for a publicly-accessible route. Most admins will just want to quickly know if they can assign IPv4 and IPv6 Resources to it safely.
I think in most cases that's what admins care about: if they are operating with an edge case, like the Gateway only having local IPv6, then I would suspect they're proficient enough to hop onto the Gateway and debug connectivity further from there.
Let's discuss that on the gateway PR: https://github.com/firezone/firezone/pull/6385.
I am thinking of changing this PR to instead pass back an error from the `Io` part to connlib's state machine. We can then disable DNS servers, relays, etc. as a result of that event. It also means we can test this much better.
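A hedged sketch of what that could look like, with hypothetical names (`IoEvent` and `handle_io_event` are not existing connlib APIs):

```rust
use std::io::ErrorKind;
use std::net::SocketAddr;

/// Hypothetical event emitted by the Io layer instead of silently discarding a socket.
enum IoEvent {
    SendFailed { dst: SocketAddr, kind: ErrorKind },
}

/// Sketch: the state machine disables only the component it failed to reach,
/// e.g. a single upstream DNS server, rather than the whole address family.
fn handle_io_event(event: IoEvent, upstream_dns: &mut Vec<SocketAddr>) {
    match event {
        IoEvent::SendFailed { dst, kind } if kind == ErrorKind::NetworkUnreachable => {
            upstream_dns.retain(|server| *server != dst);
        }
        IoEvent::SendFailed { .. } => {}
    }
}
```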
👍 and this would be reset on roaming right?
Yes. I think it is safe enough to assume that NetworkUnreachable isn't a temporary thing that randomly happens without roaming.
The more conservative approach would be to only disable DNS servers if we keep receiving timeouts. The downside of that is that it isn't instant, which would hurt the UX on startup. We could proactively query them for a Firezone-specific record maybe? Even if they don't have the record, getting a response means they are up. That is basically what we do for the relays with STUN bindings.
Checking them for firezone.dev `A` and `AAAA` records could work. When we implement #6138, we will need to be able to resolve that service as well.
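A sketch of such a health probe, using a hypothetical `lookup` helper as a stand-in for connlib's DNS client (the name and signature are assumptions):

```rust
use std::io;
use std::net::{IpAddr, SocketAddr};

/// Hypothetical stand-in: send an A/AAAA query for `domain` to a single upstream `server`.
async fn lookup(_server: SocketAddr, _domain: &str) -> io::Result<Vec<IpAddr>> {
    unimplemented!("stand-in for connlib's DNS client")
}

/// Probe each configured upstream resolver with a well-known name. Any response
/// (even an empty answer) proves the server is reachable; only an error disables it.
async fn probe_upstreams(servers: &[SocketAddr]) -> Vec<SocketAddr> {
    let mut usable = Vec::new();

    for &server in servers {
        match lookup(server, "firezone.dev").await {
            Ok(_) => usable.push(server),
            Err(e) => tracing::debug!(%server, "Disabling unreachable DNS server: {e}"),
        }
    }

    usable
}
```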
We could then also get rid of the static resolving of api.firezone.dev in the phoenix channel and just resolve that domain in addition to some others.
Closing this as "we don't want to do this". Instead, we should just forward the ICMP error to the TUN interface: https://github.com/firezone/firezone/pull/6428