Call `on_disconnect` for most portal connect errors
If the portal host is typo'd, connlib will keep trying to connect to it and never call on_disconnect. It might be better to fail quickly on name failures to the portal so that the client is not stuck in a signed in but inoperable state:
connlib 07:18:20.591010-0700 FirezoneNetworkExtensionmacOS dev.firezone.firezone Reconnecting to portal on transient client error: websocket connection failed: IO error: failed to lookup address information: nodename nor servname provided, or not known backoff=22.666594299s max_elapsed_time=Some(86400s)
refs #5467
I think for this particular one we can just immediately call on_disconnect if the initial connection fails.
For roaming / deploys, we can call reconnect which should have retry logic. wdyt @thomaseizinger ??
I think for this particular one we can just immediately call
on_disconnectif the initial connection fails.For roaming / deploys, we can call
reconnectwhich should have retry logic. wdyt @thomaseizinger ??
It is a bit tricky. At the moment, phoenix-channel does not differentiate between setting up a connection for the first time or because we called reconnect: We always just try to connect, see: https://github.com/firezone/firezone/blob/6641e1b70f0feb42e9077d37f4ac612916518e26/rust/phoenix-channel/src/lib.rs#L220-L223
It would be better if we can unambiguously identify "fatal" errors where we never want to retry. We have these already, like when the portal returns us 401. IO errors are a bit trickier because they are somewhat opaque (and platform-dependent).
I just experimented with this and I think with the addition of resolving the portal domain before connlib starts up in https://github.com/firezone/firezone/pull/5563, this is resolved at least for typos in the domain name. I tried to set the API as foo.firez.one and got the following behaviour:
❯ sudo RUST_LOG=wire::api=trace,debug target/debug/firezone-headless-client -u wss://foo.firez.one --firezone-id $(hostname) standalone
Place your right index finger on the fingerprint reader
2024-07-04T01:32:18.844543Z INFO firezone_headless_client::standalone: git_version="gui-client-1.1.3-3-gf6e99752e-modified"
2024-07-04T01:32:18.845013Z INFO firezone_headless_client::standalone: Running in headless / standalone mode
2024-07-04T01:32:18.880697Z INFO firezone_headless_client::dns_control::linux: dns_control_method=Some(Systemd)
2024-07-04T01:32:18.881183Z DEBUG netlink_proto::connection: reading incoming messages
2024-07-04T01:32:18.881227Z DEBUG netlink_proto::codecs: NetlinkCodec: decoding next message
2024-07-04T01:32:18.881269Z DEBUG netlink_proto::connection: forwarding unsolicited messages to the connection handle
2024-07-04T01:32:18.881306Z DEBUG netlink_proto::connection: forwarding responses to previous requests to the connection handle
2024-07-04T01:32:18.881347Z DEBUG netlink_proto::connection: handling requests
2024-07-04T01:32:18.881389Z DEBUG netlink_proto::connection: sending messages
2024-07-04T01:32:21.731527Z ERROR connlib_client_shared: connlib failed: failed to lookup address information: Name or service not known
2024-07-04T01:32:21.731604Z ERROR firezone_headless_client: Got `on_disconnect` from connlib error=Io(Custom { kind: Uncategorized, error: "failed to lookup address information: Name or service not known" })
2024-07-04T01:32:21.731825Z DEBUG firezone_headless_client::dns_control::linux: Reverting DNS control...
Error: Firezone disconnected
Caused by:
failed to lookup address information: Name or service not known
I just experimented with this and I think with the addition of resolving the portal domain before connlib starts up in #5563, this is resolved at least for typos in the domain name.
Can we consider this resolved for that case @jamilbk?
fixed by #5970