firezone icon indicating copy to clipboard operation
firezone copied to clipboard

connlib intermittently panics when roaming networks

Open jamilbk opened this issue 1 year ago • 9 comments

Happened just now as I was trying to reload GitHub.com -- it was having issues, then resolved, then crashed. Might be related #4726 but opening as a new issue in case it's not.

Last few lines of tunnel.log (Swift-side) and connlib.log (Rust-side) are attached.

Apple client is main -- 1.0.2 build 1713596227

tunnel.log

{"category":"tunnel","message":"didReceivePathUpdate(path:): Detected network change: Online.","severity":"DEBUG","time":"2024-04-22T14:35:37.658Z"}
{"category":"tunnel","message":"didReceivePathUpdate(path:): Detected network change: Online.","severity":"DEBUG","time":"2024-04-22T15:10:18.403Z"}
{"category":"tunnel","message":"CallbackHandler.onDisconnect: Panicked with a non-string payload","severity":"DEBUG","time":"2024-04-22T16:08:05.240Z"}
{"category":"tunnel","message":"onDisconnect(error:)","severity":"DEBUG","time":"2024-04-22T16:08:05.241Z"}
{"category":"tunnel","message":"Adapter state changed to: tunnelStopped","severity":"DEBUG","time":"2024-04-22T16:08:05.241Z"}
{"category":"tunnel","message":"stopTunnel: Reason: Connection failed","severity":"DEBUG","time":"2024-04-22T16:08:05.241Z"}
{"category":"tunnel","message":"Adapter.stop","severity":"DEBUG","time":"2024-04-22T16:08:05.241Z"}

connlib.log

{"time":"2024-04-22T16:07:35.830459Z","target":"quinn_udp","logging.googleapis.com/sourceLocation":{"file":"/Users/runner/.cargo/git/checkouts/quinn-b26c2db1d34f0cc5/88f48b0/quinn-udp/src/lib.rs","line":"136"},"severity":"WARNING","message":"sendmsg error: Os { code: 49, kind: AddrNotAvailable, message: \"Can't assign requested address\" }, Transmit: { destination: [2600:1900:4020:eb:0:5f::]:56928, src_ip: Some(2600:1700:3ecb:2410:44b:b5aa:8e7c:8082), enc: None, len: 88, segment_size: None }"}
{"time":"2024-04-22T16:07:35.888093Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"975"},"span":{"class":"success response","id":"TransactionId(0x9E46DA661DA46D17F1D879C0)","method":"channel bind","rtt":"70.197375ms","name":"handle_input"},"spans":[{"from":"35.243.166.33:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x9E46DA661DA46D17F1D879C0)","method":"channel bind","rtt":"70.197375ms","name":"handle_input"}],"severity":"DEBUG","channel":"16404","message":"Bound channel","peer":"35.243.166.33:56928"}
{"time":"2024-04-22T16:07:35.896052Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"975"},"span":{"class":"success response","id":"TransactionId(0x15C0E2B43104612F52518A56)","method":"channel bind","rtt":"70.152125ms","name":"handle_input"},"spans":[{"from":"35.243.166.33:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x15C0E2B43104612F52518A56)","method":"channel bind","rtt":"70.152125ms","name":"handle_input"}],"severity":"DEBUG","channel":"16405","message":"Bound channel","peer":"[2600:1900:4020:eb:0:5f::]:56928"}
{"time":"2024-04-22T16:07:36.03579Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"975"},"span":{"class":"success response","id":"TransactionId(0xE40205A1C263D2EF59DF5964)","method":"channel bind","rtt":"69.404125ms","name":"handle_input"},"spans":[{"from":"35.243.166.33:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0xE40205A1C263D2EF59DF5964)","method":"channel bind","rtt":"69.404125ms","name":"handle_input"}],"severity":"DEBUG","channel":"16406","message":"Bound channel","peer":"35.243.166.33:60081"}
{"time":"2024-04-22T16:07:36.085507Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"975"},"span":{"class":"success response","id":"TransactionId(0x1448EDA68BDB17B62EB3F122)","method":"channel bind","rtt":"69.851541ms","name":"handle_input"},"spans":[{"from":"35.243.166.33:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x1448EDA68BDB17B62EB3F122)","method":"channel bind","rtt":"69.851541ms","name":"handle_input"}],"severity":"DEBUG","channel":"16407","message":"Bound channel","peer":"[2600:1900:4020:eb:0:5f::]:60081"}
{"time":"2024-04-22T16:07:36.202363Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"975"},"span":{"class":"success response","id":"TransactionId(0xB73C6D8BDA98EF7A96C37C16)","method":"channel bind","rtt":"136.361667ms","name":"handle_input"},"spans":[{"from":"[2600:1900:40c0:3149:0:59::]:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0xB73C6D8BDA98EF7A96C37C16)","method":"channel bind","rtt":"136.361667ms","name":"handle_input"}],"severity":"DEBUG","channel":"16388","message":"Bound channel","peer":"35.197.253.13:51370"}
{"time":"2024-04-22T16:07:36.252666Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"975"},"span":{"class":"success response","id":"TransactionId(0x17857DB32A3029A93B918B3B)","method":"channel bind","rtt":"136.591333ms","name":"handle_input"},"spans":[{"from":"[2600:1900:40c0:3149:0:59::]:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x17857DB32A3029A93B918B3B)","method":"channel bind","rtt":"136.591333ms","name":"handle_input"}],"severity":"DEBUG","channel":"16389","message":"Bound channel","peer":"[2600:1900:40c0:3149:0:59::]:51370"}
{"time":"2024-04-22T16:07:55.786459Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"443"},"span":{"relay":"[2600:1900:4150:31e:0:59::]:3478","name":"handle_timeout"},"spans":[{"relay":"[2600:1900:4150:31e:0:59::]:3478","name":"handle_timeout"}],"severity":"DEBUG","message":"Allocation is due for a refresh"}
{"time":"2024-04-22T16:07:55.960972Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"353"},"span":{"class":"success response","id":"TransactionId(0x0B8A582626FB5086832224A3)","method":"refresh","rtt":"175.435292ms","name":"handle_input"},"spans":[{"from":"[2600:1900:4150:31e:0:59::]:3478","num_bytes":"28","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x0B8A582626FB5086832224A3)","method":"refresh","rtt":"175.435292ms","name":"handle_input"}],"severity":"INFO","lifetime":"Lifetime(600s)","message":"Updated lifetime of allocation","relayIp4":"Some(Candidate(relay=34.88.203.73:58962/udp prio=50331391))","relayIp6":"Some(Candidate(relay=[2600:1900:4150:31e:0:59::]:58962/udp prio=50331647))","srflx":"Some(Candidate(srflx=[2600:1700:3ecb:2410:1d6f:8a5:e5b2:1923]:55465/udp prio=1694498815))"}

jamilbk avatar Apr 22 '24 16:04 jamilbk

I may need to turn up the debugging level. I'm not sure the connlib logs captured anything useful.

jamilbk avatar Apr 22 '24 16:04 jamilbk

cc @thomaseizinger thinking maybe the DB ID thing may be causing an invalid lookup or something somewhere. Will try to capture a better log.

jamilbk avatar Apr 22 '24 16:04 jamilbk

@thomaseizinger Got the error this time:

{"category":"tunnel","message":"didReceivePathUpdate(path:): Detected network change: Online.","severity":"DEBUG","time":"2024-04-26T21:19:22.823Z"}
{"category":"tunnel","message":"didReceivePathUpdate(path:): Detected network change: Online.","severity":"DEBUG","time":"2024-04-26T21:19:22.985Z"}
{"category":"tunnel","message":"didReceivePathUpdate(path:): Detected network change: Online.","severity":"DEBUG","time":"2024-04-26T21:19:23.088Z"}
{"category":"tunnel","message":"didReceivePathUpdate(path:): Detected network change: Online.","severity":"DEBUG","time":"2024-04-26T21:19:23.239Z"}
{"category":"tunnel","message":"CallbackHandler.onDisconnect: Connlib panicked: to only nominate existing candidates","severity":"DEBUG","time":"2024-04-26T21:33:55.843Z"}
{"category":"tunnel","message":"onDisconnect(error:)","severity":"DEBUG","time":"2024-04-26T21:33:55.843Z"}
{"category":"tunnel","message":"Adapter state changed to: tunnelStopped","severity":"DEBUG","time":"2024-04-26T21:33:55.844Z"}
{"category":"tunnel","message":"stopTunnel: Reason: Connection failed","severity":"DEBUG","time":"2024-04-26T21:33:55.845Z"}
{"category":"tunnel","message":"Adapter.stop","severity":"DEBUG","time":"2024-04-26T21:33:55.845Z"}

Looks like a state bug in snownet or str0m?

This was reproduced by closing my laptop lid, coming home, opening it, then connecting to my thunderbolt dock where ethernet is connected.

So,

  1. Sign into Firezone over WiFi at coffee shop (maybe IPv6 was unavailable)
  2. Put laptop to sleep
  3. Open it, laptop connects to home WiFi
  4. A few seconds later, ethernet is connected
  5. Firezone is signed out

jamilbk avatar Apr 26 '24 23:04 jamilbk

Here are the last few connlib logs, though I'm not sure how useful since we don't log the onDisconnect on the way out to the FFI.

{"time":"2024-04-26T21:33:55.480173Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"1193"},"span":{"class":"success response","id":"TransactionId(0x6D2F98DBAA210C81A9C32987)","method":"channel bind","rtt":"89.712292ms","name":"handle_input"},"spans":[{"from":"34.148.177.140:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x6D2F98DBAA210C81A9C32987)","method":"channel bind","rtt":"89.712292ms","name":"handle_input"}],"severity":"DEBUG","channel":"16394","message":"Bound channel","peer":"[2600:1900:4020:eb:0:25::]:49535"}
{"time":"2024-04-26T21:33:55.535714Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"352"},"span":{"class":"error response","id":"TransactionId(0x9FCADDDB4CA079EAE75A54DE)","method":"channel bind","rtt":"150.528333ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"92","name":"decapsulate"},{"class":"error response","id":"TransactionId(0x9FCADDDB4CA079EAE75A54DE)","method":"channel bind","rtt":"150.528333ms","name":"handle_input"}],"severity":"DEBUG","error":"Unauthorized","message":"Request failed, re-authenticating"}
{"time":"2024-04-26T21:33:55.540087Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"352"},"span":{"class":"error response","id":"TransactionId(0xA1D95440B0C50104ADACED07)","method":"channel bind","rtt":"151.681625ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"92","name":"decapsulate"},{"class":"error response","id":"TransactionId(0xA1D95440B0C50104ADACED07)","method":"channel bind","rtt":"151.681625ms","name":"handle_input"}],"severity":"DEBUG","error":"Unauthorized","message":"Request failed, re-authenticating"}
{"time":"2024-04-26T21:33:55.54028Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"352"},"span":{"class":"error response","id":"TransactionId(0xB61EE4C6581CF857A6DDC208)","method":"channel bind","rtt":"151.845958ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"92","name":"decapsulate"},{"class":"error response","id":"TransactionId(0xB61EE4C6581CF857A6DDC208)","method":"channel bind","rtt":"151.845958ms","name":"handle_input"}],"severity":"DEBUG","error":"Unauthorized","message":"Request failed, re-authenticating"}
{"time":"2024-04-26T21:33:55.623998Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"724"},"span":{"class":"success response","id":"TransactionId(0x34B2DD5B8DBC35788ECE378B)","method":"refresh","rtt":"146.5645ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"28","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x34B2DD5B8DBC35788ECE378B)","method":"refresh","rtt":"146.5645ms","name":"handle_input"}],"severity":"INFO","lifetime":"Some((Instant { tv_sec: 104910, tv_nsec: 93044833 }, 600s))","message":"Updated allocation","relayIp4":"Some(Candidate(relay=34.39.28.213:50195/udp prio=50331391))","relayIp6":"Some(Candidate(relay=[2600:1900:40c0:3149:0:9::]:50195/udp prio=50331647))","srflxIp4":"Some(Candidate(srflx=73.93.174.109:53751/udp base=192.168.1.19:53751 prio=1694498559))","srflxIp6":"Some(Candidate(srflx=[2600:1700:3ecb:2410:352a:a13:1575:87f4]:57969/udp prio=1694498815))"}
{"time":"2024-04-26T21:33:55.681577Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"1193"},"span":{"class":"success response","id":"TransactionId(0x01A615D476A64D20C8377491)","method":"channel bind","rtt":"145.8815ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x01A615D476A64D20C8377491)","method":"channel bind","rtt":"145.8815ms","name":"handle_input"}],"severity":"DEBUG","channel":"16397","message":"Bound channel","peer":"107.197.104.68:53690"}
{"time":"2024-04-26T21:33:55.686749Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"1193"},"span":{"class":"success response","id":"TransactionId(0x545E9B33508F0419BC84027D)","method":"channel bind","rtt":"146.6605ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x545E9B33508F0419BC84027D)","method":"channel bind","rtt":"146.6605ms","name":"handle_input"}],"severity":"DEBUG","channel":"16398","message":"Bound channel","peer":"34.39.28.213:55568"}
{"time":"2024-04-26T21:33:55.686932Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"1193"},"span":{"class":"success response","id":"TransactionId(0x03F7A73B6C8B53FAD7D0CF24)","method":"channel bind","rtt":"146.644625ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x03F7A73B6C8B53FAD7D0CF24)","method":"channel bind","rtt":"146.644625ms","name":"handle_input"}],"severity":"DEBUG","channel":"16399","message":"Bound channel","peer":"[2600:1900:40c0:3149:0:9::]:55568"}

jamilbk avatar Apr 26 '24 23:04 jamilbk

This is a regression from https://github.com/firezone/firezone/commit/51089b89e78fb30e95b0b427c0520ff14cf64414 where we started filtering by non-discarded candidates in local_candidate:

https://github.com/firezone/firezone/blame/main/rust/connlib/snownet/src/node.rs#L1748

I think it might be a state bug in str0m. What seems to happen is that as part of roaming "back" to the old address, we don't add a new candidate because we consider it redundant however, the old one is actually marked as "discarded" so there is a state mismatch where sometimes candidates that are discarded are excluded in some parts in other they aren't.

thomaseizinger avatar Apr 28 '24 04:04 thomaseizinger

@thomaseizinger re-adding is potentially fixed in https://github.com/algesten/str0m/pull/508 – what I don't get is "crash" – does it panic? I can't see that anything in str0m for these code paths would do anything worse than just not connecting.

algesten avatar Apr 28 '24 09:04 algesten

@thomaseizinger re-adding is potentially fixed in algesten/str0m#508 – what I don't get is "crash" – does it panic? I can't see that anything in str0m for these code paths would do anything worse than just not connecting.

It isn't str0m that panics, we panic because we try to find a non-discarded candidate when looking through the local ones upon nomination.

We panic because I had assumed that discarded candidates never get nominated.

thomaseizinger avatar Apr 28 '24 11:04 thomaseizinger

Re-opened since the fix for this was reverted since it prevented connections. @thomaseizinger probably a good one to tackle when you're back in full-swing.

jamilbk avatar May 17 '24 21:05 jamilbk

Re-opened since the fix for this was reverted since it prevented connections. @thomaseizinger probably a good one to tackle when you're back in full-swing.

@conectado and I agreed on Friday that I'll first focus on finishing https://github.com/firezone/firezone/pull/4728 and then extend it for DNS resources to form a better test base for the DNS refactoring. @conectado will work on the implementation in the meantime.

I think I might have to come back to this as a result because I'd hope that the tests reproduce this panic :)

thomaseizinger avatar May 18 '24 11:05 thomaseizinger