connlib intermittently panics when roaming networks
Happened just now as I was trying to reload GitHub.com -- it was having issues, then resolved, then crashed. Might be related #4726 but opening as a new issue in case it's not.
Last few lines of tunnel.log (Swift-side) and connlib.log (Rust-side) are attached.
Apple client is main -- 1.0.2 build 1713596227
tunnel.log
{"category":"tunnel","message":"didReceivePathUpdate(path:): Detected network change: Online.","severity":"DEBUG","time":"2024-04-22T14:35:37.658Z"}
{"category":"tunnel","message":"didReceivePathUpdate(path:): Detected network change: Online.","severity":"DEBUG","time":"2024-04-22T15:10:18.403Z"}
{"category":"tunnel","message":"CallbackHandler.onDisconnect: Panicked with a non-string payload","severity":"DEBUG","time":"2024-04-22T16:08:05.240Z"}
{"category":"tunnel","message":"onDisconnect(error:)","severity":"DEBUG","time":"2024-04-22T16:08:05.241Z"}
{"category":"tunnel","message":"Adapter state changed to: tunnelStopped","severity":"DEBUG","time":"2024-04-22T16:08:05.241Z"}
{"category":"tunnel","message":"stopTunnel: Reason: Connection failed","severity":"DEBUG","time":"2024-04-22T16:08:05.241Z"}
{"category":"tunnel","message":"Adapter.stop","severity":"DEBUG","time":"2024-04-22T16:08:05.241Z"}
connlib.log
{"time":"2024-04-22T16:07:35.830459Z","target":"quinn_udp","logging.googleapis.com/sourceLocation":{"file":"/Users/runner/.cargo/git/checkouts/quinn-b26c2db1d34f0cc5/88f48b0/quinn-udp/src/lib.rs","line":"136"},"severity":"WARNING","message":"sendmsg error: Os { code: 49, kind: AddrNotAvailable, message: \"Can't assign requested address\" }, Transmit: { destination: [2600:1900:4020:eb:0:5f::]:56928, src_ip: Some(2600:1700:3ecb:2410:44b:b5aa:8e7c:8082), enc: None, len: 88, segment_size: None }"}
{"time":"2024-04-22T16:07:35.888093Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"975"},"span":{"class":"success response","id":"TransactionId(0x9E46DA661DA46D17F1D879C0)","method":"channel bind","rtt":"70.197375ms","name":"handle_input"},"spans":[{"from":"35.243.166.33:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x9E46DA661DA46D17F1D879C0)","method":"channel bind","rtt":"70.197375ms","name":"handle_input"}],"severity":"DEBUG","channel":"16404","message":"Bound channel","peer":"35.243.166.33:56928"}
{"time":"2024-04-22T16:07:35.896052Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"975"},"span":{"class":"success response","id":"TransactionId(0x15C0E2B43104612F52518A56)","method":"channel bind","rtt":"70.152125ms","name":"handle_input"},"spans":[{"from":"35.243.166.33:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x15C0E2B43104612F52518A56)","method":"channel bind","rtt":"70.152125ms","name":"handle_input"}],"severity":"DEBUG","channel":"16405","message":"Bound channel","peer":"[2600:1900:4020:eb:0:5f::]:56928"}
{"time":"2024-04-22T16:07:36.03579Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"975"},"span":{"class":"success response","id":"TransactionId(0xE40205A1C263D2EF59DF5964)","method":"channel bind","rtt":"69.404125ms","name":"handle_input"},"spans":[{"from":"35.243.166.33:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0xE40205A1C263D2EF59DF5964)","method":"channel bind","rtt":"69.404125ms","name":"handle_input"}],"severity":"DEBUG","channel":"16406","message":"Bound channel","peer":"35.243.166.33:60081"}
{"time":"2024-04-22T16:07:36.085507Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"975"},"span":{"class":"success response","id":"TransactionId(0x1448EDA68BDB17B62EB3F122)","method":"channel bind","rtt":"69.851541ms","name":"handle_input"},"spans":[{"from":"35.243.166.33:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x1448EDA68BDB17B62EB3F122)","method":"channel bind","rtt":"69.851541ms","name":"handle_input"}],"severity":"DEBUG","channel":"16407","message":"Bound channel","peer":"[2600:1900:4020:eb:0:5f::]:60081"}
{"time":"2024-04-22T16:07:36.202363Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"975"},"span":{"class":"success response","id":"TransactionId(0xB73C6D8BDA98EF7A96C37C16)","method":"channel bind","rtt":"136.361667ms","name":"handle_input"},"spans":[{"from":"[2600:1900:40c0:3149:0:59::]:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0xB73C6D8BDA98EF7A96C37C16)","method":"channel bind","rtt":"136.361667ms","name":"handle_input"}],"severity":"DEBUG","channel":"16388","message":"Bound channel","peer":"35.197.253.13:51370"}
{"time":"2024-04-22T16:07:36.252666Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"975"},"span":{"class":"success response","id":"TransactionId(0x17857DB32A3029A93B918B3B)","method":"channel bind","rtt":"136.591333ms","name":"handle_input"},"spans":[{"from":"[2600:1900:40c0:3149:0:59::]:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x17857DB32A3029A93B918B3B)","method":"channel bind","rtt":"136.591333ms","name":"handle_input"}],"severity":"DEBUG","channel":"16389","message":"Bound channel","peer":"[2600:1900:40c0:3149:0:59::]:51370"}
{"time":"2024-04-22T16:07:55.786459Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"443"},"span":{"relay":"[2600:1900:4150:31e:0:59::]:3478","name":"handle_timeout"},"spans":[{"relay":"[2600:1900:4150:31e:0:59::]:3478","name":"handle_timeout"}],"severity":"DEBUG","message":"Allocation is due for a refresh"}
{"time":"2024-04-22T16:07:55.960972Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"353"},"span":{"class":"success response","id":"TransactionId(0x0B8A582626FB5086832224A3)","method":"refresh","rtt":"175.435292ms","name":"handle_input"},"spans":[{"from":"[2600:1900:4150:31e:0:59::]:3478","num_bytes":"28","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x0B8A582626FB5086832224A3)","method":"refresh","rtt":"175.435292ms","name":"handle_input"}],"severity":"INFO","lifetime":"Lifetime(600s)","message":"Updated lifetime of allocation","relayIp4":"Some(Candidate(relay=34.88.203.73:58962/udp prio=50331391))","relayIp6":"Some(Candidate(relay=[2600:1900:4150:31e:0:59::]:58962/udp prio=50331647))","srflx":"Some(Candidate(srflx=[2600:1700:3ecb:2410:1d6f:8a5:e5b2:1923]:55465/udp prio=1694498815))"}
I may need to turn up the debugging level. I'm not sure the connlib logs captured anything useful.
cc @thomaseizinger thinking maybe the DB ID thing may be causing an invalid lookup or something somewhere. Will try to capture a better log.
@thomaseizinger Got the error this time:
{"category":"tunnel","message":"didReceivePathUpdate(path:): Detected network change: Online.","severity":"DEBUG","time":"2024-04-26T21:19:22.823Z"}
{"category":"tunnel","message":"didReceivePathUpdate(path:): Detected network change: Online.","severity":"DEBUG","time":"2024-04-26T21:19:22.985Z"}
{"category":"tunnel","message":"didReceivePathUpdate(path:): Detected network change: Online.","severity":"DEBUG","time":"2024-04-26T21:19:23.088Z"}
{"category":"tunnel","message":"didReceivePathUpdate(path:): Detected network change: Online.","severity":"DEBUG","time":"2024-04-26T21:19:23.239Z"}
{"category":"tunnel","message":"CallbackHandler.onDisconnect: Connlib panicked: to only nominate existing candidates","severity":"DEBUG","time":"2024-04-26T21:33:55.843Z"}
{"category":"tunnel","message":"onDisconnect(error:)","severity":"DEBUG","time":"2024-04-26T21:33:55.843Z"}
{"category":"tunnel","message":"Adapter state changed to: tunnelStopped","severity":"DEBUG","time":"2024-04-26T21:33:55.844Z"}
{"category":"tunnel","message":"stopTunnel: Reason: Connection failed","severity":"DEBUG","time":"2024-04-26T21:33:55.845Z"}
{"category":"tunnel","message":"Adapter.stop","severity":"DEBUG","time":"2024-04-26T21:33:55.845Z"}
Looks like a state bug in snownet or str0m?
This was reproduced by closing my laptop lid, coming home, opening it, then connecting to my thunderbolt dock where ethernet is connected.
So,
- Sign into Firezone over WiFi at coffee shop (maybe IPv6 was unavailable)
- Put laptop to sleep
- Open it, laptop connects to home WiFi
- A few seconds later, ethernet is connected
- Firezone is signed out
Here are the last few connlib logs, though I'm not sure how useful since we don't log the onDisconnect on the way out to the FFI.
{"time":"2024-04-26T21:33:55.480173Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"1193"},"span":{"class":"success response","id":"TransactionId(0x6D2F98DBAA210C81A9C32987)","method":"channel bind","rtt":"89.712292ms","name":"handle_input"},"spans":[{"from":"34.148.177.140:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x6D2F98DBAA210C81A9C32987)","method":"channel bind","rtt":"89.712292ms","name":"handle_input"}],"severity":"DEBUG","channel":"16394","message":"Bound channel","peer":"[2600:1900:4020:eb:0:25::]:49535"}
{"time":"2024-04-26T21:33:55.535714Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"352"},"span":{"class":"error response","id":"TransactionId(0x9FCADDDB4CA079EAE75A54DE)","method":"channel bind","rtt":"150.528333ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"92","name":"decapsulate"},{"class":"error response","id":"TransactionId(0x9FCADDDB4CA079EAE75A54DE)","method":"channel bind","rtt":"150.528333ms","name":"handle_input"}],"severity":"DEBUG","error":"Unauthorized","message":"Request failed, re-authenticating"}
{"time":"2024-04-26T21:33:55.540087Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"352"},"span":{"class":"error response","id":"TransactionId(0xA1D95440B0C50104ADACED07)","method":"channel bind","rtt":"151.681625ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"92","name":"decapsulate"},{"class":"error response","id":"TransactionId(0xA1D95440B0C50104ADACED07)","method":"channel bind","rtt":"151.681625ms","name":"handle_input"}],"severity":"DEBUG","error":"Unauthorized","message":"Request failed, re-authenticating"}
{"time":"2024-04-26T21:33:55.54028Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"352"},"span":{"class":"error response","id":"TransactionId(0xB61EE4C6581CF857A6DDC208)","method":"channel bind","rtt":"151.845958ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"92","name":"decapsulate"},{"class":"error response","id":"TransactionId(0xB61EE4C6581CF857A6DDC208)","method":"channel bind","rtt":"151.845958ms","name":"handle_input"}],"severity":"DEBUG","error":"Unauthorized","message":"Request failed, re-authenticating"}
{"time":"2024-04-26T21:33:55.623998Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"724"},"span":{"class":"success response","id":"TransactionId(0x34B2DD5B8DBC35788ECE378B)","method":"refresh","rtt":"146.5645ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"28","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x34B2DD5B8DBC35788ECE378B)","method":"refresh","rtt":"146.5645ms","name":"handle_input"}],"severity":"INFO","lifetime":"Some((Instant { tv_sec: 104910, tv_nsec: 93044833 }, 600s))","message":"Updated allocation","relayIp4":"Some(Candidate(relay=34.39.28.213:50195/udp prio=50331391))","relayIp6":"Some(Candidate(relay=[2600:1900:40c0:3149:0:9::]:50195/udp prio=50331647))","srflxIp4":"Some(Candidate(srflx=73.93.174.109:53751/udp base=192.168.1.19:53751 prio=1694498559))","srflxIp6":"Some(Candidate(srflx=[2600:1700:3ecb:2410:352a:a13:1575:87f4]:57969/udp prio=1694498815))"}
{"time":"2024-04-26T21:33:55.681577Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"1193"},"span":{"class":"success response","id":"TransactionId(0x01A615D476A64D20C8377491)","method":"channel bind","rtt":"145.8815ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x01A615D476A64D20C8377491)","method":"channel bind","rtt":"145.8815ms","name":"handle_input"}],"severity":"DEBUG","channel":"16397","message":"Bound channel","peer":"107.197.104.68:53690"}
{"time":"2024-04-26T21:33:55.686749Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"1193"},"span":{"class":"success response","id":"TransactionId(0x545E9B33508F0419BC84027D)","method":"channel bind","rtt":"146.6605ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x545E9B33508F0419BC84027D)","method":"channel bind","rtt":"146.6605ms","name":"handle_input"}],"severity":"DEBUG","channel":"16398","message":"Bound channel","peer":"34.39.28.213:55568"}
{"time":"2024-04-26T21:33:55.686932Z","target":"snownet::allocation","logging.googleapis.com/sourceLocation":{"file":"connlib/snownet/src/allocation.rs","line":"1193"},"span":{"class":"success response","id":"TransactionId(0x03F7A73B6C8B53FAD7D0CF24)","method":"channel bind","rtt":"146.644625ms","name":"handle_input"},"spans":[{"from":"34.39.28.213:3478","num_bytes":"20","name":"decapsulate"},{"class":"success response","id":"TransactionId(0x03F7A73B6C8B53FAD7D0CF24)","method":"channel bind","rtt":"146.644625ms","name":"handle_input"}],"severity":"DEBUG","channel":"16399","message":"Bound channel","peer":"[2600:1900:40c0:3149:0:9::]:55568"}
This is a regression from https://github.com/firezone/firezone/commit/51089b89e78fb30e95b0b427c0520ff14cf64414 where we started filtering by non-discarded candidates in local_candidate:
https://github.com/firezone/firezone/blame/main/rust/connlib/snownet/src/node.rs#L1748
I think it might be a state bug in str0m. What seems to happen is that as part of roaming "back" to the old address, we don't add a new candidate because we consider it redundant however, the old one is actually marked as "discarded" so there is a state mismatch where sometimes candidates that are discarded are excluded in some parts in other they aren't.
@thomaseizinger re-adding is potentially fixed in https://github.com/algesten/str0m/pull/508 – what I don't get is "crash" – does it panic? I can't see that anything in str0m for these code paths would do anything worse than just not connecting.
@thomaseizinger re-adding is potentially fixed in algesten/str0m#508 – what I don't get is "crash" – does it panic? I can't see that anything in str0m for these code paths would do anything worse than just not connecting.
It isn't str0m that panics, we panic because we try to find a non-discarded candidate when looking through the local ones upon nomination.
We panic because I had assumed that discarded candidates never get nominated.
Re-opened since the fix for this was reverted since it prevented connections. @thomaseizinger probably a good one to tackle when you're back in full-swing.
Re-opened since the fix for this was reverted since it prevented connections. @thomaseizinger probably a good one to tackle when you're back in full-swing.
@conectado and I agreed on Friday that I'll first focus on finishing https://github.com/firezone/firezone/pull/4728 and then extend it for DNS resources to form a better test base for the DNS refactoring. @conectado will work on the implementation in the meantime.
I think I might have to come back to this as a result because I'd hope that the tests reproduce this panic :)