Scheduled reconnection attempts stack up, presumably unbounded
Issue and Steps to Reproduce
- I have an unpublished channel between my CLN node and a Breez wallet running on my Android phone. The mobile node cannot accept incoming connections.
- From a fresh startup, CLN tries futilely to connect to my mobile node:

  ```
  2022-10-07T23:47:25.486Z DEBUG 03…1a-lightningd: Will try reconnect in 94 seconds
  2022-10-07T23:50:17.450Z DEBUG 03…1a-connectd: Failed connected out: All addresses failed: [fe80::…c]:38270: Connection establishment: Invalid argument.
  2022-10-07T23:50:17.454Z DEBUG 03…1a-lightningd: Will try reconnect in 188 seconds
  ```

  Note that CLN was trying to connect to a link-local IPv6 address without specifying a scope, so the system was correctly returning `EINVAL` (a minimal repro is the first sketch after this list).
- I connect Breez to my CLN node. Communications work as expected.

  ```
  2022-10-07T23:51:06.190Z DEBUG 03…1a-connectd: Connect IN
  2022-10-07T23:51:06.192Z DEBUG 03…1a-connectd: peer_out WIRE_INIT
  2022-10-07T23:51:06.194Z DEBUG 03…1a-connectd: peer_in WIRE_INIT
  2022-10-07T23:51:06.194Z DEBUG 03…1a-gossipd: seeker: disabling gossip
  2022-10-07T23:51:06.196Z INFO 03…1a-chan#140282: Peer transient failure in CHANNELD_NORMAL: Disconnected
  2022-10-07T23:51:06.196Z DEBUG 03…1a-chan#140282: Peer has reconnected, state CHANNELD_NORMAL: connecting subd
  2022-10-07T23:51:06.435Z DEBUG 03…1a-channeld-chan#140282: pid 7475, msgfd 408
  2022-10-07T23:51:06.435Z DEBUG 03…1a-chan#140282: Already have funding locked in (and ready to announce)
  2022-10-07T23:51:06.438Z DEBUG 03…1a-chan#140282: attempting update blockheight 4c…a0
  2022-10-07T23:51:06.442Z DEBUG 03…1a-connectd: Handed peer, entering loop
  2022-10-07T23:51:06.442Z DEBUG 03…1a-gossipd: Bad gossip order: WIRE_CHANNEL_UPDATE before announcement …50x…54x1/1
  2022-10-07T23:51:06.444Z DEBUG 03…1a-connectd: peer_out WIRE_GOSSIP_TIMESTAMP_FILTER
  2022-10-07T23:51:06.447Z DEBUG 03…1a-connectd: Activating for message WIRE_CHANNEL_REESTABLISH
  2022-10-07T23:51:06.449Z DEBUG 03…1a-channeld-chan#140282: option_static_remotekey = 1, option_anchor_outputs = 0
  2022-10-07T23:51:06.450Z DEBUG 03…1a-hsmd: Got WIRE_HSMD_GET_PER_COMMITMENT_POINT
  2022-10-07T23:51:06.452Z DEBUG 03…1a-channeld-chan#140282: init LOCAL: remote_per_commit = 03…2c, old_remote_per_commit = 02…78 next_idx_local = 751 next_idx_remote = 751 revocations_received = 750 feerates { SENT_ADD_ACK_REVOCATION:253 } range 253-15000 blockheights { SENT_ADD_ACK_REVOCATION:0 }, our current 757607
  2022-10-07T23:51:06.454Z DEBUG 03…1a-channeld-chan#140282: peer_out WIRE_CHANNEL_REESTABLISH
  2022-10-07T23:51:06.455Z DEBUG 03…1a-hsmd: Got WIRE_HSMD_GET_PER_COMMITMENT_POINT
  2022-10-07T23:51:06.456Z DEBUG 03…1a-channeld-chan#140282: billboard: Sent reestablish, waiting for theirs
  2022-10-07T23:51:06.457Z DEBUG 03…1a-connectd: peer_in WIRE_GOSSIP_TIMESTAMP_FILTER
  2022-10-07T23:51:06.459Z DEBUG 03…1a-hsmd: Got WIRE_HSMD_GET_PER_COMMITMENT_POINT
  2022-10-07T23:51:06.460Z DEBUG 03…1a-channeld-chan#140282: peer_in WIRE_CHANNEL_REESTABLISH
  2022-10-07T23:51:06.462Z DEBUG 03…1a-channeld-chan#140282: Got reestablish commit=751 revoke=750
  2022-10-07T23:51:06.464Z DEBUG 03…1a-channeld-chan#140282: next_revocation_number = 750
  2022-10-07T23:51:06.466Z DEBUG 03…1a-channeld-chan#140282: option_static_remotekey: fields are correct
  2022-10-07T23:51:06.468Z DEBUG 03…1a-channeld-chan#140282: billboard perm: Reconnected, and reestablished.
  2022-10-07T23:51:06.474Z DEBUG 03…1a-channeld-chan#140282: billboard: Funding transaction locked.
  2022-10-07T23:51:06.479Z DEBUG 03…1a-channeld-chan#140282: Trying commit
  2022-10-07T23:51:06.497Z DEBUG 03…1a-channeld-chan#140282: Can't send commit: nothing to send, feechange not wanted ({ SENT_ADD_ACK_REVOCATION:253 }) blockheight not wanted ({ SENT_ADD_ACK_REVOCATION:0 })
  ```

  The "Peer transient failure" message at the `INFO` level after reading the incoming `WIRE_INIT` is awkward, but okay, whatever.
- I disconnect Breez from my CLN node.

  ```
  2022-10-07T23:52:59.790Z INFO 03…1a-channeld-chan#140282: Peer connection lost
  2022-10-07T23:52:59.790Z INFO 03…1a-chan#140282: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
  2022-10-07T23:52:59.805Z DEBUG 03…1a-lightningd: peer_disconnect_done
  2022-10-07T23:52:59.805Z DEBUG 03…1a-lightningd: Will try reconnect in 300 seconds
  2022-10-07T23:52:59.808Z DEBUG plugin-funder: Cleaning up inflights for peer id 03…1a
  ```
- At this point, we would expect CLN's next reconnection attempt at 23:57:59.805Z. However, that is not what we observe.

  ```
  2022-10-07T23:53:26.541Z DEBUG 03…1a-connectd: Failed connected out: All addresses failed: [fe80::…c]:38270: Connection establishment: Invalid argument.
  2022-10-07T23:53:26.542Z DEBUG 03…1a-lightningd: Will try reconnect in 300 seconds
  ```

  Oops! 23:53:26.542Z is 189.088 seconds after 23:50:17.454Z, which was when CLN told us it would try to reconnect in 188 seconds. Apparently the original scheduled task was not pre-empted.
- So now will the next attempt come 300 seconds later? No: unfortunately, the disconnection event at 23:52:59.808Z scheduled a new reconnection attempt in parallel with the original recurring task.

  ```
  2022-10-07T23:58:01.330Z DEBUG 03…1a-connectd: Failed connected out: Unable to connect, no address known for peer
  2022-10-07T23:58:01.332Z DEBUG 03…1a-lightningd: Will try reconnect in 300 seconds
  2022-10-07T23:58:27.906Z DEBUG 03…1a-connectd: Failed connected out: All addresses failed: [fe80::…c]:38270: Connection establishment: Invalid argument.
  2022-10-07T23:58:27.926Z DEBUG 03…1a-lightningd: Will try reconnect in 300 seconds
  ```

  You can see that there are now two recurring reconnection tasks running at ~300-second intervals, staggered by ~27 seconds from each other (a toy model of this stacking is the second sketch after this list).
- Can we cause yet a third recurring reconnection task to be scheduled? Yes, we can, simply by connecting and disconnecting the mobile node again.

  ```
  2022-10-07T23:59:22.109Z DEBUG 03…1a-connectd: Connect IN
  2022-10-07T23:59:22.130Z DEBUG 03…1a-connectd: peer_out WIRE_INIT
  2022-10-07T23:59:22.133Z DEBUG 03…1a-connectd: peer_in WIRE_INIT
  2022-10-07T23:59:22.136Z INFO 03…1a-chan#140282: Peer transient failure in CHANNELD_NORMAL: Disconnected
  2022-10-07T23:59:22.136Z DEBUG 03…1a-chan#140282: Peer has reconnected, state CHANNELD_NORMAL: connecting subd
  […snip…]
  2022-10-07T23:59:34.883Z INFO 03…1a-channeld-chan#140282: Peer connection lost
  2022-10-07T23:59:34.883Z INFO 03…1a-chan#140282: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
  2022-10-07T23:59:34.901Z DEBUG 03…1a-lightningd: peer_disconnect_done
  2022-10-07T23:59:34.901Z DEBUG 03…1a-lightningd: Will try reconnect in 300 seconds
  2022-10-07T23:59:34.913Z DEBUG plugin-funder: Cleaning up inflights for peer id 03…1a
  2022-10-08T00:03:05.355Z DEBUG 03…1a-connectd: Failed connected out: Unable to connect, no address known for peer
  2022-10-08T00:03:05.366Z DEBUG 03…1a-lightningd: Will try reconnect in 300 seconds
  2022-10-08T00:03:28.470Z DEBUG 03…1a-connectd: Failed connected out: All addresses failed: [fe80::…c]:38270: Connection establishment: Invalid argument.
  2022-10-08T00:03:28.482Z DEBUG 03…1a-lightningd: Will try reconnect in 300 seconds
  2022-10-08T00:04:35.129Z DEBUG 03…1a-connectd: Failed connected out: Unable to connect, no address known for peer
  2022-10-08T00:04:35.148Z DEBUG 03…1a-lightningd: Will try reconnect in 300 seconds
  2022-10-08T00:08:06.774Z DEBUG 03…1a-connectd: Failed connected out: Unable to connect, no address known for peer
  2022-10-08T00:08:06.787Z DEBUG 03…1a-lightningd: Will try reconnect in 300 seconds
  2022-10-08T00:08:29.831Z DEBUG 03…1a-connectd: Failed connected out: All addresses failed: [fe80::…c]:38270: Connection establishment: Invalid argument.
  2022-10-08T00:08:29.832Z DEBUG 03…1a-lightningd: Will try reconnect in 300 seconds
  2022-10-08T00:09:38.287Z DEBUG 03…1a-connectd: Failed connected out: Unable to connect, no address known for peer
  2022-10-08T00:09:38.289Z DEBUG 03…1a-lightningd: Will try reconnect in 300 seconds
  ```
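As an aside, the `EINVAL` from the first step is easy to reproduce outside of CLN. On Linux at least, `connect(2)` rejects a link-local IPv6 destination whose `sin6_scope_id` is left at 0, because the kernel cannot tell which interface the `fe80::/10` address lives on. A minimal sketch (the `fe80::1` address is a placeholder, not my node's real, elided address):

```c
/* Minimal repro of the "Invalid argument" failures above: connecting to
 * a link-local IPv6 address without a scope ID. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in6 addr;
	int fd = socket(AF_INET6, SOCK_STREAM, 0);

	memset(&addr, 0, sizeof(addr));
	addr.sin6_family = AF_INET6;
	addr.sin6_port = htons(38270);
	inet_pton(AF_INET6, "fe80::1", &addr.sin6_addr);
	addr.sin6_scope_id = 0;	/* no scope: kernel can't pick an interface */

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
		printf("connect: %s\n", strerror(errno));	/* EINVAL */

	close(fd);
	return 0;
}
```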
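I have not traced the actual `lightningd` code paths, so the following is only a toy model of what the logs suggest, not CLN's real code: each reconnection task is a self-rescheduling timer, and the disconnect handler unconditionally starts another one instead of reusing a pending task, so independent ~300-second chains accumulate.

```c
/* Toy model (NOT CLN's actual code) of the hypothesized bug: every
 * disconnect starts a new self-rescheduling reconnect task without
 * checking for an existing one, so staggered chains pile up. */
#include <stdio.h>

#define MAX_TASKS 16	/* a limit only in this demo, apparently not in CLN */

struct task {
	int fire_at;	/* simulated time in seconds; -1 = retired */
	int interval;	/* reschedule period */
};

static struct task tasks[MAX_TASKS];
static int ntasks;

/* What the disconnect handler appears to do: schedule a fresh task
 * without checking whether one is already pending for this peer. */
static void on_disconnect(int now)
{
	if (ntasks < MAX_TASKS)
		tasks[ntasks++] = (struct task){ now + 300, 300 };
}

int main(void)
{
	int peer_connected = 0;	/* the peer stays offline in this run */

	on_disconnect(0);	/* the "virgin" task from startup */
	on_disconnect(27);	/* first connect/disconnect cycle */
	on_disconnect(55);	/* second connect/disconnect cycle */

	for (int now = 0; now <= 1000; now++) {
		for (int i = 0; i < ntasks; i++) {
			if (tasks[i].fire_at != now)
				continue;
			if (peer_connected) {
				/* A task that fires while the peer is
				 * connected retires quietly. */
				tasks[i].fire_at = -1;
				continue;
			}
			printf("t=%4d  task %d: Will try reconnect in %d seconds\n",
			       now, i, tasks[i].interval);
			tasks[i].fire_at = now + tasks[i].interval;
		}
	}
	return 0;
}
```

With the peer left offline, this prints three interleaved chains at ~300-second intervals, staggered exactly like the excerpts above; flipping `peer_connected` to 1 for more than 300 simulated seconds retires all of them, matching the quieting described below.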
My assumption is that there is no limit to the number of these recurring reconnection schedules that can pile up.
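If that model is anywhere near the truth, the fix might be as simple as a per-peer guard so that a disconnect reuses any pending task instead of adding another. A drop-in replacement for `on_disconnect()` in the sketch above (hypothetical, not a proposed CLN patch):

```c
/* Hypothetical guard: at most one pending reconnect task per peer. */
static int reconnect_scheduled;	/* would be per-peer state in reality */

static void on_disconnect(int now)
{
	if (reconnect_scheduled)
		return;		/* a chain is already running; reuse it */
	reconnect_scheduled = 1;
	if (ntasks < MAX_TASKS)
		tasks[ntasks++] = (struct task){ now + 300, 300 };
}
```

The flag would of course need clearing whenever the pending task retires, e.g. after a successful reconnection.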
If I leave the mobile node connected, there is no reconnection chatter in the log for as long as it stays connected; apparently a recurring reconnection task does stand down if the peer is connected at its scheduled activation time. I left my mobile node connected for longer than five minutes, so that every recurring scheduled task would have a chance to fire while it was connected, and then disconnected it. Now I am seeing reconnection attempts only every 5 minutes.
Curiously, it appears that only the "virgin" reconnection task tries the remembered (link-local) address. The subsequent tasks that were created upon a disconnection event don't know any addresses (and of course can't find any in the gossip store since this is an unpublished peer).
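Again only guessing at the mechanism, but that asymmetry is what you would get if each task snapshotted its address candidates at creation time: the startup task is seeded with the stored last-known address, while disconnect-time tasks start empty and find nothing in gossip for an unpublished peer. A self-contained illustration of the guess (`fe80::1` again stands in for the real, elided address):

```c
/* Speculative: each reconnect task snapshots its address candidates
 * when it is created, explaining the two different failure messages. */
#include <stdio.h>

struct task {
	int interval;
	const char *addr;	/* NULL => nothing remembered or in gossip */
};

static void attempt(const struct task *t)
{
	if (t->addr)
		printf("Failed connected out: All addresses failed: [%s]:38270\n",
		       t->addr);
	else
		printf("Failed connected out: Unable to connect, no address known for peer\n");
}

int main(void)
{
	/* The startup ("virgin") task is seeded from the stored address... */
	struct task virgin = { 94, "fe80::1" };
	/* ...while disconnect-time tasks start empty, and a gossip lookup
	 * can never fill them in for an unpublished peer. */
	struct task from_disconnect = { 300, NULL };

	attempt(&virgin);
	attempt(&from_disconnect);
	return 0;
}
```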
getinfo output
To avoid irreversibly migrating my database schema, I am running v0.11.2 with a large series of cherry-picked backports from master. It's mostly everything except channel aliases, zeroconf, bookkeeper, commando, the new autoclean expansions, and anything else requiring a database schema change.