Scheduled reconnection attempts stack up, presumably unbounded

Open whitslack opened this issue 3 years ago • 0 comments

Issue and Steps to Reproduce

I have an unpublished channel between my CLN node and a Breez wallet running on my Android phone. The mobile node cannot accept incoming connections.

From a fresh startup, CLN tries futilely to connect to my mobile node:

2022-10-07T23:47:25.486Z DEBUG   03…1a-lightningd: Will try reconnect in 94 seconds
2022-10-07T23:50:17.450Z DEBUG   03…1a-connectd: Failed connected out: All addresses failed: [fe80::…c]:38270: Connection establishment: Invalid argument.
2022-10-07T23:50:17.454Z DEBUG   03…1a-lightningd: Will try reconnect in 188 seconds

Note that CLN was trying to connect to a link-local IPv6 address without specifying a scope, so the system was correctly returning EINVAL.

I connect Breez to my CLN node. Communications work as expected.

2022-10-07T23:51:06.190Z DEBUG   03…1a-connectd: Connect IN
2022-10-07T23:51:06.192Z DEBUG   03…1a-connectd: peer_out WIRE_INIT
2022-10-07T23:51:06.194Z DEBUG   03…1a-connectd: peer_in WIRE_INIT
2022-10-07T23:51:06.194Z DEBUG   03…1a-gossipd: seeker: disabling gossip
2022-10-07T23:51:06.196Z INFO    03…1a-chan#140282: Peer transient failure in CHANNELD_NORMAL: Disconnected
2022-10-07T23:51:06.196Z DEBUG   03…1a-chan#140282: Peer has reconnected, state CHANNELD_NORMAL: connecting subd
2022-10-07T23:51:06.435Z DEBUG   03…1a-channeld-chan#140282: pid 7475, msgfd 408
2022-10-07T23:51:06.435Z DEBUG   03…1a-chan#140282: Already have funding locked in (and ready to announce)
2022-10-07T23:51:06.438Z DEBUG   03…1a-chan#140282: attempting update blockheight 4c…a0
2022-10-07T23:51:06.442Z DEBUG   03…1a-connectd: Handed peer, entering loop
2022-10-07T23:51:06.442Z DEBUG   03…1a-gossipd: Bad gossip order: WIRE_CHANNEL_UPDATE before announcement …50x…54x1/1
2022-10-07T23:51:06.444Z DEBUG   03…1a-connectd: peer_out WIRE_GOSSIP_TIMESTAMP_FILTER
2022-10-07T23:51:06.447Z DEBUG   03…1a-connectd: Activating for message WIRE_CHANNEL_REESTABLISH
2022-10-07T23:51:06.449Z DEBUG   03…1a-channeld-chan#140282: option_static_remotekey = 1, option_anchor_outputs = 0
2022-10-07T23:51:06.450Z DEBUG   03…1a-hsmd: Got WIRE_HSMD_GET_PER_COMMITMENT_POINT
2022-10-07T23:51:06.452Z DEBUG   03…1a-channeld-chan#140282: init LOCAL: remote_per_commit = 03…2c, old_remote_per_commit = 02…78 next_idx_local = 751 next_idx_remote = 751 revocations_received = 750 feerates { SENT_ADD_ACK_REVOCATION:253 } range 253-15000 blockheights { SENT_ADD_ACK_REVOCATION:0 }, our current 757607
2022-10-07T23:51:06.454Z DEBUG   03…1a-channeld-chan#140282: peer_out WIRE_CHANNEL_REESTABLISH
2022-10-07T23:51:06.455Z DEBUG   03…1a-hsmd: Got WIRE_HSMD_GET_PER_COMMITMENT_POINT
2022-10-07T23:51:06.456Z DEBUG   03…1a-channeld-chan#140282: billboard: Sent reestablish, waiting for theirs
2022-10-07T23:51:06.457Z DEBUG   03…1a-connectd: peer_in WIRE_GOSSIP_TIMESTAMP_FILTER
2022-10-07T23:51:06.459Z DEBUG   03…1a-hsmd: Got WIRE_HSMD_GET_PER_COMMITMENT_POINT
2022-10-07T23:51:06.460Z DEBUG   03…1a-channeld-chan#140282: peer_in WIRE_CHANNEL_REESTABLISH
2022-10-07T23:51:06.462Z DEBUG   03…1a-channeld-chan#140282: Got reestablish commit=751 revoke=750
2022-10-07T23:51:06.464Z DEBUG   03…1a-channeld-chan#140282: next_revocation_number = 750
2022-10-07T23:51:06.466Z DEBUG   03…1a-channeld-chan#140282: option_static_remotekey: fields are correct
2022-10-07T23:51:06.468Z DEBUG   03…1a-channeld-chan#140282: billboard perm: Reconnected, and reestablished.
2022-10-07T23:51:06.474Z DEBUG   03…1a-channeld-chan#140282: billboard: Funding transaction locked.
2022-10-07T23:51:06.479Z DEBUG   03…1a-channeld-chan#140282: Trying commit
2022-10-07T23:51:06.497Z DEBUG   03…1a-channeld-chan#140282: Can't send commit: nothing to send, feechange not wanted ({ SENT_ADD_ACK_REVOCATION:253 }) blockheight not wanted ({ SENT_ADD_ACK_REVOCATION:0 })

The "Peer transient failure" message at the INFO level after reading the incoming WIRE_INIT is awkward, but okay, whatever.

I disconnect Breez from my CLN node.

2022-10-07T23:52:59.790Z INFO    03…1a-channeld-chan#140282: Peer connection lost
2022-10-07T23:52:59.790Z INFO    03…1a-chan#140282: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-10-07T23:52:59.805Z DEBUG   03…1a-lightningd: peer_disconnect_done
2022-10-07T23:52:59.805Z DEBUG   03…1a-lightningd: Will try reconnect in 300 seconds
2022-10-07T23:52:59.808Z DEBUG   plugin-funder: Cleaning up inflights for peer id 03…1a

At this point, we would expect CLN's next attempt to reconnect at 23:57:59.805Z. However, that is not what we observe.
```
2022-10-07T23:53:26.541Z DEBUG   03…1a-connectd: Failed connected out: All addresses failed: [fe80::…c]:38270: Connection establishment: Invalid argument.
2022-10-07T23:53:26.542Z DEBUG   03…1a-lightningd: Will try reconnect in 300 seconds
```
Oops! 23:53:26.542Z is 189.088 seconds after 23:50:17.454Z (which was when CLN told us it would retry to connect in 188 seconds). Apparently the original scheduled task was not pre-empted.

So now will the next attempt be 300 seconds later? No, unfortunately the disconnection event at 23:52:59.808Z did schedule a new reconnection attempt in parallel with the original recurring task.

2022-10-07T23:58:01.330Z DEBUG   03…1a-connectd: Failed connected out: Unable to connect, no address known for peer
2022-10-07T23:58:01.332Z DEBUG   03…1a-lightningd: Will try reconnect in 300 seconds
2022-10-07T23:58:27.906Z DEBUG   03…1a-connectd: Failed connected out: All addresses failed: [fe80::…c]:38270: Connection establishment: Invalid argument.
2022-10-07T23:58:27.926Z DEBUG   03…1a-lightningd: Will try reconnect in 300 seconds

You can see that there are now two recurring reconnection tasks scheduled at ~300-second intervals, staggered by ~27 seconds from each other.

Can we cause yet a third recurring reconnection task to be scheduled? Yes, we can, simply by connecting and disconnecting the mobile node again.

2022-10-07T23:59:22.109Z DEBUG   03…1a-connectd: Connect IN
2022-10-07T23:59:22.130Z DEBUG   03…1a-connectd: peer_out WIRE_INIT
2022-10-07T23:59:22.133Z DEBUG   03…1a-connectd: peer_in WIRE_INIT
2022-10-07T23:59:22.136Z INFO    03…1a-chan#140282: Peer transient failure in CHANNELD_NORMAL: Disconnected
2022-10-07T23:59:22.136Z DEBUG   03…1a-chan#140282: Peer has reconnected, state CHANNELD_NORMAL: connecting subd
[…snip…]
2022-10-07T23:59:34.883Z INFO    03…1a-channeld-chan#140282: Peer connection lost
2022-10-07T23:59:34.883Z INFO    03…1a-chan#140282: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-10-07T23:59:34.901Z DEBUG   03…1a-lightningd: peer_disconnect_done
2022-10-07T23:59:34.901Z DEBUG   03…1a-lightningd: Will try reconnect in 300 seconds
2022-10-07T23:59:34.913Z DEBUG   plugin-funder: Cleaning up inflights for peer id 03…1a

2022-10-08T00:03:05.355Z DEBUG   03…1a-connectd: Failed connected out: Unable to connect, no address known for peer
2022-10-08T00:03:05.366Z DEBUG   03…1a-lightningd: Will try reconnect in 300 seconds
2022-10-08T00:03:28.470Z DEBUG   03…1a-connectd: Failed connected out: All addresses failed: [fe80::…c]:38270: Connection establishment: Invalid argument.
2022-10-08T00:03:28.482Z DEBUG   03…1a-lightningd: Will try reconnect in 300 seconds
2022-10-08T00:04:35.129Z DEBUG   03…1a-connectd: Failed connected out: Unable to connect, no address known for peer
2022-10-08T00:04:35.148Z DEBUG   03…1a-lightningd: Will try reconnect in 300 seconds
2022-10-08T00:08:06.774Z DEBUG   03…1a-connectd: Failed connected out: Unable to connect, no address known for peer
2022-10-08T00:08:06.787Z DEBUG   03…1a-lightningd: Will try reconnect in 300 seconds
2022-10-08T00:08:29.831Z DEBUG   03…1a-connectd: Failed connected out: All addresses failed: [fe80::…c]:38270: Connection establishment: Invalid argument.
2022-10-08T00:08:29.832Z DEBUG   03…1a-lightningd: Will try reconnect in 300 seconds
2022-10-08T00:09:38.287Z DEBUG   03…1a-connectd: Failed connected out: Unable to connect, no address known for peer
2022-10-08T00:09:38.289Z DEBUG   03…1a-lightningd: Will try reconnect in 300 seconds

My assumption is that there is no limit to the number of these recurring reconnection schedules that can pile up.

If I leave the mobile node connected, then for as long as it is connected, there is no reconnection chatter in the log. Apparently the recurring reconnection tasks do indeed acquiesce if the peer is connected at their scheduled activation time. I left my mobile node connected for longer than five minutes, to ensure that all recurring scheduled tasks would have a chance to fire while it was connected, and then I disconnected it, and now I am seeing reconnection attempts only every 5 minutes.

Curiously, it appears that only the "virgin" reconnection task tries the remembered (link-local) address. The subsequent tasks that were created upon a disconnection event don't know any addresses (and of course can't find any in the gossip store since this is an unpublished peer).

`getinfo` output

To avoid irreversibly migrating my database schema, I am running v0.11.2 with a large series of cherry-picked backports from master. It's mostly everything except channel aliases, zeroconf, bookkeeper, commando, the new autoclean expansions, and anything else requiring a database schema change.

Oct 08 '22 01:10 whitslack

Scheduled reconnection attempts stack up, presumably unbounded

Issue and Steps to Reproduce

getinfo output

`getinfo` output