lnd icon indicating copy to clipboard operation
lnd copied to clipboard

Channel gets disabled but connection with peer is still alive.

Open hazrulnizam opened this issue 4 years ago • 11 comments

Background

Some channels get disabled but the connections to the channel peers remain active. Network sees that the channel is disabled from one side (my side) but is active from the peer side. They remain this way until I manually disconnect the peer.

Your environment

  • lnd v0.14.1-beta
  • Centos 8 4.18.0-305.19.1.el8_4.x86_64
  • bitcoind 22.0 running on a remote node

Steps to reproduce

Unable to reproduce on demand. But it has been happening for the past few weeks over multiple service restarts.

All the channels involved exhibit the same behaviour in the logs:

  1. ChannelLink fails and exits after attempt to update commit fee
  2. 22 minutes later channel is announced as disabled
  3. Connection with peer remains active, GossipSyncer still works on the peer.
  4. Network sees channel as disabled:true from my side but disabled:false from the peer side.
  5. This remains until I then manually disconnect the peer, which then causes the new incoming connection to fix everything.

Here are some logs to show the behaviour

Channel 1:

2021-12-16 17:46:16.695 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): no more chans to query
2021-12-16 20:04:06.922 [INF] HSWC: ChannelLink(eaddf2d59d74e4f9d9450241772c9180516d6d6a478cb5e23cc48a33e065b33a:1): updating commit fee to 1161 sat/kw sat/kw
2021-12-16 20:25:42.874 [INF] HSWC: ChannelLink(eaddf2d59d74e4f9d9450241772c9180516d6d6a478cb5e23cc48a33e065b33a:1): updating commit fee to 1161 sat/kw sat/kw
2021-12-16 20:26:42.874 [ERR] HSWC: ChannelLink(eaddf2d59d74e4f9d9450241772c9180516d6d6a478cb5e23cc48a33e065b33a:1): failing link: unable to complete dance with error: remote unresponsive
2021-12-16 20:26:42.875 [INF] HSWC: ChannelLink(eaddf2d59d74e4f9d9450241772c9180516d6d6a478cb5e23cc48a33e065b33a:1): exited
2021-12-16 20:26:42.875 [INF] HSWC: ChannelLink(eaddf2d59d74e4f9d9450241772c9180516d6d6a478cb5e23cc48a33e065b33a:1): stopping
2021-12-16 20:47:22.690 [INF] NANN: Announcing channel(eaddf2d59d74e4f9d9450241772c9180516d6d6a478cb5e23cc48a33e065b33a:1) disabled [detected]
2021-12-16 23:44:35.385 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): requesting new chans from height=0 and 714409 blocks after
2021-12-17 07:42:24.013 [WRN] CHBU: Replacing disk backup for ChannelPoint(eaddf2d59d74e4f9d9450241772c9180516d6d6a478cb5e23cc48a33e065b33a:1) w/ newer version
2021-12-17 07:48:03.949 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): buffering chan range reply of size=8000
2021-12-17 07:48:03.949 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): buffering chan range reply of size=8000
2021-12-17 07:48:03.949 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): buffering chan range reply of size=7999
2021-12-17 07:48:03.949 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): buffering chan range reply of size=8000
2021-12-17 07:48:03.949 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): buffering chan range reply of size=7999
2021-12-17 07:48:03.949 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): buffering chan range reply of size=7998
2021-12-17 07:48:03.950 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): buffering chan range reply of size=8000
2021-12-17 07:48:03.950 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): buffering chan range reply of size=7999
2021-12-17 07:48:03.950 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): buffering chan range reply of size=8000
2021-12-17 07:48:03.950 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): buffering chan range reply of size=7944
2021-12-17 07:48:03.951 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): buffering chan range reply of size=2148
2021-12-17 07:48:03.951 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): filtering through 82087 chans
2021-12-17 07:48:04.036 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): starting query for 23 new chans
2021-12-17 07:48:04.036 [INF] DISC: GossipSyncer(0272dfe15e1f7781e46fc485249d65af013c092e5a1c474d9ede2a6e7e5624c09c): querying for 23 new channels

Channel 2:

2021-12-16 01:43:17.938 [INF] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:1): updating commit fee to 1053 sat/kw sat/kw
2021-12-16 02:16:39.872 [INF] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:1): updating commit fee to 1110 sat/kw sat/kw
2021-12-16 02:17:39.873 [ERR] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:1): failing link: unable to complete dance with error: remote unresponsive
2021-12-16 02:17:39.873 [INF] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:1): exited
2021-12-16 02:17:39.873 [INF] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:1): stopping
2021-12-16 02:38:22.687 [INF] NANN: Announcing channel(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:1) disabled [detected]
2021-12-16 03:09:05.672 [INF] DISC: GossipSyncer(03cb29720b97a67c8b21de36492e45ed28e25b06a04d880b1cbd5fc77c7e0ca782): applying new update horizon: start=2021-12-16 03:09:03 +0800 +08, end=2158-01-22 09:37:18 +0800 +08, backlog_size=0
2021-12-16 03:29:03.971 [INF] DISC: GossipSyncer(03cb29720b97a67c8b21de36492e45ed28e25b06a04d880b1cbd5fc77c7e0ca782): applying new update horizon: start=2042-07-14 09:04:00 +0800 +08, end=2042-07-14 09:04:00 +0800 +08, backlog_size=0

Channel 3 (with manual disconnect at the end):

2021-12-16 05:32:38.204 [INF] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0): updating commit fee to 1561 sat/kw sat/kw
2021-12-16 05:33:38.204 [ERR] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0): failing link: unable to complete dance with error: remote unresponsive
2021-12-16 05:33:38.205 [INF] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0): exited
2021-12-16 05:33:38.206 [INF] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0): stopping
2021-12-16 05:54:22.686 [INF] NANN: Announcing channel(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0) disabled [detected]
2021-12-16 07:38:10.008 [INF] DISC: GossipSyncer(03ee951b50d5dc4631bfbde250e7fd2a119a108b7ba3d62079574c584e536563ad): buffering chan range reply of size=7999
2021-12-16 07:38:10.008 [INF] DISC: GossipSyncer(03ee951b50d5dc4631bfbde250e7fd2a119a108b7ba3d62079574c584e536563ad): buffering chan range reply of size=8000
2021-12-16 07:38:10.008 [INF] DISC: GossipSyncer(03ee951b50d5dc4631bfbde250e7fd2a119a108b7ba3d62079574c584e536563ad): buffering chan range reply of size=7999

....

2021-12-17 08:29:28.841 [INF] SRVR: Disconnecting from 03ee951b50d5dc4631bfbde250e7fd2a119a108b7ba3d62079574c584e536563ad@185.220.100.248:19106
2021-12-17 08:29:28.841 [INF] PEER: disconnecting 03ee951b50d5dc4631bfbde250e7fd2a119a108b7ba3d62079574c584e536563ad@185.220.100.248:19106, reason: server: disconnecting peer 03ee951b50d5dc4631bfbde250e7fd2a119a108b7ba3d62079574c584e536563ad@185.220.100.248:19106
2021-12-17 08:29:28.943 [INF] DISC: Removing GossipSyncer for peer=03ee951b50d5dc4631bfbde250e7fd2a119a108b7ba3d62079574c584e536563ad
2021-12-17 08:29:32.337 [INF] SRVR: Finalizing connection to 03ee951b50d5dc4631bfbde250e7fd2a119a108b7ba3d62079574c584e536563ad@209.127.17.234:48042, inbound=true
2021-12-17 08:29:32.573 [INF] PEER: NodeKey(03ee951b50d5dc4631bfbde250e7fd2a119a108b7ba3d62079574c584e536563ad) loading ChannelPoint(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0)
2021-12-17 08:29:32.574 [INF] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0): starting
2021-12-17 08:29:32.574 [INF] CNCT: Attempting to update ContractSignals for ChannelPoint(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0)
2021-12-17 08:29:32.574 [INF] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0): HTLC manager started, bandwidth=96316000 mSAT
2021-12-17 08:29:32.574 [INF] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0): attempting to re-synchronize
2021-12-17 08:29:32.574 [INF] PEER: Negotiated chan series queries with 03ee951b50d5dc4631bfbde250e7fd2a119a108b7ba3d62079574c584e536563ad
2021-12-17 08:29:32.575 [INF] DISC: Creating new GossipSyncer for peer=03ee951b50d5dc4631bfbde250e7fd2a119a108b7ba3d62079574c584e536563ad
2021-12-17 08:29:33.539 [INF] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0): received re-establishment message from remote side
2021-12-17 08:42:53.698 [INF] DISC: GossipSyncer(03ee951b50d5dc4631bfbde250e7fd2a119a108b7ba3d62079574c584e536563ad): applying new update horizon: start=2021-12-17 08:42:53 +0800 +08, end=2158-01-23 15:11:08 +0800 +08, backlog_size=0
2021-12-17 08:48:18.402 [INF] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0): updating commit fee to 1991 sat/kw sat/kw
2021-12-17 08:48:32.575 [INF] NANN: Announcing channel(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0) enabled

Expected behaviour

Before disabling a channel, check whether the peer is truly offline and unreachable before finalising the channel disable. Maybe attempt automated disconnect/reconnect to try re-enabling the channel.

Actual behaviour

Channel is announced as disabled, and lnd just waits for a Channel Reestablish message from peer (which never comes because peer has an active connection and thinks channel is still OK).

I will have to manually disconnect the peer to initiate a new fresh connection which enables the channel.

hazrulnizam avatar Dec 17 '21 01:12 hazrulnizam

2021-12-16 02:17:39.873 [ERR] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:1): failing link: unable to complete dance with error: remote unresponsive

This gets printed out when the remote party doesn't respond to a signature or revocation. So in this case, we disconnect, and then wait for them to connect again, or fall back to our normal exponential back off.

2021-12-16 05:33:38.204 [ERR] HSWC: ChannelLink(bac1e5b2ef4745e30c0aa16e7a298590674297de17ad7f3addf0668d0a7eff22:0): failing link: unable to complete dance with error: remote unresponsive

Here is is again with that same peer.

So AFAICT, that peer doesn't respond to us trying to update the channel, we then disconnect to try and cycle things out. Once we connect, they're never stable enough for us to enable again.

Roasbeef avatar Dec 18 '21 02:12 Roasbeef

Those two messages are with two different peers. They have the same funding txid because I used batch channel opening. If you see carefully the output index is different.

So in this case, we disconnect, and then wait for them to connect again, or fall back to our normal exponential back off.

But the disconnect did not happen? Connection with peer is still there, as evidenced by the still functioning GossipSyncer. If I do a lncli listpeers at that point, that peer will be in the list of connected peers. I have to manually disconnect to cycle things out.

So AFAICT, that peer doesn't respond to us trying to update the channel, we then disconnect to try and cycle things out. Once we connect, they're never stable enough for us to enable again.

The reason I am raising this issue is that the disconnect you are saying should happen, does not happen.

hazrulnizam avatar Dec 18 '21 03:12 hazrulnizam

Okay, found the culprit. I had the setting db.batch-commit-interval set to 5000ms in the config. Once I commented this out, the issue no longer happens (I waited a few days to see).

I do not see how it should change any behavior though, seems to me the setting is totally unrelated to how lnd should manage peer unresponsiveness.

hazrulnizam avatar Dec 21 '21 14:12 hazrulnizam

I have db.batch-commit-interval commented and I also see this with a couple of peers (lnmarkets.com and Bitrefill Routing). Same symptoms:

Jan 05 15:01:36 fullnode lnd[48205]: 2022-01-05 15:01:36.502 [ERR] HSWC: ChannelLink(e901563235eaa59a022adc7e049190daf30682460cf5365eccb86ce350ef04cc:1): failing link: unable to complete dance with error: remote unresponsive
Jan 05 15:01:36 fullnode lnd[48205]: 2022-01-05 15:01:36.502 [INF] HSWC: ChannelLink(e901563235eaa59a022adc7e049190daf30682460cf5365eccb86ce350ef04cc:1): exited
Jan 05 15:01:36 fullnode lnd[48205]: 2022-01-05 15:01:36.502 [INF] HSWC: Removing channel link with ChannelID(cc04ef50e36cb8cc5e36f50c468206f3da9091047edc2a029aa5ea35325601e8)
Jan 05 15:01:36 fullnode lnd[48205]: 2022-01-05 15:01:36.503 [INF] HSWC: ChannelLink(e901563235eaa59a022adc7e049190daf30682460cf5365eccb86ce350ef04cc:1): stopping

The channel is disabled 22 minutes later:

Jan 05 15:23:13 fullnode lnd[48205]: 2022-01-05 15:23:13.139 [INF] NANN: Announcing channel(e901563235eaa59a022adc7e049190daf30682460cf5365eccb86ce350ef04cc:1) disabled [detected]

The peer stays connected but the channel is seen as inactive, oftentimes with pending HTLCs in it. Disconnecting the peer manually causes reconnect and renegotiation, after that the channel works again. But it may stay in this state for hours until I notice which is obviously bad. I realize it's a problem on their end and not mine but I expect lnd to retry connection instead of giving up instantly. It also happens randomly.

rkfg avatar Jan 05 '22 13:01 rkfg

Okay, found the culprit. I had the setting db.batch-commit-interval set to 5000ms in the config. Once I commented this out, the issue no longer happens (I waited a few days to see).

If this was set to a high value, and you had a lot of incoming gossip traffic, then it would cause goroutines to block at times for 5+ seconds. At that point it's possible the other end hits a deadline w.r.t the write timeout, causing the peer to disconnect.

Jan 05 15:01:36 fullnode lnd[48205]: 2022-01-05 15:01:36.502 [ERR] HSWC: ChannelLink(e901563235eaa59a022adc7e049190daf30682460cf5365eccb86ce350ef04cc:1): failing link: unable to complete dance with error: remote unresponsive

I think what needs to happen here is that we also tear down the connection, which would effectively cause a reconnect.

We have a few other bug fixes in this area for persistent connections, as well as this comprehensive issue report, which should help narrow things down.

Roasbeef avatar Jan 05 '22 23:01 Roasbeef

Closing this for now as the root issue was discovered, and we have other in flight PRs (as well as merged changes to be released as part of 0.14.2) that target known issues in the p2p code.

Roasbeef avatar Jan 19 '22 00:01 Roasbeef

Since it still keeps happening here's a temporary script to restore connectivity:

#!/bin/sh
lncli listchannels --inactive_only --public_only | jq -r '.channels[].remote_pubkey' | sort -u | (
  while read rpk
  do
        lncli disconnect $rpk >/dev/null 2>&1 && echo "Node $rpk recovered"
  done
)
exit 0

It simply disconnects all peers you have inactive channels with which should trigger a reconnect. Put it to cron so it runs every 10 minutes or so, since channels are disabled after 22 minutes of inactivity by default this should prevent disabling them.

rkfg avatar Feb 12 '22 11:02 rkfg

Was this actually fixed? According to my logs it wasn't, the script above prints the message about a recovered node. It only does so if the disconnect command succeeded (returned 0) and it only tries to disconnect peers of inactive channels. If a peer is really offline this commands returns an error and the message isn't printed.

Trying 0.15.99-beta (9d04b0c3d9577b2ab8c253e140220e80181619b8) currently to see if something changed since 0.15.0 where it happened every day multiple times, still I'd be glad for a clarification.

EDIT: investigating further, might be a race condition when the peer connects precisely after the iteration starts.

rkfg avatar Sep 04 '22 16:09 rkfg

Yes, still happens, confirmed.

Sep 06 22:16:33 fullnode lnd[67749]: 2022-09-06 22:16:33.978 [ERR] HSWC: ChannelLink(5093d67a7a3b875879fb8ec7a5af2dba64cc475f84990e8d417460d30b0d8334:1): failing link: unable to complete dance with error: remote unresponsive
Sep 06 22:16:33 fullnode lnd[67749]: 2022-09-06 22:16:33.979 [INF] HSWC: ChannelLink(5093d67a7a3b875879fb8ec7a5af2dba64cc475f84990e8d417460d30b0d8334:1): exited
Sep 06 22:16:33 fullnode lnd[67749]: 2022-09-06 22:16:33.983 [INF] HSWC: ChannelLink(5093d67a7a3b875879fb8ec7a5af2dba64cc475f84990e8d417460d30b0d8334:1): stopping
Sep 06 22:16:33 fullnode lnd[67749]: 2022-09-06 22:16:33.985 [INF] HSWC: Removing channel link with ChannelID(34830d0bd36074418d0e99845f47cc64ba2dafa5c78efb7958873b7a7ad69351)
Sep 06 22:17:08 fullnode lnd[67749]: 2022-09-06 22:17:08.977 [INF] DISC: Broadcasting 237 new announcements in 17 sub batches
Sep 06 22:17:12 fullnode lnd[67749]: 2022-09-06 22:17:12.923 [INF] CRTR: Processed channels=0 updates=155 nodes=9 in last 1m0.00047007s
Sep 06 22:18:12 fullnode lnd[67749]: 2022-09-06 22:18:12.922 [INF] CRTR: Processed channels=0 updates=149 nodes=12 in last 59.999107883s
Sep 06 22:18:39 fullnode lnd[67749]: 2022-09-06 22:18:38.977 [INF] DISC: Broadcasting 225 new announcements in 18 sub batches
Sep 06 22:19:08 fullnode lnd[67749]: 2022-09-06 22:19:08.976 [INF] DISC: GossipSyncer(03b3007e8230bbb7542bd697846b26736dc8f9c117f3a8e895247354c773fd66b8): applying gossipFilter(start=0001-01-01 00:00:00 +0000 UTC, end=0001-01-01 00:00:00 +0000 UTC)
Sep 06 22:19:08 fullnode lnd[67749]: 2022-09-06 22:19:08.977 [INF] DISC: GossipSyncer(021607cfce19a4c5e7e6e738663dfafbbbac262e4ff76c2c9b30dbeefc35c00643): applying gossipFilter(start=2022-09-06 22:19:08.977166686 +0300 MSK m=+184867.212094278, end=2158-10-14 04:47:23.977166686 +0300 MSK)
Sep 06 22:19:12 fullnode lnd[67749]: 2022-09-06 22:19:12.922 [INF] CRTR: Processed channels=0 updates=120 nodes=4 in last 1m0.000126167s
Sep 06 22:20:08 fullnode lnd[67749]: 2022-09-06 22:20:08.977 [INF] DISC: Broadcasting 188 new announcements in 18 sub batches
Sep 06 22:20:11 fullnode lnd[67749]: 2022-09-06 22:20:11.411 [INF] SRVR: Disconnecting from 03154d3bbf93e06dedc34f41c5096c436849b1470210f4a30151093437d9173ce7@185.220.101.183:31308
Sep 06 22:20:11 fullnode lnd[67749]: 2022-09-06 22:20:11.411 [INF] PEER: Peer(03154d3bbf93e06dedc34f41c5096c436849b1470210f4a30151093437d9173ce7): disconnecting 03154d3bbf93e06dedc34f41c5096c436849b1470210f4a30151093437d9173ce7@185.220.101.183:31308, reason: server: DisconnectPeer called
Sep 06 22:20:11 fullnode lnd[67749]: 2022-09-06 22:20:11.412 [INF] PEER: Peer(03154d3bbf93e06dedc34f41c5096c436849b1470210f4a30151093437d9173ce7): unable to read message from peer: read tcp 10.40.0.2:9735->185.220.101.183:31308: use of closed network connection
Sep 06 22:20:11 fullnode lnd[67749]: 2022-09-06 22:20:11.412 [INF] NTFN: Cancelling epoch notification, epoch_id=287
Sep 06 22:20:11 fullnode lnd[67749]: 2022-09-06 22:20:11.612 [INF] DISC: Removing GossipSyncer for peer=03154d3bbf93e06dedc34f41c5096c436849b1470210f4a30151093437d9173ce7
Sep 06 22:20:12 fullnode lnd[67749]: 2022-09-06 22:20:12.922 [INF] CRTR: Processed channels=0 updates=128 nodes=6 in last 59.999435532s
Sep 06 22:20:13 fullnode lnd[67749]: 2022-09-06 22:20:13.671 [INF] SRVR: New inbound connection from 185.220.101.58:22820
Sep 06 22:20:13 fullnode lnd[67749]: 2022-09-06 22:20:13.672 [INF] SRVR: Finalizing connection to 03154d3bbf93e06dedc34f41c5096c436849b1470210f4a30151093437d9173ce7@185.220.101.58:22820, inbound=true
Sep 06 22:20:13 fullnode lnd[67749]: 2022-09-06 22:20:13.689 [INF] PEER: Peer(03154d3bbf93e06dedc34f41c5096c436849b1470210f4a30151093437d9173ce7): loading ChannelPoint(5093d67a7a3b875879fb8ec7a5af2dba64cc475f84990e8d417460d30b0d8334:1)
Sep 06 22:20:13 fullnode lnd[67749]: 2022-09-06 22:20:13.698 [INF] HSWC: ChannelLink(5093d67a7a3b875879fb8ec7a5af2dba64cc475f84990e8d417460d30b0d8334:1): starting

Notice failing link: unable to complete dance with error: remote unresponsive and then no disconnect until it was disconnected by the command from the script (reason: server: DisconnectPeer called), after that the peer quickly reconnected and worked fine. Can we reopen this issue since it's still not resolved and it's a very serious issue because without external scripts like mine the peer can stay connected but inactive for indefinite time. The channel becomes unusable because lnd can't reconnect by itself!

rkfg avatar Sep 07 '22 00:09 rkfg

Yes, I can confirm that this still happens on my node as well. I hope they can re-open this issue as well. Or if not, maybe you can open a new one, and mention that this one was closed without fully resolving it?

hazrulnizam avatar Sep 07 '22 00:09 hazrulnizam

@Roasbeef can we reopen this please?

rkfg avatar Sep 14 '22 21:09 rkfg

To summarize the worst case here:

  • The peer is unresponsive when they owe us a RevokeAndAck and we are trying to send a new CommitSig
  • The link shuts down after a timeout has passed
  • The peer connection is still alive and if a reconnect never occurs, the link will stay down.
  • This leads to force close because if the p2p connection is never refreshed, the link will never receive update_fail_htlc/update_fulfill_htlc to resolve a lingering HTLC

Crypt-iQ avatar Mar 08 '23 18:03 Crypt-iQ

This probably deserves a 16.3 tag or similar, no?

TheBlueMatt avatar May 18 '23 13:05 TheBlueMatt