lnd icon indicating copy to clipboard operation
lnd copied to clipboard

lnd force-closes in response to warning messages

Open TheBlueMatt opened this issue 2 years ago • 18 comments

A while back we added warning messages to the spec with the purpose of having a way to signal to nodes that something has gone wrong but that we do not want to force close any channels.

We started using warning messages to tell peers when we ignored a gossip message because it was bogus, with the message "Unreadable/bogus gossip message".

However, it looks like lnd interprets warning messages as critical errors and responds by force-closing channels - ignoring the entire purpose of warning messages?

A peer of mine pulled this from their log:

2022-08-06 12:14:48.864 [ERR] HSWC: ChannelLink(0db90ff58728d7698dbf9e478eca33109b9777898acf28b08bd9352a75a7a74c:0): failing link: ChannelPoint(0db90ff58728d7698dbf9e478eca33109b9777898acf28b08bd9352a75a7a74c:0): received error from peer: chan_id=0000000000000000000000000000000000000000000000000000000000000000, err=Unreadable/bogus gossip message with error: remote error

On my end I just received an error message that said "remote error".

TheBlueMatt avatar Aug 06 '22 18:08 TheBlueMatt

I've now seen this twice, the second time the peer was running "master... something between 0.15 and 0.15.1 115b0418747ba941875bd9f2be1bb18754f6f636"

TheBlueMatt avatar Aug 06 '22 18:08 TheBlueMatt

I'm said peer. Let me know if you need further information.

C-Otto avatar Aug 06 '22 19:08 C-Otto

@TheBlueMatt just checking: the type for those messages is set to 1?

positiveblue avatar Aug 06 '22 20:08 positiveblue

Yes, the type bytes written on the wire is pulled directly from https://github.com/lightningdevkit/rust-lightning/blob/main/lightning/src/ln/wire.rs#L308

TheBlueMatt avatar Aug 06 '22 21:08 TheBlueMatt

I will try to replicate it locally but for the logs it looks like the message was really parsed as error and not a warning @C-Otto do you do a quick search to see if you have any logs matching received warning message from peer for this or any other peer?

positiveblue avatar Aug 06 '22 21:08 positiveblue

I can't find a single entry in the recent logs (~1 month).

C-Otto avatar Aug 06 '22 21:08 C-Otto

I tried to replicate it, unsuccessfully, in the main branch + added a test for it #6805

I noticed that it is tagged for v0.16.0 BUT if @C-Otto is running master it should include those changes. I am not sure what is the process for tagging a version (v0.15.1 for example) that not includes the latest changes in master though (maybe we manually cherry pick the commits that need to be included?).

@Roasbeef https://github.com/lightningnetwork/lnd/commit/115b0418747ba941875bd9f2be1bb18754f6f636 should include those changes right?

positiveblue avatar Aug 07 '22 02:08 positiveblue

I guess you mean 99828b8ebb72e18be69a7870d5342d20a748412c? That's included in 115b0418747ba941875bd9f2be1bb18754f6f636.

C-Otto avatar Aug 07 '22 08:08 C-Otto

I can't find a single entry in the recent logs (~1 month).

I just realized that I've only started using the new code yesterday, which also means that older logs cannot contain this new log entry.

C-Otto avatar Aug 07 '22 09:08 C-Otto

Did you started using the new code yesterday AFTER the force close? If so this solves the case. Next warning message should not trigger any force close/channel failure.

positiveblue avatar Aug 07 '22 17:08 positiveblue

Did you started using the new code yesterday AFTER the force close? If so this solves the case. Next warning message should not trigger any force close/channel failure.

That thought also crossed my mind :) But no, I triple checked. I first started with the new code, then the quoted log message appeared.

C-Otto avatar Aug 07 '22 18:08 C-Otto

I'm not sure how any old code should have caused this, either? The message here is odd, which the spec has always mandated must be ignored if they aren't understood, so whether you understand a warning message or not shouldn't change behavior at all.

Note that I've seen something that may have been the same issue recently as well, though I wasn't able to get in contact with the operator of that node.

TheBlueMatt avatar Aug 07 '22 18:08 TheBlueMatt

We started to understand warning messages in this PR which is unreleased (only exists in master). There should be a different log for nodes on master, and nodes on 0.15 and before should just be ignoring the message all together. So I think we need to set up a repro case here.

This PR seems to indicate we properly won't do anything when we get the message, so perhaps something else is going on here.

@positiveblue

I am not sure what is the process for tagging a version (v0.15.1 for example) that not includes the latest changes in master though

We always branch minor release off that base branch. So 0.15.1 will only take in changes in that milestone, so it won't include us reading the warning message as an example.

Roasbeef avatar Aug 07 '22 21:08 Roasbeef

Hmm, so for the peer in question here, the only non-ping/pong log lines for c-otto's node I have within two minutes leading up to the force close are as follows, note that this peer does not log gossip messages to avoid filling disk, but logs all other sent & received messages.

2022-08-06 10:14:48.796 TRACE [lightning::ln::peer_handler:851] Enqueueing message WarningMessage { channel_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], data: "Unreadable/bogus gossip message" } to 027ce055380348d7812d2ae7745701c9f93e70c1adeb2657f053f91df4f2843c71
2022-08-06 10:14:48.910 TRACE [lightning::ln::peer_handler:1135] Received message Error(ErrorMessage { channel_id: [76, 167, 167, 117, 42, 53, 217, 139, 176, 40, 207, 138, 137, 119, 151, 155, 16, 51, 202, 142, 71, 158, 191, 141, 105, 215, 40, 135, 245, 15, 185, 13], data: "remote error" }) from 027ce055380348d7812d2ae7745701c9f93e70c1adeb2657f053f91df4f2843c71
2022-08-06 10:14:48.910 DEBUG [lightning::ln::peer_handler:1158] Got Err message from 027ce055380348d7812d2ae7745701c9f93e70c1adeb2657f053f91df4f2843c71: remote error
2022-08-06 10:14:48.910 ERROR [lightning::ln::channelmanager:1981] Force-closing channel 4ca7a7752a35d98bb028cf8a8977979b1033ca8e479ebf8d69d72887f50fb90d

Similarly, the other time I saw a similar failure happen I have this in my log:

2022-07-26 10:26:48.658 TRACE [lightning::ln::peer_handler:851] Enqueueing message WarningMessage { channel_id: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], data: "Unreadable/bogus gossip message" } to 0338f87cb05016c9427de7872192615f9313d622db1a88f6c2594625ffd0b2d270
2022-07-26 10:26:48.886 TRACE [lightning::ln::peer_handler:1135] Received message Error(ErrorMessage { channel_id: [147, 212, 130, 47, 32, 192, 217, 174, 164, 175, 122, 226, 84, 185, 254, 64, 111, 179, 176, 82, 182, 186, 91, 53, 214, 103, 237, 40, 42, 107, 175, 228], data: "remote error" }) from 0338f87cb05016c9427de7872192615f9313d622db1a88f6c2594625ffd0b2d270
2022-07-26 10:26:48.886 DEBUG [lightning::ln::peer_handler:1158] Got Err message from 0338f87cb05016c9427de7872192615f9313d622db1a88f6c2594625ffd0b2d270: remote error
2022-07-26 10:26:48.886 ERROR [lightning::ln::channelmanager:1981] Force-closing channel 93d4822f20c0d9aea4af7ae254b9fe406fb3b052b6ba5b35d667ed282a6bafe4

TheBlueMatt avatar Aug 07 '22 23:08 TheBlueMatt

The problem is that handleError converts the Warning to an Error message and passes it to every link since it's the all-zero connection id:

handleError: https://github.com/lightningnetwork/lnd/blob/6060e05d7cbba138b91051529d8ffbfc1672ed54/peer/brontide.go#L1535

all-zero: https://github.com/lightningnetwork/lnd/blob/6060e05d7cbba138b91051529d8ffbfc1672ed54/peer/brontide.go#L1684-L1688

Crypt-iQ avatar Aug 08 '22 15:08 Crypt-iQ

@Crypt-iQ where is it converting there before it passes the message? It converts to check what channel is being targeted, but nextMsg (which should still be lnwire.Warning is what's passed to the link): https://github.com/lightningnetwork/lnd/blob/master/peer/brontide.go#L1607-L1609

https://github.com/lightningnetwork/lnd/blob/master/peer/brontide.go#L1437-L1443

EDIT: I see now it'll just send it directly there, which seems is the root issue here. I think we can resolve that by adding a slime interface with an IsConnectionWide() method and also one to obtain the target chan ID (or just that), then we can properly pass along the proper type.

Roasbeef avatar Aug 08 '22 19:08 Roasbeef

@TheBlueMatt The reason y'all send the warning is also pretty interesting ("Unreadable/bogus gossip message"), might deserve another issue in the tracker. Is is that the gossip message is invalid from y'alls PoV, or it has like unknown types or something?

Roasbeef avatar Aug 08 '22 19:08 Roasbeef

It means its outright invalid. I haven't dug into why, however, its possible its just things with the hostname field that we decide are invalid, given that's the big change we made, and its a super super tiny fraction of gossip, so I haven't bothered. I've seen the messages go out a handful of times.

TheBlueMatt avatar Aug 08 '22 19:08 TheBlueMatt