
Nodes shouldn't publish their commitment when receiving outdated `channel_reestablish`

Open t-bast opened this issue 4 years ago • 22 comments

This was discussed during the spec meeting and is weakly linked to #932

Let's imagine we have two nodes Alice and Bob. Alice takes her node offline, does weird stuff with her DB (e.g. a migration from sqlite to postgres) then comes back online. She sends a channel_reestablish to Bob with outdated values (because she messed up her migration).

Expected behavior

Bob detects that Alice is late, so Bob will likely need to publish his latest commitment to help Alice get her funds back. Bob waits for Alice to send an error before publishing his commitment.

When Alice receives Bob's channel_reestablish, she realizes she's late. She stops her node (without sending an error), figures out where she messed up in her migration, fixes her DB, restarts. Now she sends a channel_reestablish with the up-to-date value, so the channel can resume operating.
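The expected flow above can be sketched in Python (a hypothetical sketch; the function and return values are illustrative and not from any implementation):

```python
def classify_next_revocation(last_revoke_and_ack_sent: int,
                             next_revocation_number: int) -> str:
    """Classify a received channel_reestablish by its next_revocation_number."""
    expected = last_revoke_and_ack_sent + 1
    if next_revocation_number < expected:
        # Peer is behind: don't broadcast our commitment yet; wait for the
        # peer to either fix its DB and reconnect, or send an error.
        return "peer_behind_wait_for_error"
    if next_revocation_number == expected:
        return "resume_channel"
    # We are the ones behind: handled by the option_data_loss_protect path.
    return "we_are_behind"
```

In the scenario above, Bob lands in the first branch and holds off on broadcasting until Alice either reconnects with correct values or sends an error.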

Non optimal behavior

Bob detects that Alice is late, so Bob publishes his latest commitment to help Alice. But now Alice lost her chance to fix her DB. If Alice is a big node with a ton of channels, she just lost a ton of money on on-chain fees...

Conclusion

Implementations should really wait to receive an error before publishing their commitment! @Roasbeef is that clearer than during the spec meeting?

EDIT: the spec was clarified in #942 to highlight that this is the desired behavior.

t-bast avatar Nov 08 '21 20:11 t-bast

OK, I looked at this for the latest c-lightning release. Changing behavior to not close when sending an error was a big change this close to release, but there is an alternative: send a warning instead of an error here:

BOLT 2:

  - otherwise:
    - if `next_revocation_number` is not equal to 1 greater
      than the commitment number of the last `revoke_and_ack` the
      receiving node has sent:
      - SHOULD fail the channel.

This works for this particular case, at least.

rustyrussell avatar Apr 01 '22 01:04 rustyrussell

There is another thing that is unclear to me: what does the recipient need to do when it receives the warning message?

Right now, the spec (with the warning message proposed by @rustyrussell ) makes it possible to know that something is wrong, but what about the following case:

  • Sender -> outdated channel_reestablish -> receiver
  • Sender <- warning <- receiver
  • Sender -> another outdated channel_reestablish -> receiver

In this case, the sender should have only two choices: either send the right channel_reestablish, or force-close the channel, right?

Because with the warning message we can fall into a ping-pong situation, where the sender keeps sending wrong data and the receiver keeps responding with warnings!

vincenzopalazzo avatar Apr 22 '22 18:04 vincenzopalazzo

That isn't something the spec should really concern itself with - the node that realized it's out of date has to do something, but what that is is up to the implementation - it can suggest the user search for a newer state on a drive somewhere, it can ask the peer to force-close (using an error message) or it can throw up its hands and light the cat on fire.

TheBlueMatt avatar Apr 22 '22 18:04 TheBlueMatt

OK, this is clear, but shouldn't the spec contain something to avoid the ping-pong scenario? In other words, can the sender send another wrong channel_reestablish? And in that case, should the receiver force-close the channel?

Or is this something that will never happen?

vincenzopalazzo avatar Apr 22 '22 18:04 vincenzopalazzo

After some reasoning over a discussion that @TheBlueMatt pointed out on LDK (https://github.com/lightningdevkit/rust-lightning/pull/1430#discussion_r860156878)

I think I expressed my idea poorly here, and I agree that all this behavior should not be specified in the spec. But in my opinion we should have a line that says "the sender should not send the wrong channel_reestablish multiple times", because we know that a node receiving a wrong channel_reestablish should respond with a channel_reestablish containing up-to-date data.

What is the benefit of this addition? That we know when a node is not spec-compliant, and we can detect this with a test in lnprototest. I mean, what is the benefit of repeatedly receiving the same wrong channel_reestablish?

On the other hand, this is possible from a software point of view: the sender may not account for this case, ignore the updated data received in the peer's channel_reestablish, and (due to possible DB data loss on its side) believe the data it sends is perfectly sane.

vincenzopalazzo avatar Apr 29 '22 08:04 vincenzopalazzo

Shifting the conversation over here from implementation issue trackers: https://github.com/ElementsProject/lightning/issues/5876#issuecomment-1375138059

What do y'all think about that proposed upgrade mechanism? In short, an optional feature bit that signals both nodes know of and prefer the new behavior (as it's a breaking change from the PoV of other nodes). This way old nodes that were deployed when the existing behavior was considered standard can still use their disaster recovery tools to prevent loss of funds.

Roasbeef avatar Jan 09 '23 06:01 Roasbeef

The reason this is currently just an issue and not a PR is that the spec doesn't require implementations to force-close when receiving an outdated channel_reestablish, which is why implementations differ in this regard: eclair does not force-close until it receives an error and never has, so it wouldn't be a breaking change from its point of view, and it thus never was a mechanism nodes could rely on (even though apparently lnd did).

Adding a feature bit for that is IMHO unnecessary complexity, that eclair wouldn't bother to implement. From my point of view this is simply a bug in old lnd clients, so if they want to see this fixed they should upgrade their node. We've seen that people do upgrade their nodes easily for important bugs, so I wouldn't add implementation constraints on everyone to work around this.

I'll open a PR to clarify the spec though, the requirements around channel_reestablish are a bit hard to follow.

t-bast avatar Jan 09 '23 07:01 t-bast

Actually this was even already clarified in the spec:

A receiving node:
  - otherwise, if it supports `option_data_loss_protect`:
    - if `next_revocation_number` is greater than expected above, AND
    `your_last_per_commitment_secret` is correct for that
    `next_revocation_number` minus 1:
      - MUST NOT broadcast its commitment transaction.
      - SHOULD send an `error` to request the peer to fail the channel.
      - SHOULD store `my_current_per_commitment_point` to retrieve funds
        should the sending node broadcast its commitment transaction on-chain.

It was clarified by #942 more than a year ago.
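The quoted requirement, seen from the fallen-behind node, could be sketched like this (a non-authoritative Python sketch; the action names are made up for illustration):

```python
def on_reestablish_when_behind(secret_is_correct: bool) -> list:
    """Actions for a node that discovers it has fallen behind, per the
    option_data_loss_protect requirements quoted above."""
    if not secret_is_correct:
        # your_last_per_commitment_secret doesn't check out: something is
        # wrong beyond simple staleness, so error and fail the channel.
        return ["send_error_and_fail"]
    return [
        # Never broadcast our own (revoked) commitment transaction.
        "do_not_broadcast",
        # Ask the peer to fail the channel on our behalf.
        "send_error",
        # Keep the peer's point so we can sweep our funds once their
        # commitment transaction confirms on-chain.
        "store_my_current_per_commitment_point",
    ]
```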

t-bast avatar Jan 09 '23 07:01 t-bast

I also opened an issue on lnd more than a year ago, which has been completely ignored: https://github.com/lightningnetwork/lnd/issues/6017

t-bast avatar Jan 09 '23 08:01 t-bast

Can't lnd also send an error message in addition to the bogus channel_reestablish? That would make it robust both with existing nodes and older nodes.

TheBlueMatt avatar Jan 09 '23 08:01 TheBlueMatt

I don't think that would achieve what laolu wants, IIUC he'd like updated nodes to keep force-closing channels with older nodes that don't send an error, to help them recover their funds without having to upgrade lnd.

t-bast avatar Jan 09 '23 09:01 t-bast

I guess I don't understand why that's a requirement? If you're closing using an SCB, your setup is hosed. Having to upgrade as a step in that process doesn't seem crazy given you're already tearing it all down anyway.

TheBlueMatt avatar Jan 09 '23 16:01 TheBlueMatt

I agree with that :+1:

t-bast avatar Jan 09 '23 16:01 t-bast

Can't lnd also send an error message in addition to the bogus channel_reestablish?

Oh yes it can for new nodes (easy enough to impl), however the older nodes won't observe that behavior as they're un-upgraded, and from their PoV there's been no sort of feature bit shift or feature negotiation.

If you're closing using an SCB, your setup is hosed. Having to upgrade as a step in that process doesn't seem crazy given you're already tearing it all down anyway.

I think that's quite an assumption (also ignores the possibility of breaking API changes or w/e in later versions), over time we can't expect users/developers/services/businesses to always 100% upgrade to the latest version. For lnd itself some companies/services maintain forks with custom stuff, so upgrading isn't a trivial matter as they may need to rebase on top of months of work. For now we'll likely need to provide a patch (ahead of the next major releases, likely back ported too) to all those that have their funds stuck in this situation as a short term workaround so they can recover their funds.

The main request here is that the new behavior be gated since old nodes are relying on the old behavior (generally so, and not just in this instance). The old behavior was to fail, the new behavior is to send an error (change in behavior). IMO this should be the standard way for any changes to prior behavior, though I know today we at times just sort of change behavior w/o any feature bit presence as we say: "ok everyone is going to do this now". Unfortunately that ignores the lagging nodes, and also smaller/older/slower moving implementations.

IMO without introducing proper signalling for breaking change (that might just seem like a spec clarification), we end up in a situation where the spec is a bit too "living" and all ppl can say is "I've implemented up to git commit ae4fb, but ignoring these commits of optional behavior".

Roasbeef avatar Jan 09 '23 20:01 Roasbeef

from the PoV there's been no sort of feature bit shift or feature negotiation.

FWIW, error messages have always been the spec-mandated way to close a channel and request your counterparty force-close. As @t-bast points out sending a channel_reestablish that's stale (or in the case of the lnd SCB entirely bogus - channels cannot have a zero index once funded) was never something the spec suggested.

Generally, I'm not really a fan of worrying about introducing a feature bit to indicate that something the spec already supported is now being done. Doing that means the spec is kinda worthless - now we're redefining the spec to "whatever nodes do in the field" and then mandating feature bits and the like whenever "what nodes do in the field" changes, even if the spec itself never supported or indicated it at all.

I think that's quite an assumption (also ignores the possibility of breaking API changes or w/e in later versions), over time we can't expect users/developers/services/businesses to always 100% upgrade to the latest version.

Oh, I totally agree, don't read my suggestion here as "users must always upgrade", but instead that "the first step of an SCB recovery should be to upgrade", which I think is totally reasonable. When you're doing something comparatively risky like an SCB recovery, ensuring you're on the latest code is probably generally good advice, no?

TheBlueMatt avatar Jan 09 '23 20:01 TheBlueMatt

Doing that means the spec is kinda worthless - now we're redefining the spec to "whatever nodes do in the field

That's basically how it is now (we go back and change behavior to match either what nodes do or what we think they should do), though I think we should spin that off to another discussion issue/ML post (more of a meta topic). The PR above is an example of that: it removed a fragment saying to fail the channel.

Going back on topic -- I went to go poke around our code along side a spec scan and saw this clause:

  - otherwise, if it supports `option_data_loss_protect`:
    - otherwise (`your_last_per_commitment_secret` or
      `my_current_per_commitment_point` do not match the expected values):
      - SHOULD send an `error` and fail the channel.

For any sort of SCB functionality, we'll actually purposefully send an invalid my_current_per_commitment_point. So then by this reading, it doesn't seem like there should have been an actual behavioral change: you should send the error if things don't match up (the heights), but should error and fail if the commitment point they should know about is invalid.

To make the scenario explicit:

  • Node is recovering, from a blank DB state.
  • They insert their SCB, from their PoV the current commitment height is: (0, 0)
  • They send a chan reest w/ next_commitment_number=1, next_revocation_number=0, along with an invalid my_current_per_commitment_point

From my reading this should bypass the "if next_revocation_number is greater than expected above" clause, and go to the section where it still prompts a node to fail the channel then error.

Do y'all also interpret this section as such?

EDIT: I also still see a few instances of "fail channel and send error", eg:

if next_revocation_number is not equal to 1 greater than the commitment number of the last revoke_and_ack the receiving node has sent:

SHOULD send an error and fail the channel.

Roasbeef avatar Jan 10 '23 01:01 Roasbeef

The requirements for channel_reestablish in the spec are still confusing: #942 only updated the requirements that explicitly mention they're for the receiving node, assuming those would trump the other requirements, but some of them are actually conflicting. Still, the intent is clear and explained in the rationale: "The other node should wait for that error to give the fallen-behind node an opportunity to fix its state first (e.g. by restarting with a different backup)." And it's been a year since that was integrated!

I'll create a spec PR to clean up the requirements around channel_reestablish, but regardless of that, I think your proposal of adding a feature bit doesn't make sense in that case for the following two reasons:

  • this is not a breaking change, since eclair's behavior hasn't changed and the spec change isn't a breaking change either
  • the spec requirements around failing a channel have always been a SHOULD, not a MUST, so if you relied on it being a MUST for your SCB you should have proposed changing that to a MUST or at least test the behavior of other implementations, that would have let us detect much earlier that implementations differed in their understanding of that part of the spec

Does that make sense?

t-bast avatar Jan 10 '23 07:01 t-bast

The requirements for channel_reestablish in the spec are still confusing

Yeah I think when re-reading it, the "a node" vs "a receiver" sections seem to sort of lack precedence (if one sort of conflicts with another, which should an implementation follow?).

this is not a breaking change, since eclair's behavior hasn't changed and the spec change isn't a breaking change either

Yeah, I understand that eclair's behavior hasn't changed (AFAIK our SCB flow has worked w/ eclair nodes np for years, and there're some very beefy lnd<->eclair chans on the network).

The purpose of the scenario above was to find out if CLN and/or LDK changed their behavior to just short-circuit (if the heights don't match up), without also checking whether the commitment point/secret is invalid for that state. Updating new nodes in the next lnd release to send the error is easy enough (like 5 lines of code). My only concern is the older nodes that may have had their flows broken by the potential short-circuit behavior outlined above. As mentioned above, we can give those nodes a patch (which they can ideally apply) if they're in a position where their funds are in limbo.

the spec requirements around failing a channel have always been a SHOULD, not a MUST, so if you relied on it being a MUST for your SCB you should have proposed changing that to a MUST or at least test the behavior of other implementations, that would have let us detect much earlier that implementations differed in their understanding of that part of the spec

That's a great point. We were originally going to make our SCB format/flow part of the spec, but then more time passed and people ended up diverging w.r.t what their seed derivation looked like (how to derive a node key from your seed, etc), etc. This was also before bLIPs were a thing as well. We'll make sure to do proper testing before the next release to make sure the behavior we expect is widely implemented.

Re the bit, I agree it doesn't make a lot of sense to go back and add one after the fact in this scenario. Just at times in the past, we have made modifications to the spec that are actually sort of breaking changes, but rely on the assumption that everyone will implement the change in quick order. If that isn't the case, or there're implementations that only focus on more fundamental compatibility (payments can work, etc), they'll sort of be left behind over time w/o any way for them to determine what the behavior of a node would be (just signalling ofc) before they interact w/ them. At one point we had BOLT 1.0 and that was basically "LN". Since then, a lot of smaller stuff has changed that may not amount to much in isolation, but as a sum can trigger very distinct behavior from the PoV of a "BOLT 1.0" peer.

Roasbeef avatar Jan 11 '23 01:01 Roasbeef

Do y'all also interpret this section as such?

Fwiw, our change here shouldn't have broken SCB: by my read of the code, we do some initial sanity checks first, and will fail if the pub/secret in the data loss protect is invalid before we warn on stale commitments.
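That check ordering can be sketched as follows (a minimal, hypothetical Python sketch, not LDK's actual API): validate the data-loss-protect fields before treating the peer as merely stale, so an SCB reestablish carrying a bogus commitment point still fails the channel.

```python
def classify_reestablish(dlp_fields_valid: bool, peer_is_stale: bool) -> str:
    """Decide how to react to a channel_reestablish, checking the
    data-loss-protect fields before the stale-height case."""
    if not dlp_fields_valid:
        # your_last_per_commitment_secret / my_current_per_commitment_point
        # don't match anything we know: send an error and fail the channel.
        return "error_and_fail"
    if peer_is_stale:
        # Heights don't line up but the secrets check out: warn and wait,
        # giving the peer a chance to fix its state.
        return "warn_and_wait"
    return "resume"
```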

TheBlueMatt avatar Jan 13 '23 19:01 TheBlueMatt

Ok we dug a bit more, and in practice it seems the issue is that some CLN peers never actually get the error message we send (I was wrong earlier: we'll always send the error right after the bad chan reest message, to prompt a close by either that error or the bad message). In an attempt to update behavior to match what's in this issue and the other spec PRs, CLN nodes disconnect right after sending the warning, so they'll never get the error message we send (to prompt the close).

This change causes a loop that looks like:

  • lnd node is hosed or did an SCB recovery
  • lnd node sends a bad channel_reestablish message, then sends an error right after.
  • CLN gets the bad chan_reest above, then sends a warning and disconnects.
    • As a result, CLN never gets that error message to prompt the force close.
  • lnd tries to persistently reconnect to achieve resolution, then we repeat from the top.

As a result, the connection is stuck in a dance where an error message can't be delivered iiuc.
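The delivery race above can be illustrated with a toy simulation (purely illustrative Python, not real implementation code):

```python
def run_reconnect_round(disconnect_after_warning: bool) -> list:
    """Simulate one reconnection round between lnd and CLN."""
    # lnd queues the bad reestablish, then the error, back to back.
    lnd_outbox = ["bad_channel_reestablish", "error"]
    cln_received = []
    for msg in lnd_outbox:
        cln_received.append(msg)
        if msg == "bad_channel_reestablish" and disconnect_after_warning:
            # CLN sends its warning and immediately drops the connection,
            # so the queued error is never read.
            break
    return cln_received
```

With an immediate disconnect the error never arrives, CLN never force-closes, and the dance repeats on every reconnection.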

I attempted to trace the code changes a bit here: https://github.com/ElementsProject/lightning/issues/5876#issuecomment-1382383833

Roasbeef avatar Jan 13 '23 21:01 Roasbeef

Good catch, that would explain the issue! In that case, cln indeed should not disconnect, otherwise the peer doesn't have the opportunity to do anything about the late reestablish.

t-bast avatar Jan 16 '23 08:01 t-bast

Not necessarily? If they're erroring on the bogus message they should disconnect (or at least that's historically been their interpretation of the spec, not sure what it says anymore), and at least for warnings a disconnect makes a lot of sense (it's used when the channel is out of sync and a disconnect may fix it). Disconnection makes sense there, I think; the issue is just that the warning generation was too early and could still have been an error.

TheBlueMatt avatar Jan 16 '23 16:01 TheBlueMatt