rust-lightning Figure out monitor persistence failure

In general if a ChannelMonitor fails to persist we're screwed. Currently there's a bunch of code to let the user return failure but there's ultimately no way to handle many of them - if we fail to persist eg a payment preimage we're screwed and may lose funds. With some of the ongoing async work if the user fails to persist a monitor update, we may have some other channel that gets hung waiting for a persistence to complete.

We should rip out all the handling and tell users to either infinite-loop trying to persist or panic.

Apr 06 '23 22:04 TheBlueMatt

Hmm, is this a fundamental problem with the Lightning protocol, or is this just an edge case that the BOLTs (or possibly LDK itself) do not address? Even if users go into an infinite persist loop, the node could always just crash before it manages to successfully persist, meaning that funds loss will happen eventually when LDK is used at scale. Is there no other way to fix this?

Oct 10 '23 00:10 MaxFangX

Sorry I missed this comment, but, yea, its basically fundamental to lightning - if persistence fails we can't make progress on the channel. We could respond by FC'ing, but I'm not sure users really want that either. Crashing is likely the safest thing to do, but that obviously sucks.

Aug 30 '24 15:08 TheBlueMatt

After failing to persist then crashing, would it ever be possible to recover the channel again, or would a FC permanently become the only option going forward? I wonder if it might be helpful to implement a way for cooperative peers to help the holder 'catch up' so that the channel can resume operations? The peer can choose steal funds if they want to instead, sure, but they might also be more interested in keeping the channel open so they can continue to make fees through routing.

Aug 30 '24 17:08 MaxFangX

Depending on the exact state that hit disk last (and in the future, not depending on anything), restarting should/would restore the channel to normal operation just fine.

Sep 01 '24 17:09 TheBlueMatt

Moving this to another release. I still think we should just wholesale remove the ChannelMonitorUpdateStatus::UnrecoverableError variant entirely but others seem to disagree. Even if we do that, though, maybe it should be after we fully support async persist/InProgress to make handling long-lasting errors more workable, so shifting to 0.2.

Nov 26 '24 15:11 TheBlueMatt