Figure out monitor persistence failure
In general, if a ChannelMonitor fails to persist we're screwed. Currently there's a bunch of code to let the user return a failure, but there's ultimately no way to handle many of those failures - if we fail to persist e.g. a payment preimage we're screwed and may lose funds. With some of the ongoing async work, if the user fails to persist a monitor update, we may have some other channel that gets hung waiting for that persistence to complete.
We should rip out all the handling and tell users to either infinite-loop trying to persist or panic.
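For illustration, the "infinite-loop or panic" option could look roughly like the sketch below. It is only a sketch under assumptions: `persist_or_panic` and its parameters are hypothetical names, not part of LDK, and the closure stands in for whatever write the user's storage backend actually performs.

```rust
use std::fmt::Debug;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper (not an LDK API): call `persist` until it succeeds.
///
/// With `max_attempts = None` this loops forever; with `Some(n)` it panics
/// once the n-th attempt has failed, so the process never continues running
/// with an unpersisted monitor state. Returns the number of attempts made.
fn persist_or_panic<E: Debug>(
    mut persist: impl FnMut() -> Result<(), E>,
    max_attempts: Option<u32>,
    backoff: Duration,
) -> u32 {
    let mut attempt = 0u32;
    loop {
        attempt += 1;
        match persist() {
            // The write reached storage; it is now safe to move on.
            Ok(()) => return attempt,
            Err(e) => {
                if let Some(max) = max_attempts {
                    if attempt >= max {
                        // Give up loudly rather than return a failure
                        // that the caller cannot meaningfully handle.
                        panic!("monitor persistence failed after {} attempts: {:?}", attempt, e);
                    }
                }
                // Wait a little before retrying the same write.
                sleep(backoff);
            }
        }
    }
}
```

A caller would wrap its own write in the closure, e.g. `persist_or_panic(|| write_monitor_to_disk(&bytes), None, Duration::from_secs(1))`, where `write_monitor_to_disk` is again a made-up placeholder. Note that this does not address the crash-before-persist case; it only avoids carrying on after a failed write.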
Hmm, is this a fundamental problem with the Lightning protocol, or is this just an edge case that the BOLTs (or possibly LDK itself) do not address? Even if users go into an infinite persist loop, the node could always just crash before it manages to successfully persist, meaning that funds loss will happen eventually when LDK is used at scale. Is there no other way to fix this?
Sorry I missed this comment, but, yea, it's basically fundamental to lightning - if persistence fails we can't make progress on the channel. We could respond by force-closing (FC'ing), but I'm not sure users really want that either. Crashing is likely the safest thing to do, but that obviously sucks.
After failing to persist and then crashing, would it ever be possible to recover the channel again, or would an FC permanently become the only option going forward? I wonder if it might be helpful to implement a way for cooperative peers to help the holder 'catch up' so that the channel can resume operations? The peer can choose to steal funds instead if they want, sure, but they might also be more interested in keeping the channel open so they can continue to earn fees through routing.
Depending on the exact state that last hit disk, restarting should restore the channel to normal operation just fine; in the future it would do so regardless of what state hit disk.
Moving this to another release. I still think we should just remove the ChannelMonitorUpdateStatus::UnrecoverableError variant entirely, but others seem to disagree. Even if we do that, though, maybe it should wait until we fully support async persist/InProgress, which would make handling long-lasting errors more workable, so shifting to 0.2.