Improve documentation around resolving "stuck" co-op channel closes
The spec outlines a few scenarios where channels should be force closed once a co-op close attempt has stopped making progress for some time. LDK currently only does this after two ChannelManager::timer_tick_occurred calls, and only if the peer is still connected. If the peer is disconnected, and has been for a while, implementations will need to decide for themselves when it's appropriate to force close channels in this state. To help them do so, we should expose some notion of when the peer was last offline (ideally as a block height).
I think we do force-close in ~all the cases where the spec says we should, modulo it being unclear what we should do if we're "waiting on a message" from a peer that's disconnected.
Still, I don't think this is something shutdown-specific. In general, LDK's principle on automated shutdown is "if we're screwed or need to get HTLC funds"; other than that, we expect downstream devs to have some concept of "when did we last hear from this peer, and if it's been a long time, auto-force-close, or maybe let the user do it". For a routing node, I believe most today let users do this manually and don't do anything automatically, which is what we do as well. For a mobile node, if you're tied to an LSP, you similarly probably shouldn't close, because if your LSP is down you can't open a new channel anyway, though if you are an auto-channel-picking node you may wish to have some automated force-close logic.
In general, for that to exist we need to track when we last spoke to a peer, and should expose that info. Once we have that, we could add some concept of a formal "shutdown timeout" (maybe as a function of N blocks after HTLCs all time out) and force-close after that expires. It's unclear to me how this fits into the broader "users should force-close if we haven't spoken to a peer in a year in general" world: if you are trying to coop shutdown with an LSP with intent to open a new channel to that LSP, there's not much reason to force-close quickly if the LSP is offline. If we're a routing node, however, we obviously do want to force-close if we're trying to shut down and the peer is gone (probably?).
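To make the "track when we last spoke to a peer" idea concrete, here is a minimal sketch of what a downstream application could keep on its own side. Everything in it (`PeerId`, `PeerLastSeen`, `max_offline_blocks`) is hypothetical and not part of LDK's API; it only illustrates recording a last-seen block height and applying a user-chosen threshold.

```rust
use std::collections::HashMap;

/// Hypothetical peer identifier (assumption: a 33-byte compressed public key).
pub type PeerId = [u8; 33];

/// Hypothetical application-side tracker; not an LDK type.
#[derive(Default)]
pub struct PeerLastSeen {
    /// Block height at which each peer was last seen connected.
    last_seen_height: HashMap<PeerId, u32>,
}

impl PeerLastSeen {
    /// Record that `peer` was connected (or sent us a message) at `height`.
    pub fn peer_seen(&mut self, peer: PeerId, height: u32) {
        self.last_seen_height.insert(peer, height);
    }

    /// Blocks elapsed since the peer was last seen, or `None` if never seen.
    pub fn blocks_since_seen(&self, peer: &PeerId, current_height: u32) -> Option<u32> {
        self.last_seen_height
            .get(peer)
            .map(|seen| current_height.saturating_sub(*seen))
    }

    /// Policy check: true once the peer has been silent for more than
    /// `max_offline_blocks`. Peers we have never seen return false, so the
    /// caller has to decide separately what to do about them.
    pub fn should_force_close(
        &self,
        peer: &PeerId,
        current_height: u32,
        max_offline_blocks: u32,
    ) -> bool {
        match self.blocks_since_seen(peer, current_height) {
            Some(blocks) => blocks > max_offline_blocks,
            None => false,
        }
    }
}
```

The threshold is deliberately left to the caller: an LSP-tied mobile node might pass a very large value (or never call `should_force_close`), while an auto-channel-picking node might use something much shorter.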
All that to say, I think we need to have some kind of formal thinking about when we expect users to force-close and when we'll do it (absent errors that mandate closure) before we jump to add more force-closes here or there.
A straw-man for this:
- Let's not worry about the general (non-shutdown) case: just document more clearly that users should consider timing out channels after a while, aside from tracking the last time we heard from a peer and exposing that.
- Then, for shutdown, remove the return of the top-level shutdown, and set the timer even if the peer is disconnected. The timer should be a function of N blocks after HTLCs expire. We should hide the complexity from the user, but this isn't actually trivial: we could send the shutdown and then not get the peer's shutdown back for a while, during which time they could add more HTLCs. So we'll need to apply something like the N to "time until peer sends a shutdown" (maybe just like 3 blocks instead), and apply it similarly if the peer is offline (if they're offline, give them N blocks to come online if there are no HTLCs; if there are HTLCs we'll already force-close; if they do come online, give them the above timeline to respond with a shutdown). This would make the worst-case time basically N blocks (until they come online), then HTLC-max-distance (for an HTLC they add immediately after coming back online), then another N after that. Is that okay?
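The worst-case arithmetic in the straw-man can be written out as a small sketch. The function and parameter names are hypothetical, and the example values used with it (144 blocks for N, 2016 blocks for the maximum HTLC distance) are assumptions picked for illustration, not values proposed in this thread.

```rust
/// Worst-case number of blocks between initiating shutdown and force-closing,
/// following the straw-man: N blocks waiting for the peer to come online,
/// then up to `htlc_max_distance` blocks for an HTLC the peer adds right
/// after reconnecting, then another N blocks once that HTLC has resolved.
///
/// Both parameters are hypothetical knobs, not existing LDK config options.
pub fn worst_case_shutdown_blocks(n_blocks: u32, htlc_max_distance: u32) -> u32 {
    n_blocks
        .saturating_add(htlc_max_distance)
        .saturating_add(n_blocks)
}
```

With N = 144 and an HTLC max distance of 2016 blocks, this gives 144 + 2016 + 144 = 2304 blocks, or roughly 16 days at ten minutes per block.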
> if they're offline, give them N blocks to come online if there's no HTLCs, if there are HTLCs we'll already force-close
Because the HTLCs will expire or something else?
> This would make the worst-case time basically N blocks (until they come online) then HTLC-max-distance (for an HTLC they add immediately after coming back online) then another N after that.
If the counterparty adds an HTLC and it clears after we've sent our shutdown and:
- they send back shutdown and disconnect: do we wait N blocks again for them to come back online or just the last N after the HTLCs clear?
- they disconnect without sending shutdown: do we wait N blocks again for them to come back online or do we resume waiting on the initial N block timeout?
> Because the HTLCs will expire or something else?
Yea, they'll expire and we'll close.
> they send back shutdown and disconnect: do we wait N blocks again for them to come back online or just the last N after the HTLCs clear?
I assume N after the HTLCs clear?
> they disconnect without sending shutdown: do we wait N blocks again for them to come back online or do we resume waiting on the initial N block timeout?
I assume we just force-close - we treat it as "N after the shutdown to get back a shutdown"?
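Putting the answers above together, the proposed deadlines could be modelled roughly as follows. This is only a sketch of the straw-man as discussed in this thread, not LDK behavior: `ShutdownWait`, `Timeouts`, and `force_close_height` are all hypothetical names, and whether the "waiting for their shutdown" window should be N or something shorter (the "maybe just like 3 blocks") is left as a separate parameter because the thread does not settle it.

```rust
/// Hypothetical model of what we are waiting on during a co-op close.
pub enum ShutdownWait {
    /// Peer is offline and no HTLCs are pending; we started waiting at `since`.
    PeerOfflineNoHtlcs { since: u32 },
    /// We sent our shutdown at `sent_at` and have not received the peer's.
    /// The clock keeps running from `sent_at` even if the peer disconnects.
    AwaitingPeerShutdown { sent_at: u32 },
    /// HTLCs are still pending; the existing HTLC-expiry logic covers this
    /// case, so no extra shutdown deadline applies.
    HtlcsPending,
    /// All HTLCs cleared at `cleared_at`; waiting for the close to finish.
    HtlcsCleared { cleared_at: u32 },
}

/// Hypothetical timeout knobs, in blocks.
pub struct Timeouts {
    /// The "N" from the discussion.
    pub n_blocks: u32,
    /// How long the peer gets to answer our shutdown with its own.
    pub shutdown_response_blocks: u32,
}

/// Height at which we would give up and force-close, or `None` if this
/// state has no shutdown-specific deadline.
pub fn force_close_height(state: &ShutdownWait, t: &Timeouts) -> Option<u32> {
    match state {
        ShutdownWait::PeerOfflineNoHtlcs { since } => Some(since.saturating_add(t.n_blocks)),
        ShutdownWait::AwaitingPeerShutdown { sent_at } => {
            Some(sent_at.saturating_add(t.shutdown_response_blocks))
        }
        ShutdownWait::HtlcsPending => None,
        ShutdownWait::HtlcsCleared { cleared_at } => Some(cleared_at.saturating_add(t.n_blocks)),
    }
}
```

Note that a peer disconnecting after sending its shutdown does not restart the "come back online" window in this model; the only deadline that matters then is N blocks after the HTLCs clear.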
Dropping the milestone as it's not entirely clear to me we want to do anything here, vs. letting the user handle the force-close if shutdown is hung (we do now expose the ChannelShutdownState in the ChannelDetails, so users can probably see why).