lnd: implement "safe mode" node stand up

Open Roasbeef opened this issue 5 years ago • 13 comments

Although we now have the proper base set of tools in place (SCB) to enable nodes to safely reclaim their channels in the event of data loss, it's still possible that a node boots up with stale data. If this is the case, then the node is at risk of breaching the other channel peer inadvertently. Because subsystems like the contractcourt automatically force close channels that have expired HTLCs, this can happen in an automated fashion on restarts.

Rather than resuming normal operation when a user knows they may be restoring from an outdated state, we can instead implement a "safe mode" of sorts. When users boot up in this mode, all commitment broadcasts are forbidden. Once lnd has booted up, the user can then examine the set of channel states to see if they become borked once we connect to peers (indicative of local data loss).

Steps To Completion

  • [ ] Add new --safemode config parameter to the lnd binary.

  • [ ] If safe mode is enabled, reject all RPC level force close requests.

  • [ ] If safe mode is enabled, reject all automated force close requests by channel arbitrators.
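
As a rough illustration of the checklist above, here is a minimal Go sketch of a single check that both the RPC path and the channel arbitrators could consult before broadcasting a commitment. Everything in it (`Config`, `CloseSource`, `CheckForceClose`, `ErrSafeMode`) is a hypothetical name chosen for this example, not lnd's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSafeMode is a hypothetical error returned when a commitment broadcast
// is refused because the node was started in safe mode.
var ErrSafeMode = errors.New("safe mode enabled: commitment broadcast forbidden")

// CloseSource identifies who asked for the force close (hypothetical type).
type CloseSource int

const (
	// SourceRPC is a user request arriving over RPC.
	SourceRPC CloseSource = iota
	// SourceArbitrator is an automated request from a channel arbitrator.
	SourceArbitrator
)

// Config mirrors the proposed --safemode parameter. The field name is an
// assumption, not lnd's real config struct.
type Config struct {
	SafeMode bool
}

// CheckForceClose returns an error wrapping ErrSafeMode for every source
// while safe mode is on, covering both the RPC and the arbitrator items.
func CheckForceClose(cfg Config, src CloseSource, chanPoint string) error {
	if cfg.SafeMode {
		return fmt.Errorf("force close of %s (source %d): %w",
			chanPoint, src, ErrSafeMode)
	}
	return nil
}

func main() {
	// With safe mode on, the request is rejected; with it off, it passes.
	fmt.Println(CheckForceClose(Config{SafeMode: true}, SourceRPC, "txid:0"))
	fmt.Println(CheckForceClose(Config{}, SourceArbitrator, "txid:0"))
}
```

Routing both sources through one function keeps the "no commitment broadcasts" rule in a single place, which is one possible way to avoid missing a code path.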

Roasbeef avatar Jul 10 '19 02:07 Roasbeef

I am going to take a shot at working on this.

mlerner avatar Jul 28 '19 04:07 mlerner

It would be nice to have a flag on force close to do a "dangerous" force close

alexbosworth avatar Oct 15 '19 13:10 alexbosworth

@alexbosworth: Is the case in which that would be helpful when you want to make sure that no other channels are force-closed besides ones you specifically choose? I could see that being useful.

What do you think about adding a confirmation message in the case of a "dangerous force close" command received over RPC (I'm not sure if there is a precedent for that type of user interaction)? Also, I'm assuming that automated force closing requests would still be prohibited.

Alternatively, one could argue that "safe mode" should prevent a user from performing dangerous operations, and that the user should restart lnd without "safe mode" on if they want to do dangerous things like force close channels with a node in an outdated state - part of the goal of this feature is to allow a user to start lnd when they "know it is in an outdated state".

mlerner avatar Oct 15 '19 16:10 mlerner

The use case I am specifically thinking about is one where you have an out of date backup and you want to use it to recover funds

I'm not sure if this is handled in this PR, but blanket banning force closes seems risky to me in the event of race conditions relating to HTLC resolution

So where I would see using this is:

  1. User has an out of date backup
  2. They load their out of date backup into safe mode
  3. They recover as much as they can in safe mode, knowing that they are protected from breaching
  4. When that is finished, they decide for themselves the risk of force closing with unresponsive peers, hopefully after a long period of time of no connectivity

alexbosworth avatar Oct 23 '19 13:10 alexbosworth

Should possibly also reject channel state updates.

halseth avatar Oct 31 '19 09:10 halseth

> Should possibly also reject channel state updates.

This is already done for restored channels

(sorry, tabbed to Close and Comment lol)

cfromknecht avatar Oct 31 '19 17:10 cfromknecht

>> Should possibly also reject channel state updates.

> This is already done for restored channels

Yeah, but in this case regular channels won't be marked borked/restored, so they can still have updates.

halseth avatar Nov 01 '19 07:11 halseth

why wouldn't they? isn't this supposed to be used after restoring w/ SCB?

cfromknecht avatar Nov 02 '19 20:11 cfromknecht

It's a little unclear when it's possible to leave "safe mode" and resume normal operation. If our node has a bad state, contacts the other peer to initiate a force close, and then leaves safe mode before the peer's force close tx is confirmed, it's possible for our node to force close.

Also I agree with @alexbosworth above that if all force closes are banned, there could be some legitimate, synced channels which need to be force closed but aren't.

Crypt-iQ avatar Nov 04 '19 00:11 Crypt-iQ

> why wouldn't they? isn't this supposed to be used after restoring w/ SCB?

If you are restoring from SCBs then I don't think safe mode is necessary, since you don't have any toxic data.

halseth avatar Nov 07 '19 09:11 halseth

>> why wouldn't they? isn't this supposed to be used after restoring w/ SCB?

> If you are restoring from SCBs then I don't think safe mode is necessary, since you don't have any toxic data.

Yup, we reviewed this PR in the lnd review club, and conner also suggested maybe disabling several other features (bootstrapping, graph sync, channel acceptance) in addition to force closures.

Crypt-iQ avatar Nov 07 '19 12:11 Crypt-iQ

I think users don't always know whether they are in an old state, so I wonder if it makes sense to delay the channel_arbitrator actions (e.g. going on chain for an expired HTLC) and instead at least wait for the peer connection to be established. A wrong channel state would then cause our peer to force close the channel, which would probably keep us from going on chain with the wrong state.

ziggie1984 avatar Feb 01 '24 10:02 ziggie1984

@ziggie1984 good point. One way would be to have a mode to start in safe mode (so ppl could do it all the time), then later on check an endpoint to see if any actions would've been executed, then allow an API call to upgrade to regular operation.

> so I wonder if it makes sense to delay the channel_arbitrator actions like e.g. going on chain for an expired HTLC but instead at least wait for the peer connection to build up

FWIW, this would be the opposite of what was suggested in: https://github.com/lightningnetwork/lnd/issues/8166

I think a middle ground could make sense though. Need to think about it further.
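
The "start in safe mode, inspect, then upgrade" flow described in this comment could be sketched roughly as follows. This is only an illustration under assumptions: `Gate`, `Try`, `Deferred`, and `Upgrade` are hypothetical names, and the sketch does not replay suppressed actions after the upgrade (it assumes the calling subsystem would re-evaluate its own state).

```go
package main

import (
	"fmt"
	"sync"
)

// Gate is a hypothetical guard placed in front of commitment broadcasts.
type Gate struct {
	mu       sync.Mutex
	safe     bool
	deferred []string // descriptions of actions suppressed in safe mode
}

// NewGate starts in safe mode, so it could be used on every startup.
func NewGate() *Gate { return &Gate{safe: true} }

// Try runs action in regular mode. In safe mode it records desc and skips
// the action. It reports whether the action was executed.
func (g *Gate) Try(desc string, action func()) bool {
	g.mu.Lock()
	if g.safe {
		g.deferred = append(g.deferred, desc)
		g.mu.Unlock()
		return false
	}
	g.mu.Unlock()
	action()
	return true
}

// Deferred returns a copy of the suppressed actions, standing in for the
// "endpoint" that shows what would have been executed.
func (g *Gate) Deferred() []string {
	g.mu.Lock()
	defer g.mu.Unlock()
	return append([]string(nil), g.deferred...)
}

// Upgrade switches to regular operation, standing in for the "API call".
// Previously recorded actions are kept for inspection and are not replayed.
func (g *Gate) Upgrade() {
	g.mu.Lock()
	g.safe = false
	g.mu.Unlock()
}

func main() {
	g := NewGate()
	broadcast := func() { fmt.Println("broadcasting commitment") }

	g.Try("force close chan A: expired HTLC", broadcast) // suppressed
	fmt.Println("suppressed:", g.Deferred())

	g.Upgrade()
	g.Try("force close chan A: expired HTLC", broadcast) // now executed
}
```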

Roasbeef avatar Feb 01 '24 20:02 Roasbeef

Perhaps safe mode could be automatically enabled on startup if the node is more than X blocks behind the chain. The more blocks behind, the more likely the DB is out-of-date.

This would allow #8166 DoS protections in the crash and restart case, while a node that's been offline for a while will use safe mode until upgraded.
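
A minimal Go sketch of that startup heuristic, assuming the node knows the height it last synced to and the backend's current tip. The function name, parameters, and the 144-block threshold used in `main` are all hypothetical; the comment above leaves "X" unspecified.

```go
package main

import "fmt"

// ShouldStartInSafeMode reports whether the node is more than maxLag blocks
// behind the chain tip. If the tip is not ahead of the last synced height
// (lag is zero or negative), it returns false.
func ShouldStartInSafeMode(lastSyncedHeight, chainTipHeight, maxLag int32) bool {
	lag := chainTipHeight - lastSyncedHeight
	return lag > maxLag
}

func main() {
	const maxLag = 144 // arbitrary example value, roughly one day of blocks

	// Quick crash-and-restart: only a few blocks behind.
	fmt.Println(ShouldStartInSafeMode(800000, 800003, maxLag))

	// Node that has been offline for a while.
	fmt.Println(ShouldStartInSafeMode(800000, 801000, maxLag))
}
```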

morehouse avatar Mar 28 '24 19:03 morehouse