
Alternate strategies for outdated commitments

Open pm47 opened this pull request 3 years ago • 9 comments

This PR adds strategies to handle two typical disaster scenarios: outdated commitments and unhandled exceptions.

Default strategies may be the best choice for smaller, loosely administered nodes, while alternative strategies may avoid an unnecessary mass force-close (but are reserved for advanced users who closely monitor the node).

Strategies for outdated commitments:

  • request the counterparty to close the channel (default).
  • if the node was restarted less than 10 min ago, log an error message and stop the node

Strategies for unhandled exceptions:

  • local force close of the channel (default)
  • log an error message and stop the node

Default settings maintain the same behavior as before.
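
To make the configuration surface concrete, here is a minimal Scala sketch of the two knobs described above; the trait and case-object names are illustrative assumptions, not necessarily the identifiers used in the PR.

```scala
// Illustrative sketch only: these names are assumptions, not necessarily the PR's identifiers.
sealed trait OutdatedCommitmentStrategy
object OutdatedCommitmentStrategy {
  /** Default: ask the counterparty to publish their up-to-date commitment. */
  case object RequestRemoteClose extends OutdatedCommitmentStrategy
  /** Advanced: if the node restarted recently, log an error and stop instead of reacting automatically. */
  case object StopIfJustRestarted extends OutdatedCommitmentStrategy
}

sealed trait UnhandledExceptionStrategy
object UnhandledExceptionStrategy {
  /** Default: publish our local commitment (force close the channel). */
  case object LocalClose extends UnhandledExceptionStrategy
  /** Advanced: log an error and stop the node, leaving the channel untouched. */
  case object Stop extends UnhandledExceptionStrategy
}
```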

pm47 avatar Jun 07 '21 08:06 pm47

I don't think these two commits should be done in the same PR

Agreed, I got a bit carried away and kept experimenting.

I'm not sure the second one makes a lot of sense. In the long term it makes sense to change how we deal with remote errors and not always broadcast our local-commit, but that's probably better addressed by working on lightningnetwork/lightning-rfc#834 and reviewing in which case we don't have a choice and really need to force-close.

IIUC lightningnetwork/lightning-rfc#834 deals with protocol errors, but here I'm trying to address the case where we have an internal error: disk full, OOM exception, that sort of thing.
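
To illustrate that second strategy (reusing the UnhandledExceptionStrategy trait from the earlier sketch; the handler and forceClose placeholder are assumptions, not eclair APIs), the reaction to an internal error would simply branch on the configured strategy:

```scala
// Sketch only: reuses the UnhandledExceptionStrategy trait from the earlier sketch;
// `forceClose` is a placeholder, not an eclair API.
object UnhandledExceptionHandler {
  import UnhandledExceptionStrategy._

  def onUnhandledException(channelId: String, t: Throwable, strategy: UnhandledExceptionStrategy): Unit =
    strategy match {
      case LocalClose =>
        // default: publish our local commitment for this channel (force close)
        forceClose(channelId, t)
      case Stop =>
        // advanced: leave the channel untouched and halt the node so an operator can investigate
        System.err.println(s"fatal error in channel $channelId, stopping node: ${t.getMessage}")
        sys.exit(1)
    }

  // placeholder so the sketch stands on its own
  private def forceClose(channelId: String, cause: Throwable): Unit = ()
}
```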

As for the first commit, what do you do when the node stops but you do have your most up-to-date backup? If you restart, it will stop again. So you'd need to change your config from "halt-if-just-restarted" to "always-request-remote-close" before restarting, otherwise a single faulty channel would DoS the whole node...

Exactly, it requires human intervention, that's why this mode is for more advanced users. But it's less susceptible to a blind mass force-close, which is what this PR aims to address. I feel like any other solution in-between would be a trade-off between convenience and security; that's how I ended up with this rather basic mechanism.
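
As a rough sketch of that mechanism (the object and helper names are assumptions; only the 10-minute window comes from the strategy description above), the decision at detection time could look like this:

```scala
import java.time.{Duration, Instant}

// Sketch only: names are assumptions; the 10-minute window comes from the PR description.
object OutdatedCommitmentHandler {
  import OutdatedCommitmentStrategy._

  private val recentRestartWindow: Duration = Duration.ofMinutes(10)

  /** True if the node should log an error and stop instead of asking the peer to close. */
  def shouldStopNode(strategy: OutdatedCommitmentStrategy, nodeStartedAt: Instant, now: Instant): Boolean =
    strategy match {
      case RequestRemoteClose => false // default: always ask the peer to publish their commitment
      case StopIfJustRestarted =>
        // a recent restart is the typical signature of accidentally restoring an old backup,
        // so we halt and wait for a human instead of reacting automatically
        Duration.between(nodeStartedAt, now).compareTo(recentRestartWindow) < 0
    }
}
```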

pm47 avatar Jun 07 '21 12:06 pm47

Exactly, it requires human intervention, that's why this mode is for more advanced users.

So the flow if the node stops because of an outdated commitment would be:

  1. The node stops
  2. Investigate your backups
  3. If you had a DB issue, restart after fixing your DB
  4. Otherwise, if you find nothing and think you really are late, change your config to always-request-remote-close, restart, and accept losing those channels

Is that right? I think it should be documented clearly, probably a new document in the docs folder.

I feel like any other solution in-between would be a trade-off between convenience and security; that's how I ended up with this rather basic mechanism.

I agree, the other solutions I can think of are just proxies for almost the same behavior, so it's better to choose the simplest one. I don't think it's an area where we'll want many different strategies though, so maybe the new configuration parameter should just be a boolean stop-node-on-outdated-commitment-after-restart = false or something similar?
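
For illustration only, here is what the two shapes being discussed might look like when read with Typesafe Config (which eclair uses); the key names are placeholders, not the keys eclair ends up with:

```scala
import com.typesafe.config.Config

// Sketch only: key names are placeholders, not the keys ultimately used by eclair.
object StrategyConfig {
  import OutdatedCommitmentStrategy._

  // Option A: a simple boolean flag, as suggested above.
  def fromBoolean(config: Config): OutdatedCommitmentStrategy =
    if (config.getBoolean("stop-node-on-outdated-commitment-after-restart")) StopIfJustRestarted
    else RequestRemoteClose

  // Option B: an explicit strategy name, slightly more verbose but easier to extend later.
  def fromName(config: Config): OutdatedCommitmentStrategy =
    config.getString("outdated-commitment-strategy") match {
      case "request-remote-close"   => RequestRemoteClose
      case "stop-if-just-restarted" => StopIfJustRestarted
      case other                    => throw new IllegalArgumentException(s"unknown strategy: $other")
    }
}
```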

t-bast avatar Jun 07 '21 12:06 t-bast

I think it should be documented clearly, probably a new document in the docs folder.

Agree!

maybe the new configuration parameter should just be a boolean stop-node-on-outdated-commitment-after-restart = false or something similar?

It was a left-over from my previous attempt. I flip-flopped on this and thought keeping an explicit strategy name would be easier to understand... but it probably isn't.

pm47 avatar Jun 08 '21 13:06 pm47

I don't think these two commits should be done in the same PR,

Agreed, I got a bit carried away and kept experimenting.

On second thought, I kept both changes in the same PR; I think it makes sense since they are related topics.

I'm not sure the second one makes a lot of sense. In the long term it makes sense to change how we deal with remote errors and not always broadcast our local-commit, but that's probably better addressed by working on lightningnetwork/lightning-rfc#834 and reviewing in which case we don't have a choice and really need to force-close.

I could add a third strategy: not force-closing even if our peer sends an error, like lnd does (used to do?). But that wouldn't be spec compliant, especially since we have warning messages now.

I think it should be documented clearly, probably a new document in the docs folder.

Done with 93d07c4ec.

pm47 avatar Sep 09 '21 15:09 pm47

Codecov Report

Merging #1838 (8fda63e) into master (765a0c5) will decrease coverage by 0.10%. The diff coverage is 76.41%.

@@            Coverage Diff             @@
##           master    #1838      +/-   ##
==========================================
- Coverage   87.61%   87.51%   -0.11%     
==========================================
  Files         161      161              
  Lines       12575    12609      +34     
  Branches      552      532      -20     
==========================================
+ Hits        11018    11035      +17     
- Misses       1557     1574      +17     
Impacted Files                                              Coverage Δ
...air-core/src/main/scala/fr/acinq/eclair/Logs.scala       81.35% <0.00%> (-1.41%) ↓
...c/main/scala/fr/acinq/eclair/channel/Channel.scala       87.56% <67.39%> (-1.14%) ↓
...re/src/main/scala/fr/acinq/eclair/NodeParams.scala       91.30% <71.42%> (-0.63%) ↓
...c/main/scala/fr/acinq/eclair/channel/Helpers.scala       94.78% <83.72%> (-0.79%) ↓
...in/scala/fr/acinq/eclair/channel/Commitments.scala       94.65% <100.00%> (+0.14%) ↑
...ir-core/src/main/scala/fr/acinq/eclair/Setup.scala       79.77% <0.00%> (-0.91%) ↓
...ore/src/main/scala/fr/acinq/eclair/Timestamp.scala       100.00% <0.00%> (ø)
...chain/bitcoind/rpc/BasicBitcoinJsonRPCClient.scala       100.00% <0.00%> (ø)
...main/scala/fr/acinq/eclair/io/PeerConnection.scala       84.58% <0.00%> (+0.41%) ↑
... and 1 more

codecov-commenter avatar Oct 14 '21 13:10 codecov-commenter

So, while testing this PR I realized that in an "outdated commitment" scenario where we are on the up-to-date side, we always react by force-closing the channel immediately, not giving our peer a chance to fix their data and restart. On top of that, we consider this a commitment sync error, instead of clearly logging that our counterparty is using outdated data.

Addressing this turned out to be rabbit-holey: our sync code is quite complicated and a bit redundant, because we handle two things separately:

  • checking whether we are late
  • deciding what messages we need to retransmit

I also discovered an unhandled corner case when syncing in the SHUTDOWN state.

Obviously this code is very sensitive, because a bug there could trigger a mass force-close, which is what we are trying to avoid with this PR after all. If we choose to update our sync code, we could first deploy a version that only compares the legacy/new sync results and run it on our prod nodes for a few weeks to build confidence.
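
One way to read that "compare legacy/new sync results" idea is a shadow comparison: act on the legacy result, run the new code alongside it, and log any disagreement. A rough sketch with entirely hypothetical types follows; only the pattern is the point.

```scala
import scala.util.{Failure, Success, Try}

// Shadow-comparison sketch: `SyncResult`, `legacySync` and `newSync` are hypothetical
// stand-ins for whatever the real sync code computes; only the pattern matters here.
object SyncShadowCompare {
  final case class SyncResult(weAreLate: Boolean, retransmit: List[String])

  def checkSync(legacySync: () => SyncResult, newSync: () => SyncResult, log: String => Unit): SyncResult = {
    val legacy = legacySync()
    Try(newSync()) match {
      case Success(candidate) if candidate != legacy =>
        // a discrepancy is exactly what we want to catch before switching implementations
        log(s"sync mismatch: legacy=$legacy new=$candidate")
      case Failure(t) =>
        log(s"new sync code failed: ${t.getMessage}")
      case _ => () // results agree, nothing to report
    }
    legacy // keep acting on the battle-tested result during the trial period
  }
}
```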

pm47 avatar Oct 15 '21 14:10 pm47

Don't forget to also add an entry in the release notes for this, it's really newsworthy ;)

Done with 9957a83. I could include the full doc, but I don't think we should inflate the release notes too much. On the other hand, it would save us from using a relative link, which will break at some point.

pm47 avatar Oct 26 '21 13:10 pm47

Rebased on top of #2036, the diff is now much smaller.

pm47 avatar Oct 27 '21 09:10 pm47

Rebased on top of #2037.

This PR is not immediately useful, due to how other implementations react when they detect that the peer is using an outdated commitment: they immediately force-close. It would actually be dangerous not to use the default strategy on a non-static-remote-key, non-anchor-output commitment, because the peer will not resend their channel_reestablish, which tells us how to spend our main balance.
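
To make that caveat concrete, a guard along these lines could force the default strategy on channels where spending our main balance depends on the peer's channel_reestablish; the boolean flag here is a hypothetical stand-in for the real channel-type check, not eclair code:

```scala
// Sketch only: `usesStaticRemoteKeyOrAnchors` is a placeholder for the real channel-type
// check; the point is that the alternative strategy is only safe when spending our main
// balance does not depend on data from the peer's channel_reestablish.
object StrategySafetyCheck {
  import OutdatedCommitmentStrategy._

  def effectiveStrategy(configured: OutdatedCommitmentStrategy, usesStaticRemoteKeyOrAnchors: Boolean): OutdatedCommitmentStrategy =
    configured match {
      case StopIfJustRestarted if !usesStaticRemoteKeyOrAnchors =>
        // without static_remotekey or anchor outputs, our main output on the remote commitment
        // is derived from a per-commitment point the peer sends in channel_reestablish,
        // so halting here could leave funds unspendable: fall back to the default
        RequestRemoteClose
      case other => other
    }
}
```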

pm47 avatar Oct 27 '21 16:10 pm47