Alternate strategies for outdated commitments
This PR adds strategies to handle two typical disaster scenarios: outdated commitments and unhandled exceptions.
Default strategies may be the best choice for smaller, loosely administered nodes, while alternative strategies may avoid unnecessary mass force-closes (but are reserved for advanced users who closely monitor their node).
Strategies for outdated commitments:
- request the counterparty to close the channel (default).
- if the node was restarted less than 10 minutes ago, log an error message and stop the node.
Strategies for unhandled exceptions:
- local force close of the channel (default).
- log an error message and stop the node.
Default settings maintain the same behavior as before.
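To make the options above concrete, here is a minimal, self-contained Scala sketch of how the two strategy families could be modeled; the names and types are illustrative only, not necessarily the ones used in the eclair code:

```scala
// Illustrative sketch only: hypothetical names, not the actual eclair types.
object StrategiesSketch {

  sealed trait OutdatedCommitmentStrategy
  object OutdatedCommitmentStrategy {
    // default: ask the peer to close; we lose the channel but recover our funds
    case object RequestRemoteClose extends OutdatedCommitmentStrategy
    // alternative: if the node restarted very recently, assume a possible
    // backup/DB mix-up, log an error and stop instead of giving up the channel
    case object StopIfRecentlyRestarted extends OutdatedCommitmentStrategy
  }

  sealed trait UnhandledExceptionStrategy
  object UnhandledExceptionStrategy {
    // default: force-close the channel locally
    case object LocalClose extends UnhandledExceptionStrategy
    // alternative: log an error and stop the node (for closely monitored setups)
    case object Stop extends UnhandledExceptionStrategy
  }
}
```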
I don't think these two commits should be done in the same PR
Agreed, I got a bit carried away and kept experimenting.
I'm not sure the second one makes a lot of sense. In the long term it makes sense to change how we deal with remote errors and not always broadcast our local-commit, but that's probably better addressed by working on lightningnetwork/lightning-rfc#834 and reviewing in which case we don't have a choice and really need to force-close.
IIUC lightningnetwork/lightning-rfc#834 deals with protocol errors, but here I'm trying to address the case where we have an internal error: disk full, OOM exception, that sort of thing.
As for the first commit, what do you do when the node stops and you do have your most up-to-date backup? If you restart it will stop again. So you'd need to change your config from "halt-if-just-restarted" to "always-request-remote-close" before restarting, otherwise a single faulty channel would DoS the whole node...
Exactly, it requires human intervention, that's why this mode is for more advanced users. But it's less susceptible to blind mass force-close, which is what this PR aims to address. I feel like any other solution in-between would be a trade-off between convenience and security, that's how I ended up with this rather basic mechanism.
Exactly, it requires human intervention, that's why this mode is for more advanced users.
So the flow if the node stops because of an outdated commitment would be:
- The node stops
- Investigate your backups
- If you had a DB issue, restart after fixing your DB
- Otherwise, if you find nothing and think you really are late, change your config to always-request-remote-close, restart, and accept losing channels
Is that right?
I think it should be documented clearly, probably a new document in the docs folder.
I feel like any other solution in-between would be a trade-off between convenience and security, that's how I ended up with this rather basic mechanism.
I agree, the other solutions I can think of are just proxies for almost the same behavior, so it's better to choose the simplest one.
I don't think it's an area where we'll want many different strategies though, so maybe the new configuration parameter should just be a boolean stop-node-on-outdated-commitment-after-restart = false or something similar?
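Purely as an illustration (the setting name below is just the one suggested above, not necessarily what gets merged), such a boolean could be read with the Typesafe config library that eclair already uses:

```scala
import com.typesafe.config.ConfigFactory

// Hypothetical setting name, for illustration only.
object BooleanStrategySketch extends App {
  val config = ConfigFactory.parseString(
    "stop-node-on-outdated-commitment-after-restart = false")
  val stopOnOutdated = config.getBoolean("stop-node-on-outdated-commitment-after-restart")
  println(s"stop node on outdated commitment after restart: $stopOnOutdated")
}
```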
I think it should be documented clearly, probably a new document in the docs folder.
Agree!
maybe the new configuration parameter should just be a boolean stop-node-on-outdated-commitment-after-restart = false or something similar?
It was a left-over from my previous attempt; I flip-flopped on this and thought keeping an explicit strategy name would be easier to understand... but it probably isn't.
I don't think these two commits should be done in the same PR,
Agreed, I got a bit carried away and kept experimenting.
On second thought, I kept both changes in the same PR; I think it makes sense since they are related topics.
I'm not sure the second one makes a lot of sense. In the long term it makes sense to change how we deal with remote errors and not always broadcast our local-commit, but that's probably better addressed by working on lightningnetwork/lightning-rfc#834 and reviewing in which case we don't have a choice and really need to force-close.
I could add a third strategy: not force-closing even when our peer sends an error, like lnd does (used to do?). But that wouldn't be spec compliant, especially since we have warning messages now.
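Purely for illustration, here is a sketch of what such a third reaction to a remote error could look like as a strategy choice; these names are made up and nothing like this is implemented in this PR:

```scala
// Illustrative only: hypothetical strategy names, not the eclair message handlers.
object RemoteErrorSketch {
  sealed trait OnRemoteError
  case object ForceClose extends OnRemoteError // spec-compliant default: publish our commitment
  case object LogAndWait extends OnRemoteError // lnd-style: keep the channel, wait for the operator

  def handleRemoteError(strategy: OnRemoteError, channelId: String, message: String): Unit =
    strategy match {
      case ForceClose =>
        println(s"peer sent error on $channelId ($message): publishing our local commitment")
      case LogAndWait =>
        println(s"peer sent error on $channelId ($message): not force-closing, operator action required")
    }
}
```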
I think it should be documented clearly, probably a new document in the docs folder.
Done with 93d07c4ec.
Codecov Report
Merging #1838 (8fda63e) into master (765a0c5) will decrease coverage by 0.10%. The diff coverage is 76.41%.
```diff
@@            Coverage Diff             @@
##           master    #1838      +/-   ##
==========================================
- Coverage   87.61%   87.51%   -0.11%
==========================================
  Files         161      161
  Lines       12575    12609      +34
  Branches      552      532      -20
==========================================
+ Hits        11018    11035      +17
- Misses       1557     1574      +17
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...air-core/src/main/scala/fr/acinq/eclair/Logs.scala | 81.35% <0.00%> (-1.41%) | :arrow_down: |
| ...c/main/scala/fr/acinq/eclair/channel/Channel.scala | 87.56% <67.39%> (-1.14%) | :arrow_down: |
| ...re/src/main/scala/fr/acinq/eclair/NodeParams.scala | 91.30% <71.42%> (-0.63%) | :arrow_down: |
| ...c/main/scala/fr/acinq/eclair/channel/Helpers.scala | 94.78% <83.72%> (-0.79%) | :arrow_down: |
| ...in/scala/fr/acinq/eclair/channel/Commitments.scala | 94.65% <100.00%> (+0.14%) | :arrow_up: |
| ...ir-core/src/main/scala/fr/acinq/eclair/Setup.scala | 79.77% <0.00%> (-0.91%) | :arrow_down: |
| ...ore/src/main/scala/fr/acinq/eclair/Timestamp.scala | 100.00% <0.00%> (ø) | |
| ...chain/bitcoind/rpc/BasicBitcoinJsonRPCClient.scala | 100.00% <0.00%> (ø) | |
| ...main/scala/fr/acinq/eclair/io/PeerConnection.scala | 84.58% <0.00%> (+0.41%) | :arrow_up: |
| ... and 1 more | | |
So, while testing this PR I realized that in an "outdated commitment" scenario where we are on the up-to-date side, we always react by force-closing the channel immediately, not giving our peer a chance to fix their data and restart. On top of that, we consider this a commitment sync error, instead of clearly logging that our counterparty is using outdated data.
Addressing this turned out to be a bit of a rabbit hole: our sync code is quite complicated and somewhat redundant, because we separately handle:
- checking whether we are late
- deciding what messages we need to retransmit
I also discovered a missing corner case when syncing in the SHUTDOWN state.
Obviously this code is very sensitive, because a bug there could trigger a mass force close, which is what we are trying to avoid with this PR after all. If we choose to update our sync code, we could first deploy a version that only compares legacy/new sync results for a few weeks on our prod nodes to get a level of confidence.
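For reference, here is a heavily simplified, self-contained sketch of the kind of lateness check being discussed, based only on the commitment numbers exchanged in channel_reestablish (the real sync code also decides what to retransmit and handles pending revocations; all names below are illustrative):

```scala
// Simplified sketch of outdated-commitment detection during channel_reestablish.
object SyncCheckSketch {
  sealed trait SyncResult
  case object InSync extends SyncResult
  case object WeAreLate extends SyncResult  // we restored outdated data
  case object PeerIsLate extends SyncResult // the peer is using outdated data

  // the peer's channel_reestablish, reduced to the two counters that matter here
  final case class RemoteReestablish(nextCommitmentNumber: Long, nextRevocationNumber: Long)

  def checkLateness(ourLocalCommitIndex: Long,  // index of our latest local commitment
                    ourRemoteCommitIndex: Long, // index of the latest commitment we signed for the peer
                    remote: RemoteReestablish): SyncResult = {
    if (remote.nextRevocationNumber > ourLocalCommitIndex) {
      // the peer expects revocations for commitments we don't know about: we are late
      WeAreLate
    } else if (remote.nextCommitmentNumber < ourRemoteCommitIndex + 1) {
      // the peer has forgotten commitments we already signed for them: they are late
      PeerIsLate
    } else {
      InSync
    }
  }
}
```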
Don't forget to also add an entry in the release notes for this, it's really newsworthy ;)
Done with 9957a83. I could include the full doc, but I don't think we should inflate the release notes too much? On the other hand, including it would save us from using a relative link, which will break at some point.
Rebased on top of #2036, the diff is now much smaller.
Rebased on top of #2037.
This PR is not immediately useful because of how other implementations react when they detect that their peer is using an outdated commitment: they immediately force-close. It would actually be dangerous not to use the default strategy on a non-static-remote-key, non-anchor-outputs commitment, because the peer will not resend their channel_reestablish, which tells us how to spend our main balance.
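To illustrate the dependency: without static_remotekey (or anchor outputs), our main output in the peer's commitment is derived from a per-commitment point that, when we are the outdated side, we only learn from the my_current_per_commitment_point field of the peer's channel_reestablish. A toy sketch of that constraint, with placeholder types rather than the eclair API:

```scala
// Placeholder types; this only illustrates the dependency described above.
object DataLossRecoverySketch {
  final case class PerCommitmentPoint(hex: String)
  final case class Reestablish(myCurrentPerCommitmentPoint: Option[PerCommitmentPoint])

  def canRecoverMainOutput(staticRemoteKey: Boolean, reestablish: Option[Reestablish]): Boolean =
    if (staticRemoteKey) {
      // with static_remotekey or anchor outputs, our main output pays to a static key,
      // so we can claim it even if the peer force-closes without re-establishing
      true
    } else {
      // otherwise we need the peer's current per-commitment point to derive our key;
      // if the peer never sends channel_reestablish, we cannot spend our main balance
      reestablish.exists(_.myCurrentPerCommitmentPoint.isDefined)
    }
}
```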