Allow honest validators to reorg late blocks
Specify a function in fork choice get_proposer_head which returns the parent block upon which a proposer may build while possibly reorging the canonical head (if it was broadcast late and lacks sufficient attestation weight).
The mechanism is designed with safeguards built in so that it does not cause liveness failures:
- Only single-slot reorgs are attempted: block
N + 2may build on blockNif blockN + 1arrived late. - Participation must be approximately greater than
PARTICIPATION_THRESHOLD= 70%. REORG_WEIGHT_THRESHOLDis set to 20% (half of the proposer boost) so that an attacker must control up toPROPOSER_BOOST_SCORE - REORG_WEIGHT_THRESHOLD= 20% of attester weight in order to perform an ex-ante reorg. The attack works by releasing attestations for the block at slotN + 1around the same time as the honest proposer publishes the attempted reorg block atN + 2. The late attestations tip the weight of theN + 1block from< REORG_WEIGHT_THRESHOLDto> PROPOSER_BOOST_SCORE, meaning that blockN + 1will be preferred to blockN + 2.
For background on the development of this idea, see: https://github.com/sigp/lighthouse/pull/2860.
Pros
- Completely optional for clients and node operators, can be switched on and off via a flag. Low risk.
- Can be rolled out almost immediately unlike other late block mitigations which are still in R&D.
Cons
- Considerable complexity, particularly regarding the overriding of fork choice updates for the canonical head.
- Obsoleted by
(block, slot)fork choice. If a viable implementation of(block, slot)fork choice emerges then reorging late blocks will become the default behaviour. - Relies on proposer boost, which may be phased out in favour of other fork choice changes (e.g. view-merge).
Thanks to @potuz @mkalinin @terencechain @fradamt @casparschwa @djrtwo for feedback on this idea.
Since this is optional, it might be nice to separate it from other parts of the spec a bit more, similar to optimistic sync. We could contain the complexity a bit, especially if it might be phased out in the future.
What is the circuit-breaker mechanism for when latency is > 4s? Is the intention that checking "single_slot_reorg" and "participation_ok" should act as one, because, when latency is > 4s and blocks do not get attestations in their slot, the parent block would either be too old or not have enough weight?
If that's the case, I think this should be clarified. It might also be nice to have an explicit circuit-breaker that's easier to reason about, and gives clear liveness guarantees (e.g. explicitly turn off the reorg functionality whenever too many proposals are being missed)
@fradamt Yeah the single slot & participation conditions are meant to be the circuit-breaker for asynchrony (although I think the single slot condition alone is sufficient). I think the outcome is that with message delay $\text{4s} < \Delta < \text{16s}$ then every second block will be re-orged and we'll get 50% of blocks on average. Once the delay exceeds 16s then blocks will be forced to build on a parent from at least 2 slots prior and we'll get <50% of blocks on average, without the proposer re-org rules coming into effect at all. In other words the only change in behaviour is for the delay window $\text{4s} < \Delta < \text{16s}$.
It might also be nice to have an explicit circuit-breaker that's easier to reason about, and gives clear liveness guarantees (e.g. explicitly turn off the reorg functionality whenever too many proposals are being missed)
Are you thinking of something like the builder API circuit-breaker conditions about number of missed blocks in the last N blocks and number of epochs since finalization? We could add number-of-skipped-blocks check (while keeping single_slot_reorg), and replace the current participation check with the the epochs-since-finalization check? The main reason I'd like to keep the single_slot_reorg check is that it simplifies the implementation greatly in terms of calculating and caching things.
I've rebased this on dev and addressed some of the outstanding review comments. I think there are still several decisions we need to take before merging:
- Q1: Is the current location alongside the other fork choice functions the right location for these specs? As they're optional they could alternatively live in
./fork_choicealongside the safe block specs. - Q2: How should we address the tension between specifying a precise strategy, and giving implementations the freedom to explore more aggressive strategies? This has implications for how the spec is written, which tests we write, and what inferences we are able to make about re-orgs on live networks.
- A: max precision: Don't account for any deviations from the spec. Write test cases that enforce when clients should and shouldn't re-org (positive and negative tests, respectively).
- B: max freedom: Allow clients to deviate from the spec in any way they wish. Write some test cases that describe minimum scenarios in which clients should re-org, but don't specify any negative tests beyond very extreme cases where re-orgs are actually impossible.
- C: balance: Account for some acceptable deviations from the spec in tests. Write a mix of positive and negative tests while trying to avoid negative tests that would fail on current implementations. Allow implementations to skip some tests if they explore more exotic strategies in future.
I think my current preference for Q2 is option (C). I think the only outstanding difference between Prysm and Lighthouse is our handling of unrealized justification. Lighthouse currently won't re-org a block that updates unrealized justification, while Prysm will, because it assumes it can achieve equivalent justification by packing the block's attestations. ~~I'm not opposed to changing Lighthouse to follow Prysm and specifying this as a case where we should re-org.~~ I think Prysm's strategy is unsafe in general, so would prefer to keep the finalization check in the spec.
Given https://ethresear.ch/t/selfish-mining-in-pos/15551/3 I suggest we add a constant REORG_PARENT_WEIGHT_THRESHOLD with suggested value of uint64(160) and an extra condition
parent_weight = get_weight(store, parent_root)
parent_reorg_threshold = calculate_committee_fraction(justified_state, REORG_PARENT_WEIGHT_THRESHOLD)
parent_strong = parent_weight > parent_reorg_threshold
And require parent_strong = true to reorg.
I am starting to get more worried about this change -- the reason is that it implements a version of the block/slot fork choice, but without the mitigations that would have to be in place to avoid reorging under poor networking conditions.
Also, we should probably open a discussion about the 4s attestation timeline at this point. In order for attestations, we need (1) block propagation and (2) block validation all to happen in this time slot, which is starting to become more work (e.g. with 4844) and does not feel like the appropriate slot division anymore since the work in the other three thirds remains the same (and is overall less time critical).
without the mitigations that would have to be in place to avoid reorging under poor networking conditions.
There is a constraint on the slot of a parent block as a mitigation to poor network connections, i.e. parent_block.slot + 1 == head_block.slot. If a block was missed in a parent slot then no reorg is enforced.
Also, we should probably open a discussion about the 4s attestation timeline at this point.
I think we should raise it on one of the next calls. It would be nice to see the distribution of attestation and aggregates submitting over time into their respective slot phases.
There is a constraint on the slot of a parent block as a mitigation to poor network connections, i.e.
parent_block.slot + 1 == head_block.slot. If a block was missed in a parent slot then no reorg is enforced.
Yes, but that makes any slot whose parent was not reorged subject to a reorg. Seems pretty harsh!
I think we should raise it on one of the next calls. It would be nice to see the distribution of attestation and aggregates submitting over time into their respective slot phases.
Yeah, it's a good idea, I will check if we can get this data
I've pushed the change suggested by @potuz to address the "selfish mining" attack described here.
I would appreciate some help from @hwwhww writing tests. Alternatively we could merge as-is and add tests in a follow-up PR.
In addition to my previous reservations on this change, I also have a new one:
A fork choice change like this should not be taken light-heartedly. In fact, we explicitly don't want that people think they can just mess with the fork choice. As an example, if someone built a bribery protocol to reorg blocks intentionally, I would consider this an attack on Ethereum.
Therefore it might be best to actually roll back the change from clients now and re-start the discussion to make sure this is in a public forum and agreed by everyone, at least on ACDC.
This was widely discussed in ACD at the time and thoroughly agreed upon. The only reason this wasn't merged then is that we couldn't make public some aspects of pulled tips. The implementation complexity of these changes are deeply intricate to roll back lightly
Edit: we do have this behind a flag so we could easily revert that flag though
@dankrad I was keen to push ahead with this fix, because as Potuz said it had fairly broad support, but it also mitigated what I saw as a security issue introduced by MEV at the merge. Without late block re-orgs, proposers are incentivised to wait until the last moment to reveal their blocks, e.g. 10s into the slot. I think this damages the network more than intentionally re-orging blocks that are late.
Another thing is that this change does not modify fork choice, as in the part of fork choice that determines what the head should be. It's just an (incentive-compatible) tweak to the rules for choosing which block to build upon. Unlike bribe-based re-orgs, it also reinforces the honest behaviour of other validators: publishing blocks on time, and publishing attestations at or before the attestation deadline.
Like Prysm, Lighthouse has this feature behind a flag and could disable it by default (or permanently) if desired.
Hi @michaelsproul
I refactored the specs to re-arrange and reduce the duplicate code in get_proposer_head and should_override_forkchoice_update. It helps me to define the test cases. What do you think about this refactoring?
- Q1: Is the current location alongside the other fork choice functions the right location for these specs? As they're optional they could alternatively live in
./fork_choicealongside the safe block specs.
It reminds me of the confirmation rule PR, which defines the new feature in a new fork_choice/confirmation-rule.md file. I'm not 100% sure if adding a new doc is necessary, but it would be better to clarify which parts are "optional" and which are "honest validator fork choice rule".
If
should_override_forkchoice_updatereturnsTruebutget_proposer_headlater chooses the canonical head rather than its parent, then this is a misprediction that will cause the node to construct a payload with less notice. The result ofget_proposer_headMUST be honored in preference to the heuristic method.
I saw that get_proposer_head is optional in phase0 FC, but here, it's not clear to me if should_override_forkchoice_update is MUST to enable. Could you clarify a bit?
- Q2: How should we address the tension between specifying a precise strategy, and giving implementations the freedom to explore more aggressive strategies? This has implications for how the spec is written, which tests we write, and what inferences we are able to make about re-orgs on live networks.
IMO clients can have the flexibility to implement it into a more efficient algorithm, but in regard to the result of the computed head, it should be identical because of the risk of chain split. I doubt spec tests have covered everything. 😅
Another thing is that this change does not modify fork choice, as in the part of fork choice that determines what the head should be.
It modifies the rules on how to vote for a block. This is what a bribery protocol to e.g. censor certain transactions permanently would do, too.
It's just an (incentive-compatible) tweak to the rules for choosing which block to build upon. Unlike bribe-based re-orgs, it also reinforces the honest behaviour of other validators: publishing blocks on time, and publishing attestations at or before the attestation deadline.
Again, I'm not so much against this change, but now feel that it needs broader agreement and is not something that clients should be able to implement before that is reached. Maybe this is something to at least discuss on next ACD if people generally agree with this change.
FWIW, we have so far not implemented this in Nimbus (yet), for the simple reason it's not part of the spec.
+1 that having things like this behind flags dubious at best and should be considered a temporary solution to address urgent problems and not a permanent state.
It modifies the rules on how to vote for a block. This is what a bribery protocol to e.g. censor certain transactions permanently would do, too.
Edit: ah I see by voting you mean the vote the proposer is casting by asserting his head was the previous one. Will leave this comment in case someone else also took "attesting" for the meaning of "voting"
I don't see how this modifies voting rules:: attesters in the orphaned committee would not have attested for the late block cause they sent attestations at 4 seconds. Attesters in the next committee are following the usual forkchoice and proposer boost mandates them to vote for the reorging block.
having things like this behind flags dubious at best and should be considered a temporary solution to address urgent problems and not a permanent state.
Just to clarify this, the only reason Prysm has this behind a flag is to protect from an implementation bug that can cause a liveness issue. We do this with most changes that touch core functions like block production. This has nothing to do with the feature being or not merged in the spec. As far as we were concerned this was thoroughly discussed and agreed in ACD albeit some reservations about disclosing because of certain forkchoice attacks that's were present at the time
Hi @hwwhww, sorry for the slow reply! Thanks for cleaning this up and adding tests, I really appreciate that :blush:
I saw that get_proposer_head is optional in phase0 FC, but here, it's not clear to me if should_override_forkchoice_update is MUST to enable. Could you clarify a bit?
I tried to clarify the wording of this section in 3f1bc2051294d7ea632b5010500f98a331119d39. Both functions are optional, and should_override_forkchoice_update should only be enabled when the proposer reorg feature is enabled. This is how Lighthouse and Prysm work currently.
check if we should output get_proposer_head result. It is labeled optional so I hesitated. also, consider moving get_proposer_head to a separate place
I think the tests should include get_proposer_head as well, as it is arguably more important than should_override_forkchoice_update. If clients don't implement it they can simply skip those checks in their fork choice test runner.
I also think we should just leave it in the current location, as it seems like we may want to make this feature mandatory in future (Potuz mentioned that several PBS designs rely on it, but I haven't dug deeper).
confirm if test format is ok and run with a client
I'll try running the test vectors with Lighthouse soon. Maybe once they also include get_proposer_head checks.
@michaelsproul thank you!
I updated the test format: https://github.com/ethereum/consensus-specs/blob/d8440f8bb4496d053a36b5fef64420d5149156bb/tests/formats/fork_choice/README.md#checks-step
- add
get_proposer_head - add mocking
validator_is_connectedintoshould_override_forkchoice_updatecheck.
Now running the fork-choice testgen.
@michaelsproul
FC-only test vectors of commit d8440f8
FC-only test vectors of commit https://github.com/ethereum/consensus-specs/commit/d8440f8bb4496d053a36b5fef64420d5149156bb
Tests are passing on Lighthouse (https://github.com/sigp/lighthouse/pull/4887). I expected at least one test to fail because Lighthouse doesn't have the REORG_PARENT_WEIGHT_THRESHOLD check. This implies the test coverage could be expanded with a case where the reorg would go ahead if it weren't for the low weight of the parent (i.e. where head_weight < REORG_WEIGHT_THRESHOLD && parent_weight < REORG_PARENT_WEIGHT_THRESHOLD).