
Enable proposer boost re-orging

Open michaelsproul opened this issue 3 years ago • 11 comments

Proposed Changes

With proposer boosting implemented (#2822) we have an opportunity to re-org out late blocks.

This PR adds two flags to the BN to control this behaviour:

  • --enable-proposer-re-orgs={false,true}: turn aggressive re-orging on or off (default: off).
  • --proposer-re-org-fraction=N: attempt to orphan blocks with less than N% of the committee vote. If this parameter isn't set then N defaults to 10% when the feature is enabled.

For safety Lighthouse will only attempt a re-org under very specific conditions:

  1. The block being proposed is 1 slot after the canonical head, and the canonical head is 1 slot after its parent. I.e. when proposing at slot n + 1, rather than building on the block from slot n, we build on the block from slot n - 1.
  2. The current canonical head received less than N% of the committee vote. N should be set depending on the proposer boost fraction itself, the fraction of the network that is believed to be applying it, and the size of the largest entity that could be hoarding votes.
  3. The current canonical head arrived after the attestation deadline from our perspective. This condition was only added to support suppression of forkchoiceUpdated messages, but makes intuitive sense.
  4. The block is being proposed in the first 2 seconds of the slot. This gives it time to propagate and receive the proposer boost.
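The four conditions above can be sketched as a single predicate. This is an illustrative sketch only — the struct, field names, and function are hypothetical, not Lighthouse's actual types:

```rust
// Hypothetical sketch of the four re-org safety conditions; all names and
// types are illustrative, not Lighthouse's real implementation.
struct ReorgContext {
    proposal_slot: u64,      // slot we are about to propose at
    head_slot: u64,          // slot of the current canonical head
    head_parent_slot: u64,   // slot of the canonical head's parent
    head_vote_fraction: f64, // fraction of committee weight voting for the head
    head_arrived_late: bool, // head arrived after the attestation deadline
    seconds_into_slot: u64,  // how far into the proposal slot we are
}

fn should_attempt_reorg(ctx: &ReorgContext, reorg_fraction: f64) -> bool {
    // 1. Head is one slot behind us, and its parent one slot behind it.
    ctx.proposal_slot == ctx.head_slot + 1
        && ctx.head_slot == ctx.head_parent_slot + 1
        // 2. Head is weakly attested (below the configured fraction).
        && ctx.head_vote_fraction < reorg_fraction
        // 3. Head arrived after the attestation deadline.
        && ctx.head_arrived_late
        // 4. Early enough in the slot to propagate and earn the boost.
        && ctx.seconds_into_slot < 2
}

fn main() {
    let ctx = ReorgContext {
        proposal_slot: 101,
        head_slot: 100,
        head_parent_slot: 99,
        head_vote_fraction: 0.05,
        head_arrived_late: true,
        seconds_into_slot: 1,
    };
    assert!(should_attempt_reorg(&ctx, 0.10));
    // A strongly attested head is never re-orged.
    let strong = ReorgContext { head_vote_fraction: 0.40, ..ctx };
    assert!(!should_attempt_reorg(&strong, 0.10));
}
```

All four checks must pass; failing any one of them falls back to normal block production on the canonical head.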

Additional Info

For the initial idea and background, see: https://github.com/ethereum/consensus-specs/pull/2353#issuecomment-950238004

michaelsproul avatar Dec 08 '21 05:12 michaelsproul

This is blocked on the implementation of proposer boosting in other clients (namely Prysm and Nimbus).

I tested this on Prater and attempted some re-orgs, but failed every time due to Teku and Lighthouse not having enough network share to out-vote the others.

michaelsproul avatar Jan 07 '22 03:01 michaelsproul

As @paulhauner points out, we should be careful not to try to re-org out blocks that have updated the justified/finalized checkpoint, such that excluding them would not change the justified checkpoint. We could do this simply by not re-orging blocks at slot offset 31 in an epoch, or with some more complicated analysis.

michaelsproul avatar Feb 24 '22 01:02 michaelsproul

For the record, if this pans out it would be an excellent addition to the honest validator spec so that it becomes expected behaviour for validators to penalise late blocks.

ajsutton avatar Feb 24 '22 02:02 ajsutton

Based on the currently running experiment I think this feature is working well. I've got 5K validators on Prater proposing late blocks, and another 25K validators running with --enable-proposer-re-orgs=true.

So far ~59% of the re-orgs attempted have succeeded, and I think the failures are due to some validators on Prater not applying the boost. I'm in communication with the large Prater stakeholders to increase the number of boosting validators so we can confirm the behaviour on a more uniform network.

Spreadsheet of all recently attempted re-orgs here: https://docs.google.com/spreadsheets/d/1Of-VB-MNoWmg1bM8gI6sRuef3kgxQD1pQzMRzwVp6II

michaelsproul avatar Mar 07 '22 04:03 michaelsproul

I stumbled on this idea independently today when discussing with flashbots people how they (or mev hungry folks) might abuse block delivery times later than 4s to wait and gather more MEV.

My answer was "if proposers at N are dishonest and release blocks late, proposers at N+1 are likely to be dishonest and reorg them using proposer boost".

It's kind of a poor man's (block, slot).

That said... I'd like to surface this to a larger conversation and consider it as a timeliness mitigation in the specs, rather than as optional behavior in a single client type. Done incorrectly, this can be dangerous. E.g. if you didn't have the strict rules about prev-slot and parent-prev-slot and we had >4s latencies, all blocks would be orphaned. I'm glad you did think this particular edge case through to bound the cascading behavior, but if we leave it to many different, disparate client implementations, we might get unexpected outcomes.

djrtwo avatar May 05 '22 00:05 djrtwo

we should be careful not to try to re-org out blocks that have updated the justified/finalized checkpoint, such that excluding them would not change the justified checkpoint. We could do this simply by not re-orging blocks at slot offset 31 in an epoch, or with some more complicated analysis.

I don't think you'll need this mitigation after we include the unrealized justified/finalized leaves fix discussed at Devconnect. That is, any leaf in epoch N should have its justification/finalization info considered in N+1 based on what would happen in that epoch transition, regardless of whether a block transitioned it into that epoch.

djrtwo avatar May 05 '22 00:05 djrtwo

That said... I'd like to surface this to a larger conversation and consider it as a timeliness mitigation in the specs rather than this as optional behavior in a single client type.

Agree. I was planning to take Adrian's advice and raise a PR to the honest validator spec. Do you think the conditions I have at the moment are neat enough? They feel a little bit artificial, and perhaps trigger-happy validators will be tempted to loosen them, but that might be the best we can do.

I don't think you'll need this mitigation after we include the unrealized justified/finalized leaves fix discussed at devconnect.

I was thinking about this more yesterday, and I think it's probably still better to play it safe at the epoch boundary. The block in slot 31 might include just enough attestations to justify the epoch (while the block at slot 30 does not). If we build a new block at 32 on 30, then it will have a lower justified checkpoint than 31, which I think will make it inferior from fork choice's PoV, even with the newest changes?
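The simple "play it safe at the epoch boundary" mitigation mentioned earlier reduces to a slot-offset check. A minimal sketch, assuming mainnet's 32-slot epochs (the function name is hypothetical):

```rust
// Sketch of the simple mitigation: never attempt to re-org a head sitting in
// the last slot of an epoch (offset 31 with 32-slot epochs), since that block
// may carry exactly the attestations needed to justify the epoch.
const SLOTS_PER_EPOCH: u64 = 32;

fn is_epoch_boundary_head(head_slot: u64) -> bool {
    head_slot % SLOTS_PER_EPOCH == SLOTS_PER_EPOCH - 1
}

fn main() {
    assert!(is_epoch_boundary_head(31));   // last slot of epoch 0
    assert!(is_epoch_boundary_head(63));   // last slot of epoch 1
    assert!(!is_epoch_boundary_head(30));  // safe to consider re-orging
    assert!(!is_epoch_boundary_head(32));  // first slot of epoch 1
}
```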

michaelsproul avatar May 05 '22 04:05 michaelsproul

I was thinking about this more yesterday, and I think it's probably still better to play it safe at the epoch boundary. The block in slot 31 might include just enough attestations to justify the epoch (while the block at slot 30 does not). If we build a new block at 32 on 30, then it will have a lower justified checkpoint than 31, which I think will make it inferior from fork choice's PoV, even with the newest changes?

Oh, I was thinking about this situation the other day and even made it a unit test: https://github.com/prysmaticlabs/prysm/blob/bedba3f879145e067938ce775808b9408d512669/beacon-chain/forkchoice/doubly-linked-tree/unrealized_justification_test.go#L143 That block could come even 31 slots late (before we include block 64) and still be head.

potuz avatar May 08 '22 16:05 potuz

It's kind of a poor man's (block, slot).

I agree with this take, but I am worried that both this PR, as well as a proper (block, slot)-voting won't be able to prevent late block proposing / attesting.

I wrote down my thoughts in a HackMD in response to a Flashbots GitHub issue. tldr:

  • Proposers are incentivized to propose blocks as late as possible (more tx mempool listening time, more MEV), while still making sure to become canonical.
  • On the surface it seems like reorging late blocks via PB (this PR) ensures that validators propose at least by 4s into slot.
  • But once validators start proposing around the attestation deadline, you start to introduce incentives to attest just a little bit later. Basically you want to be sure to not vote too early and accidentally vote for the parent, while most of the committee votes for the late block. More details in HackMD.
  • This cycle could end up with super late block proposals / attestations.
  • This is particularly worrisome because validators can converge towards later attestation times individually; there is no need to collectively collude on the timing equilibrium.

It turns out that the above logic not only applies to this PR, but unfortunately also to (block, slot)-voting. As soon as proposers start releasing their blocks around the attestation deadline, we have the potential to run into this vicious cycle that ends up with late block proposals/attestations. At least this is my worry/understanding at the moment; more details in the linked HackMD.

Because of this I am starting to think more about how to incentivize timely block releasing specifically. Currently the block proposer is rewarded in proportion to the profitability of the attestations they include in their block. Instead we could try to also account for the proposer’s timeliness using some heuristic. One heuristic could be to scale the proposer’s reward by the share of same-slot committee votes that the block receives and are included in the subsequent block. At this point this is definitely more of an idea than a proposal. Also it won't be effective if MEV rewards are dominating timeliness rewards. Regardless, I think it can help keep the timing equilibria in line and incentivizing it specifically would make sense either way imho. A high level write-up can be found here.
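The heuristic described above — scaling the proposer's reward by the share of same-slot committee votes included in the next block — can be sketched in a few lines. This is purely illustrative of the idea, not a spec rule or implemented reward formula:

```rust
// Illustrative only: scale the proposer reward by the share of same-slot
// committee votes for the block that get included in the subsequent block.
// This is a sketch of an idea from the discussion, not implemented behaviour.
fn scaled_proposer_reward(base_reward: u64, same_slot_votes: u64, committee_size: u64) -> u64 {
    if committee_size == 0 {
        return 0;
    }
    // Integer arithmetic: reward * (votes / committee_size).
    base_reward.saturating_mul(same_slot_votes) / committee_size
}

fn main() {
    // A timely block seen by 90% of the committee keeps 90% of the reward.
    assert_eq!(scaled_proposer_reward(1_000_000, 90, 100), 900_000);
    // A very late block seen by only 10% keeps 10%.
    assert_eq!(scaled_proposer_reward(1_000_000, 10, 100), 100_000);
}
```

As the comment notes, a multiplier like this only bites if MEV rewards don't dwarf the attestation-derived component.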

casparschwa avatar May 09 '22 15:05 casparschwa

Getting some good data from testing this on Ropsten. I've got 2.5k validators proposing blocks after 3.6 seconds, and 5k validators trying to re-org them.

We see examples of the late blocks getting more votes than the threshold and not being re-orged:

Aug 17 14:43:15.873 ERRO Block broadcast was delayed root: 0xcff1cb6d146eae25a74c316038d41b825a0f4415376fb723c81eff6dcddac860, slot: 568716, delay_ms: 3775, msg: system may be overloaded, block may be orphaned
Aug 17 14:43:24.005 DEBG Not attempting re-org due to strong head, re_org_weight: Some(10609231565033), head_weight: Some(22560000000000), head: 0xcff1cb6d146eae25a74c316038d41b825a0f4415376fb723c81eff6dcddac860, service: beacon

(each line from a different node)

We also see examples of blocks getting fewer votes than the threshold and being successfully re-orged:

Aug 17 13:05:27.906 ERRO Block broadcast was delayed root: 0x8da85437922f362ee35257212c0da0dab8d4b949810142b82aa0da7cac4222f0, slot: 568227, delay_ms: 3735, msg: system may be overloaded, block may be orphaned
Aug 17 13:05:36.465 DEBG Attempting re-org due to weak head re_org_weight: Some(10609231435377), head_weight: Some(9344000000000), re_org_head: 0x30faf5b790ebb849e19f64cdfc2b22bac318d61151d4e1307c8d1b3dc8254cc0, head: 0x8da85437922f362ee35257212c0da0dab8d4b949810142b82aa0da7cac4222f0, service: beacon
Aug 17 13:05:36.872 WARN Beacon chain re-org reorg_distance: 1, new_slot: 568228, new_head: 0x2ef26b775963bbc3a90c7f435528af835cbfe0da7646ef77978815568daf7a87, previous_slot: 568227, previous_head: 0x8da85437922f362ee35257212c0da0dab8d4b949810142b82aa0da7cac4222f0, service: beacon
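The weak/strong head decision in these logs boils down to a weight comparison against a threshold derived from the configured fraction. A hypothetical sketch (function names and the threshold derivation are illustrative, not Lighthouse's actual code):

```rust
// Hypothetical sketch of the weight check behind the logs above:
// re_org_weight acts as a threshold (the configured fraction of total
// committee weight); a head below it is weak and eligible for re-orging.
fn reorg_threshold(total_committee_weight: u64, fraction_percent: u64) -> u64 {
    total_committee_weight / 100 * fraction_percent
}

fn head_is_weak(head_weight: u64, threshold: u64) -> bool {
    head_weight < threshold
}

fn main() {
    assert_eq!(reorg_threshold(100_000, 10), 10_000);
    // Using the re_org_weight / head_weight values from the logs:
    let threshold = 10_609_231_435_377u64;
    assert!(head_is_weak(9_344_000_000_000, threshold));   // re-org attempted
    assert!(!head_is_weak(22_560_000_000_000, threshold)); // strong head, left alone
}
```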

michaelsproul avatar Aug 18 '22 00:08 michaelsproul

TODO:

  • [x] Add participation circuit breaker
  • [x] Filter equivocating validators from unique_block_weight
  • [x] More tests

michaelsproul avatar Sep 20 '22 08:09 michaelsproul

Marking this ready for review.

Guide for reviewers:

  • Spec is here: https://github.com/ethereum/consensus-specs/pull/3034. We match it quite closely.
  • Most of the important logic is contained in get_proposer_head in consensus/proto_array/src/proto_array_fork_choice.rs and BeaconChain::overridden_fork_choice_update.
  • Tests are in interactive_tests. They do fine-grained introspection of the execution engine at the time at which fcU is sent with payload attributes. The Hook in the mock execution layer allows collection of this data.
  • Fork choice handling of justified balances was refactored using the JustifiedBalances struct. This prevents re-summing of the justified balances which we now need in 3 places: proposer boost, re-org threshold, participation threshold.
  • The conditions for sending payload attributes are tweaked. We never send more than 4 seconds before the start of the slot, in order to maximise the number of transactions packed by execution engines that don't update the payload after the first call (Geth, currently). The maximum time is also configurable via --prepare-payload-lookahead. This change simplified some of the reasoning and I think it's nice to have (I would also be happy to revert it if it's deemed too risky).
  • To simplify things only merged networks are supported (i.e. no Gnosis chain for now).
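The "never send payload attributes more than 4 seconds before the slot" behaviour described above can be sketched as a simple clamp on the wait time (illustrative names; the real logic lives in the BN's payload preparation service, and the 4s default corresponds to --prepare-payload-lookahead):

```rust
use std::time::Duration;

// Sketch: delay sending fcU with payload attributes until we are within
// `lookahead` (assumed default: 4s) of the slot start, so single-shot
// execution engines build their payload as late as possible.
fn payload_attributes_wait(time_until_slot: Duration, lookahead: Duration) -> Duration {
    // How long to wait before sending; zero means send immediately.
    time_until_slot.saturating_sub(lookahead)
}

fn main() {
    // 10s before the slot with a 4s lookahead: wait 6s, then send.
    assert_eq!(
        payload_attributes_wait(Duration::from_secs(10), Duration::from_secs(4)),
        Duration::from_secs(6)
    );
    // Already within the lookahead window: send immediately.
    assert_eq!(
        payload_attributes_wait(Duration::from_secs(2), Duration::from_secs(4)),
        Duration::ZERO
    );
}
```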

michaelsproul avatar Oct 18 '22 06:10 michaelsproul

I want to add some metrics for get_proposer_head and measure the impact of the new JustifiedBalances cache on fork choice run times before merging.

michaelsproul avatar Oct 28 '22 07:10 michaelsproul

I think this is ready to roll. On Ropsten performance looks great:

  • ForkChoice::get_head inside block processing is taking only ~2ms, with occasional spikes to 15ms, I think likely when computing justification/finalization. We didn't previously have a metric for this, but these times seem good. I'll post again if I can confirm an improvement due to the JustifiedBalances cache.

(graph: block_processing_fork_choice)

  • The override checker seems to run in sub 1ms territory most of the time with occasional spikes to ~5-15ms when it has to do some work:

(graph: fcu_override)

  • Amazingly, the get_proposer_head function seems to take almost no time at all. It's too small to register on Prometheus, but manually curling /metrics shows:
beacon_block_production_get_proposer_head_times_sum 0.000023526
beacon_block_production_get_proposer_head_times_count 1

Will roll out on Goerli to get some more numbers.

michaelsproul avatar Oct 31 '22 04:10 michaelsproul

One thing I do wonder about is how builders are going to handle these re-orgs. My understanding (primarily from this issue) is that builders are relying on some signals from a custom fork (:scream_cat: :cry:) of Prysm that tells them to start building on the head. If they see a late block I assume they'll start preparing a payload on top of it. If they're using Prysm while it doesn't have this re-org feature, I assume they'll be surprised by the Lighthouse block building atop the late block's parent (rather than the late block).

I do not have an answer to this, and I was assuming that builders will prepare blocks in parallel. But assuming they won't: even when we implement this feature, Prysm will still call engine_newPayload upon receiving a block, so this is just a heads-up. I'm frankly a little surprised that builders would start to build at a new-payload event instead of an FCU. In any case, we will call newPayload immediately, but will start to delay the call to FCU on late blocks.

potuz avatar Nov 10 '22 00:11 potuz

Regarding the builders, they seem to be ready. @metachris on Discord said that they already have some heuristics for building blocks atop multiple heads: https://discord.com/channels/595666850260713488/874767108809031740/1019528332146130955

In the worst case where the builder doesn't have a block ready we'll just fallback to local block production. I've been testing on Ropsten for the last few weeks with the blocknative and blox relays and they seemed to work for re-org blocks (although they're currently not being used due to loss of finality on Ropsten).

michaelsproul avatar Nov 10 '22 00:11 michaelsproul

Thank you for the thoughtful review @paulhauner! I've responded to each of your feedback points and made a few cosmetic changes in https://github.com/sigp/lighthouse/pull/2860/commits/aa0d85ee6672c9f02d616947234e328293dd5d86.

I think the main outstanding points before we're ready to merge are:

  • Decide whether to bump the default min participation (particularly on mainnet).
  • Confirm that builders are up to scratch at least on Ropsten where this is currently running (will report back). My existing test hadn't proposed any builder blocks due to poor chain health, but I'm now running with the chain health checks disabled.

michaelsproul avatar Nov 10 '22 05:11 michaelsproul

Yeah, Francesco posted some feedback here which might mean we should change the circuit-breaker conditions before merging: https://github.com/ethereum/consensus-specs/pull/3034#issuecomment-1310954171.

Another option would be to merge this and apply tweaks later. I think there are no show-stopping objections, and we could even tweak the conditions prior to our next release.

michaelsproul avatar Nov 13 '22 23:11 michaelsproul

I'm marking this ready for review again :tada:

Conversation on the spec side has died down with seemingly everyone satisfied by the finalization check (the participation rate check has been axed for now). Most of the conversation occurred on the Ethereum R&D Discord: https://discord.com/channels/595666850260713488/595701173944713277/1049263250275061790, with Mikhail, Francesco, Potuz, JGM and Chris & Ruteri (Flashbots) all weighing in.

I think we should merge the implementation in Lighthouse before the upstream spec merges, because the upstream spec is blocked on the unrealized justification/finalization spec for fork choice, but is otherwise ready. We can also tweak the implementation after merging if desired.

On the builder side, it seems that BloxRoute are ready for late blocks arriving after 4 seconds and Flashbots are ready for late blocks arriving after 6 seconds. I've asked the Flashbots devs to update their cutoff to 4 seconds as well. I think incentives are sufficient that block builders will adopt these changes rapidly as the re-org strategy gains popularity on mainnet. In case of builder failure there's always the local fallback.

I'm running the re-org strategy on all ~20K of Sigma Prime's Goerli validators. I've collected some successful re-orgs that used builder blocks (all BloxRoute for now) here: https://gist.github.com/michaelsproul/720459c93f20ee61911ab1732337f7ca

michaelsproul avatar Dec 13 '22 00:12 michaelsproul

bors r+

michaelsproul avatar Dec 13 '22 05:12 michaelsproul

Build failed (retrying...):

bors[bot] avatar Dec 13 '22 05:12 bors[bot]

Build failed:

bors[bot] avatar Dec 13 '22 06:12 bors[bot]

bors retry

michaelsproul avatar Dec 13 '22 09:12 michaelsproul