consensus-specs

Rethink FFG target block

Open dankrad opened this issue 4 years ago • 2 comments

Currently, the target of the FFG vote is the first block of an epoch. This means when an epoch is justified [finalized], it is actually the first block of that epoch that is justified [finalized]. This looks and feels like an off-by-one error: since we are always talking about an "epoch" being finalized, it should be the last block of an epoch (the previous one) that validators are voting for, thus justifying [finalizing] all blocks in that epoch.
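To make the difference concrete, here is a minimal sketch in Python. It is a toy model, not spec code: the chain is a plain dict from slot to block root (empty slots are simply absent), and `current_target` / `proposed_target` are hypothetical names introduced for illustration. It assumes the mainnet value of 32 slots per epoch, and that when the boundary slot is empty the target falls back to the most recent earlier block.

```python
SLOTS_PER_EPOCH = 32  # mainnet preset


def compute_start_slot_at_epoch(epoch: int) -> int:
    """First slot of the given epoch."""
    return epoch * SLOTS_PER_EPOCH


def latest_block_at_or_before(blocks: dict, slot: int):
    """Root of the most recent block at or before `slot`, or None if there is none."""
    for s in range(slot, -1, -1):
        if s in blocks:
            return blocks[s]
    return None


def current_target(blocks: dict, epoch: int):
    """Today's rule: the block at (or latest before) the first slot of `epoch`."""
    return latest_block_at_or_before(blocks, compute_start_slot_at_epoch(epoch))


def proposed_target(blocks: dict, epoch: int):
    """Proposed rule (illustrative): the last block of the previous epoch."""
    start = compute_start_slot_at_epoch(epoch)
    if start == 0:
        return None  # the genesis epoch has no previous epoch to point at
    return latest_block_at_or_before(blocks, start - 1)


blocks = {0: "genesis", 31: "b31", 32: "b32"}
print(current_target(blocks, 1), proposed_target(blocks, 1))
```

In this toy chain the two rules differ by exactly one block (`b32` versus `b31`). If slot 32 is empty, both rules resolve to `b31`.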

Advantages:

  • Fixing an off-by-one error that has already plagued us, and probably leads to services like beaconcha.in displaying participation statistics incorrectly
  • Validators would have 4/3 of a slot to get the target right (in the first slot of an epoch) rather than 1/3. Since we are still seeing ~3% drops, this could make a big difference, especially when the network is unstable and latency is high.
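The 4/3 versus 1/3 figure can be checked with a little arithmetic. The sketch below assumes mainnet timing (12-second slots, attestations due one third of the way into the slot) and that the target block is proposed on time at the start of its own slot; `propagation_window` is a hypothetical helper, not a spec function.

```python
from fractions import Fraction

SECONDS_PER_SLOT = 12                # mainnet value
ATTESTATION_OFFSET = Fraction(1, 3)  # attestations are due 1/3 into the slot


def propagation_window(slots_between_target_and_attestation: int) -> Fraction:
    """Seconds between an on-time target block and the attestation deadline."""
    slots = slots_between_target_and_attestation + ATTESTATION_OFFSET
    return slots * SECONDS_PER_SLOT


# Today: the target is proposed in the same slot the first attesters vote in.
current = propagation_window(0)
# Proposed: the target is the block from the slot before the epoch boundary.
proposed = propagation_window(1)
print(current, proposed)
```

Under these assumptions an attester in the first slot of an epoch has 4 seconds to see the target today, and would have 16 seconds under the proposal.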

Disadvantages:

  • ?
  • Other than needing a hard fork, I currently can't see any. I would really be interested if anyone can find the old discussion where this was decided and why; from the current perspective it seems to just be a bug.

dankrad, Jan 05 '21 16:01

I personally wouldn't touch this. It is a bit semantically strange that "finalize epoch N" doesn't immediately have a clear meaning, but I don't think it's terribly difficult to communicate/understand.

I don't actually remember the precise reason this was chosen. We had much debate around epoch transitions, where they happened and what was being voted on long ago. I'll see if I can do some digging to find/remember the reasoning.

One obvious (but not that important) reason is that in the genesis epoch, there is no "last block" in the previous epoch to vote on, but I don't think this is why the decision was made.

Another likely historic reason was to just finalize as much as possible (i.e. the extra block).


With respect to the 3% drops at the epoch boundary, I expect this to be addressed in a security patch to the fork choice pretty soon. A vote for emptiness at the epoch boundary would then be respected, and thus that 3% would be correct.

djrtwo, Jan 06 '21 23:01

> Since we are always talking about "epoch" being finalized, it should be the last block of an epoch (the previous one) that validators are voting for, thus justifying [finalizing] all blocks in that epoch.

It's not necessarily a blocker, but something to consider is client optimizations. AFAIK, most clients use some scheme whereby states on epoch boundaries are more readily available than "in between" states.

If you get an attestation and you need to load a state to obtain the shuffling, then reading the data.target.root state is generally more efficient than reading the data.beacon_block_root state, and you'll obtain the same attester shuffling from that block.
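As a rough illustration of that kind of optimization (a hypothetical sketch, not Lighthouse's actual code), a client could key its shuffling cache on the attestation's target checkpoint rather than on the head block. `AttestationData` is simplified here to the fields that matter, and `ShufflingCache` and `compute_shuffling` are made-up names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    root: str


@dataclass(frozen=True)
class AttestationData:
    # Simplified: the real container also carries `index` and `source`.
    slot: int
    beacon_block_root: str
    target: Checkpoint


class ShufflingCache:
    """Caches attester shufflings keyed by the target checkpoint (illustrative)."""

    def __init__(self, compute_shuffling):
        # `compute_shuffling` stands in for the expensive state load.
        self._compute = compute_shuffling
        self._by_target = {}
        self.misses = 0

    def get(self, data: AttestationData):
        key = (data.target.epoch, data.target.root)
        if key not in self._by_target:
            self.misses += 1
            self._by_target[key] = self._compute(data.target)
        return self._by_target[key]


cache = ShufflingCache(lambda target: f"shuffling@{target.epoch}/{target.root}")
target = Checkpoint(epoch=1, root="b32")
a = AttestationData(slot=33, beacon_block_root="b33", target=target)
b = AttestationData(slot=34, beacon_block_root="b34", target=target)
assert cache.get(a) == cache.get(b)
print(cache.misses)
```

Two attestations with different head blocks but the same target hit the same cache entry, so only one state load is needed. Any code built this way assumes the target root identifies an epoch-boundary state, which is the assumption the proposal would change.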

Once again, I'm not describing a blocker here. If we move forward with this, it would be worth clients/block-explorers/etc. doing some introspection into where they're making assumptions around the target root. (I'm not even sure LH uses that above-mentioned optimization anymore; it was just an example.)

Note: I just realised I assumed that AttestationData would have different semantics if this were implemented. That seems to be the case, but I'm not certain that's the intention.

> Since we are still seeing ~3% drops, this could make a big difference, especially when the network is unstable and high latency.

When it comes to stable/healthy chains (like mainnet) I think these 3% drops can/should be optimized out by clients. I've been doing a lot of analysis on LH over the past weeks/month to figure out why they're happening. There are definitely some cases in LH where we were performing sub-optimally (e.g., https://github.com/sigp/lighthouse/pull/2174, https://github.com/sigp/lighthouse/pull/2243, https://github.com/sigp/lighthouse/pull/2155).

I understand it would be better if the chain could perform optimally in the case where block production is delayed (this would help in adversarial situations). However, the point I'm trying to make is that if we're trying to resolve 3% drops on mainnet then I think we should first exhaust investigations into implementation inefficiency before we start making protocol changes which carry additional risk.

paulhauner, Mar 15 '21 00:03