PeerDAS fork-choice, validator custody and parameter changes

fradamt opened this issue 1 year ago • 5 comments

This PR does three things:

  • Introduce the parameter changes discussed at the interop. I set the subnet count to 128 instead of 64 after discussions with the Codex team and Dankrad, the idea being that we might as well try something more ambitious (a better ratio of custodied to total data) and go back to 64 if devnets/testnets point to that. Happy to revert to 64 if this turns out to be a contentious choice. For context, a subnet count of 64 would mean that nodes with a single validator attached custody 1/4 of the original data, which still gives us quite a bit of room to increase the blob count without increasing bandwidth consumption.
  • Introduce validator custody. Full nodes still custody a minimum of 1/32 of the extended data, as in the current spec (the minimum custody is CUSTODY_REQUIREMENT = 4 out of 128 subnets), while nodes with validators attached are asked to custody at least VALIDATOR_CUSTODY_REQUIREMENT = 6 subnets, to provide a minimum level of security to their attestations, plus one extra subnet for every 16 ETH of balance (by balance and not by validator count, to account for the MaxEB change). Any node with at least 61 minimum-balance validators (~2000 ETH) would by default download all the data and always be able to reconstruct whenever possible. Moreover, its consensus participation would be completely unaffected by sampling, making it much harder for sampling to introduce any consensus risk. Edit: on Justin's suggestion, validator custody has been changed so that the rule is "1 subnet per 32 ETH, minimum 8, maximum 128" (see the sketch after this list).
  • Clarify the role of data availability in the fork-choice. I propose to mostly rely on the custody check, in particular using it to filter out unavailable blocks in get_head rather than not importing them at all. Peer sampling is instead used only to gate justifications and finalizations (by not importing blocks whose state has an unavailable unrealized justification), which accomplishes two goals. First, it ensures that transaction confirmation by waiting for finality has an extra layer of safety. Second, it makes it harder for validators to end up voting to finalize an unavailable checkpoint in case of a supermajority attack. Restricting peer sampling to these limited goals (where it actually has meaningful benefits over custody checks) also means it is very hard for it to disrupt consensus. A sketch of both checks follows the Todo list below.

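A minimal sketch of the updated custody rule ("1 subnet per 32 ETH, minimum 8, maximum 128") may help make the numbers concrete. Constant names other than VALIDATOR_CUSTODY_REQUIREMENT are placeholders for illustration, not spec constants:

```python
# Illustrative sketch of "1 subnet per 32 ETH, minimum 8, maximum 128";
# SUBNET_COUNT and BALANCE_PER_CUSTODY_SUBNET are placeholder names.

ETH_TO_GWEI = 10**9
SUBNET_COUNT = 128                             # total data column subnets
VALIDATOR_CUSTODY_REQUIREMENT = 8              # minimum with validators attached
BALANCE_PER_CUSTODY_SUBNET = 32 * ETH_TO_GWEI  # one subnet per 32 ETH of balance

def get_validator_custody_subnet_count(total_balance_gwei: int) -> int:
    """Subnets a node must custody, given the total balance of its validators."""
    by_balance = total_balance_gwei // BALANCE_PER_CUSTODY_SUBNET
    return min(max(by_balance, VALIDATOR_CUSTODY_REQUIREMENT), SUBNET_COUNT)

# 128 minimum-balance validators (4096 ETH) already imply full custody:
assert get_validator_custody_subnet_count(4096 * ETH_TO_GWEI) == SUBNET_COUNT
```
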
Resources:

Todo:

  • Agree on the parameters, in particular the subnet count and the validator custody parameters.
  • Decide whether we instead want validator custody to be assigned in-protocol, to have (at least social) accountability in case of extreme failures like finalization of an unavailable block. The trade-off is that in the current design validator custody contributes to the network (if you have a peer with many validators, that will be reflected in their advertised custody, and you can use that information for peer sampling) without deanonymization concerns.
  • Decide whether we are OK with the "normal" fork-choice proposed here, or whether we want to introduce some variant of (block, slot) to deal with the attack where a (non-supernode) proposer is tricked into extending an unavailable block. Alternatively, we could have proposers do peer sampling when blocks have very little weight. See here and here for more context. Currently my thinking is that this attack is restricted to a small percentage of proposers (the ones attached to a node with < 61 validators) and is not much easier than a proposer DoS, so perhaps we can treat it the same way: watch out for it and have credible countermeasures ready to implement if needed (while SSLE also works in this case, we can fix the problem completely with much simpler fork-choice changes, so it would be reasonably simple to deal with if it actually came up). Moreover, the attack requires controlling two slots in a row (otherwise proposer boost reorging would kick in), so it is a bit harder still.
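
To make the proposed division of labor concrete, here is a minimal sketch of the two availability checks described above. The helpers is_available_by_custody, is_available_by_sampling, and get_unrealized_justified_checkpoint are hypothetical names for illustration; this is not spec code:

```python
# Sketch: custody checks filter head selection, peer sampling gates
# justification. All helper names below are assumed, not from the spec.

def get_filtered_children(store, block_root):
    # Custody check: unavailable blocks are still imported, but filtered
    # out of head selection in get_head, so they carry no fork-choice weight
    # until their data is seen.
    return [
        child for child in store.children[block_root]
        if is_available_by_custody(store, child)
    ]

def can_import_block(store, block) -> bool:
    # Peer sampling only gates justification/finalization: refuse to import
    # a block whose post-state has an unrealized justification of an
    # unavailable checkpoint.
    checkpoint = get_unrealized_justified_checkpoint(store, block)
    return is_available_by_sampling(store, checkpoint)
```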

fradamt • May 24 '24

The vast majority of Ethereum mainnet stake is run by entities controlling > 64 validators each. So with validator custody, a ~90% majority will be gossiping and importing everything. Any issues with partial custody or sampling will affect a small minority and may not even affect the overall network's health noticeably.

I am not judging this fact, but it feels like an important consideration.

dapplion • Jun 05 '24

> The vast majority of Ethereum mainnet stake is run by entities controlling > 64 validators each.

"Majority of stake", yes, but that does not necessarily translate into "majority of nodes".

> So with validator custody, a ~90% majority will be gossiping and importing everything.

According to historical data from crawlers, it was estimated that only ~10% of nodes had over 64 validators.

> Any issues with partial custody or sampling will affect a small minority and may not even affect the overall network's health noticeably.

I actually expect the majority of nodes to run small custody sets. But I agree this is an important thing to keep in mind.

leobago • Jun 12 '24

> > The vast majority of Ethereum mainnet stake is run by entities controlling > 64 validators each.
>
> "Majority of stake", yes, but that does not necessarily translate into "majority of nodes".
>
> > So with validator custody, a ~90% majority will be gossiping and importing everything.
>
> According to historical data from crawlers, it was estimated that only ~10% of nodes had over 64 validators.
>
> > Any issues with partial custody or sampling will affect a small minority and may not even affect the overall network's health noticeably.
>
> I actually expect the majority of nodes to run small custody sets. But I agree this is an important thing to keep in mind.

When it comes to the stability and security of consensus, the minority of nodes that holds ~90% of the stake is mostly what matters. Even if most nodes in the network were regular nodes doing the minimum custody, we would still get huge benefits from 90% of the stake downloading everything, because consensus would be basically unaffected by availability issues, and everyone else (even non-staking nodes and nodes with few validators) would end up following the same fully available chain.

fradamt • Jun 14 '24

I just did some research on peer count. https://notes.ethereum.org/@pop/peer-count-peerdas (it's still a WIP)

I have a concern about CUSTODY_REQUIREMENT = 4 out of 128 subnets. It increases the number of peers you need to cover all subnets from 32 to 172, which is a lot.

```python
>>> peer_count(128, 4)
172.0125
>>> peer_count(32, 1)
32.0
```

cc: @cskiraly
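
For intuition, here is a rough Monte Carlo of one plausible model behind such numbers: each peer custodies k distinct subnets chosen uniformly at random, and we count how many peers it takes until every subnet is covered. This model and the helper below are my own illustration, not the peer_count formula from the linked note (which may differ, e.g. in how the baseline case is computed):

```python
import random

def expected_peers_to_cover(n: int, k: int, trials: int = 1000) -> float:
    """Average number of random peers, each custodying k of n subnets,
    needed before every subnet is covered by at least one peer."""
    total = 0
    for _ in range(trials):
        covered: set[int] = set()
        peers = 0
        while len(covered) < n:
            covered.update(random.sample(range(n), k))
            peers += 1
        total += peers
    return total / trials

# For n=128, k=4 this lands in the same ballpark as the ~172 figure above.
print(expected_peers_to_cover(128, 4))
```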

ppopth • Jul 16 '24

> I just did some research on peer count. https://notes.ethereum.org/@pop/peer-count-peerdas (it's still a WIP)
>
> I have a concern about CUSTODY_REQUIREMENT = 4 out of 128 subnets. It increases the number of peers you need to cover all subnets from 32 to 172, which is a lot.
>
> ```python
> >>> peer_count(128, 4)
> 172.0125
> >>> peer_count(32, 1)
> 32.0
> ```
>
> cc: @cskiraly

Good thing to point out :) While I do agree that this is a concern and something we should definitely take into account when deciding the parameters, I think we should also keep in mind that it is a worst-case measure that assumes all nodes are full nodes. If we were instead to assume all nodes are validators (also not correct, of course), the relevant number would be peer_count(128, 8), which is 85. And even that leaves out nodes with multiple validators, which have a higher custody requirement.

Still, we could consider being conservative and, for example, setting the custody group count to 128 and the minimum custody requirement for full nodes to 8. For quite some time this wouldn't be a problem, as we would still be able to go up to a max of 48 blobs per slot without increasing full-node bandwidth requirements compared to 4844 (a quick check of this arithmetic is sketched below). Eventually, we can hopefully increase peer counts and be less conservative about parameter choices.
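
As a sanity check on the bandwidth claim, here is a back-of-the-envelope calculation, assuming 4844's maximum of 6 blobs per block as the baseline and a 2x erasure-coding extension; the helper is illustrative, not spec code:

```python
# Rough check: blob-equivalents a full node downloads per slot when
# custodying 8 of 128 column subnets over 2x-extended data.
MAX_BLOBS_4844 = 6
CUSTODY_GROUP_COUNT = 128
FULL_NODE_CUSTODY = 8      # proposed minimum custody requirement
EXTENSION_FACTOR = 2       # extended data is 2x the original blob data

def full_node_download(blobs_per_slot: int) -> float:
    """Blob-equivalents downloaded per slot by a minimum-custody full node."""
    return blobs_per_slot * EXTENSION_FACTOR * FULL_NODE_CUSTODY / CUSTODY_GROUP_COUNT

# 48 blobs/slot -> 48 * 2 * 8/128 = 6 blob-equivalents, matching 4844's max.
assert full_node_download(48) == MAX_BLOBS_4844
```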

fradamt • Jul 16 '24