EIP-7594: PeerDAS open questions

Context

General background for PeerDAS design and goals:

https://ethresear.ch/t/peerdas-a-simpler-das-approach-using-battle-tested-p2p-components/16541

https://ethresear.ch/t/from-4844-to-danksharding-a-path-to-scaling-ethereum-da/18046

Open questions

Parameterization

Determine final parameters for a robust and secure network.

  • How many SAMPLES_PER_SLOT to hit the security level we want?
  • Compute MAX_REQUEST_DATA_COLUMN_SIDECARS as a function of MAX_REQUEST_BLOCKS and NUMBER_OF_COLUMNS: https://github.com/ethereum/consensus-specs/pull/3574#discussion_r1479098221
  • What should the CUSTODY_REQUIREMENT actually be? See thread: https://github.com/ethereum/consensus-specs/pull/3574#discussion_r1494647788

Availability look-behind

One particular parameter is how tight the sampling has to be with respect to block/blob processing and fork choice. For example, nodes could sample in the same slot as a block and not consider a block valid until the sampling completes. In the event this requirement is too strict (e.g. because of network performance), we could relax the requirement to only complete sampling within some number of trailing slots from the head. If we go with a trailing approach, are there additional complications in the regime of long-range forks or network partitions? Does working in this "optimistic" setting cause undue complexity in implementations?
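
To make the trailing option concrete, here is a minimal sketch (spec-flavored Python, not actual spec code) of what a look-behind availability filter in fork choice could look like; TRAILING_SLOTS, the sampling_completed set, and the Store fields are hypothetical names for illustration only:

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical look-behind window: sampling may lag the head by this many slots.
TRAILING_SLOTS = 2

@dataclass
class Block:
    slot: int
    parent_root: bytes

@dataclass
class Store:
    blocks: Dict[bytes, Block] = field(default_factory=dict)
    sampling_completed: Set[bytes] = field(default_factory=set)  # roots with finished sampling
    finalized_slot: int = 0

def is_viable_for_head(store: Store, block_root: bytes, current_slot: int) -> bool:
    # Walk the ancestors of the candidate head: any block older than the trailing
    # window must already have completed sampling, otherwise the branch is filtered.
    # Blocks within the window are accepted "optimistically" while sampling is pending.
    root = block_root
    while root in store.blocks:
        block = store.blocks[root]
        if block.slot <= store.finalized_slot:
            break  # everything at or below finality is assumed available
        if current_slot - block.slot >= TRAILING_SLOTS and root not in store.sampling_completed:
            return False
        root = block.parent_root
    return True
```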

Syncing

Some questions around syncing relating to PeerDAS and also the possible deprecation of EIP-4844 style sampling.

Deprecate blob_sidecars_by_root and blob_sidecars_by_range?

Can we deprecate these RPC methods? Note that nodes would still sample anything inside the blob retention window.

DataColumnSidecarsByRoot and DataColumnSidecarsByRange

Currently missing a method for ByRange. Required for syncing in the regime where clients are expected to retain samples. What is the exact layout of the RPC method? Multiple columns or just one? See thread: https://github.com/ethereum/consensus-specs/pull/3574#discussion_r1476067585
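
As a starting point, one possible layout (spec-style SSZ, modeled on the existing BlobSidecarsByRangeRequest; Container, Slot, uint64, ColumnIndex, and NUMBER_OF_COLUMNS are assumed from the existing specs, and whether the request carries a list of columns or a single index is exactly the open question here):

```python
class DataColumnSidecarsByRangeRequest(Container):
    start_slot: Slot
    count: uint64
    # Either a list of requested columns per request ...
    columns: List[ColumnIndex, NUMBER_OF_COLUMNS]
    # ... or, alternatively, a single `column_index: ColumnIndex` per request.
```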

Peer scoring

How to downscore a peer who should custody some sample but can’t respond with it?

Network shards design

See here for more context on the proposal: https://github.com/ethereum/consensus-specs/pull/3623. Likely a good simplification; it would touch some of the PeerDAS details around mapping a given peer to their sample subnets. Some additional implications: https://github.com/ethereum/consensus-specs/pull/3574#discussion_r1525029081

Subnet design

Map one column per subnet, unless we need to do otherwise, see https://github.com/ethereum/consensus-specs/pull/3574#discussion_r1520134142

ENR semantics

https://github.com/ethereum/consensus-specs/pull/3574#discussion_r1520237876

Spec refactoring

Misc. refactoring to align with the general spec style:

  • https://github.com/ethereum/consensus-specs/pull/3574#discussion_r1520124279
  • https://github.com/ethereum/consensus-specs/pull/3574#discussion_r1520133179
  • https://github.com/ethereum/consensus-specs/pull/3574#discussion_r1520151038
  • Ensure all comments with references to Deneb or 4844 now reference EIP-7594
  • https://github.com/ethereum/consensus-specs/pull/3574#discussion_r1520164567
  • https://github.com/ethereum/consensus-specs/pull/3574#discussion_r1520171368

ralexstokes avatar Apr 05 '24 16:04 ralexstokes

Does working in this "optimistic" setting cause undue complexity in implementations?

Big yes, but note that a similar gadget is required by inclusion lists (ILs) in their current design.

Deprecate blob_sidecars_by_root and blob_sidecars_by_range?

They don't appear necessary as the proposer should distribute columns directly.

DataColumnSidecarsByRange

Useful for column custodians to fetch all columns for a given subnet and epoch, like we do now for blobs.

dapplion avatar Apr 06 '24 06:04 dapplion

If we go with a trailing approach, are there additional complications in the regime of long-range forks or network partitions? Does working in this "optimistic" setting cause undue complexity in implementations?

Imho we should avoid having the whole validator set operating in an optimistic setting, even if we were to ignore implementation complexity and just worry about consensus security. One attack that this enables is:

  • A proposer or a builder (importantly, not someone controlling much stake) proposes an unavailable block B, in particular available only in 15 out of 32 subnets.
  • Everyone in the 15 subnets where it is available votes for B because sampling is not required yet
  • Though B has a lot of votes, the next proposer does not build on it because sampling fails
  • Data is meanwhile made fully available. Sampling now succeeds for everyone.
  • No one votes for the new proposal because B has weight > proposer boost and the proposal does not extend it

This can perhaps be fixed by requiring the attesters to have their sampling done by 10s into the previous slot, while the proposer has a bit more time. More complexity, more timing assumptions. Also, this is just one attack, and it's not clear what the entire attack surface looks like.

There is a clear solution: the custody requirement needs to be high enough to provide strong guarantees even before we get to sampling (see here as well). High enough here means somewhere between 4 and 8, depending on the adversarial model we want to work with. With that, an attacker that does not control a lot of validators would fail at accruing many votes for a < 50% available block, and so it would be easily reorgable through proposer boost.
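
To put rough numbers on this, a back-of-the-envelope sketch (plain Python, using the 15-of-32 scenario from the attack above and assuming custody subnets are assigned uniformly at random):

```python
from math import comb

SUBNET_COUNT = 32        # DATA_COLUMN_SIDECAR_SUBNET_COUNT in the example above
AVAILABLE_SUBNETS = 15   # the attacker only publishes columns on 15 of the 32 subnets

def custody_check_pass_probability(custody: int) -> float:
    """Probability that all `custody` randomly assigned subnets of a node fall inside
    the published subset, i.e. the node sees nothing missing before sampling and votes."""
    return comb(AVAILABLE_SUBNETS, custody) / comb(SUBNET_COUNT, custody)

for custody in (1, 2, 4, 8):
    print(custody, round(custody_check_pass_probability(custody), 4))
# 1 -> ~0.47, 2 -> ~0.21, 4 -> ~0.04, 8 -> ~0.0006:
# with a custody requirement of 4-8, very few honest validators would vote for such a block.
```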

Some related things to keep in mind:

  • The efficiency gain we get in the distribution phase of PeerDAS compared to 4844 is DATA_COLUMN_SIDECAR_SUBNET_COUNT / CUSTODY_REQUIREMENT / 2, because nodes are required to custody CUSTODY_REQUIREMENT / DATA_COLUMN_SIDECAR_SUBNET_COUNT of the whole data, which is extended by 2x. For example, with current parameters PeerDAS would be 16x more efficient than 4844 (ignoring sampling): everyone downloads 1/32 of the 2x extended data, so an average throughput of 48 blobs would require the equivalent of the 4844 bandwidth for distribution. Even a much more modest ratio of 5x lets us move to 16/32 blobs with hardly any bandwidth increase (just a little bit for sampling). A small sketch of this ratio follows this list.
  • By increasing the number of subnets, we can increase CUSTODY_REQUIREMENT without affecting the above-mentioned ratio, or we can at least recover some of the lost efficiency. If we want to stick with 32 subnets, we could for example set the CUSTODY_REQUIREMENT to 4, which gives a 4x gain. In the initial rollout, we could even be more conservative, even if it does not allow much of a blob count increase. If we are ok with having 64 subnets like we do for attestations (and possibly all fitting together in the network shard paradigm?), then reasonable values could be 4/64 (8x), 6/64 (~5x), 8/64 (4x). Since in the short term we're likely not going to want to go past a max of 32 blobs, there might not be much reason to go beyond these values by moving to, e.g., 128 subnets.
  • A higher CUSTODY_REQUIREMENT / DATA_COLUMN_SIDECAR_SUBNET_COUNT ratio also means that we don't need as many honest peers in order to have good guarantees about being able to get our samples. Peer sampling can be generally more robust, and less dependent on there being many nodes with a high advertised custody.
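
The sketch referenced above, covering the parameter combinations mentioned in these bullets (a back-of-the-envelope calculation, not spec code):

```python
def distribution_gain(subnet_count: int, custody_requirement: int) -> float:
    """Per-node bandwidth gain in the distribution phase relative to 4844:
    each node custodies custody_requirement / subnet_count of the 2x-extended data."""
    return subnet_count / custody_requirement / 2

for custody, subnets in [(1, 32), (4, 32), (16, 32), (4, 64), (6, 64), (8, 64)]:
    print(f"{custody}/{subnets}: {distribution_gain(subnets, custody):.1f}x")
# 1/32 -> 16.0x (current parameters), 4/32 -> 4.0x, 16/32 -> 1.0x (everyone downloads
# the equivalent of the full data), 4/64 -> 8.0x, 6/64 -> 5.3x, 8/64 -> 4.0x
```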

Imo it makes a lot of sense to move from 4844 to PeerDAS gradually. We can do this not only by slowly increasing the blob count, but also by slowly decreasing the minimum proportion of data custodied by each node, i.e., the CUSTODY_REQUIREMENT / DATA_COLUMN_SIDECAR_SUBNET_COUNT ratio. For example, we could start with 3/6 blobs, 32 subnets, a custody requirement of 16, i.e., unchanged throughput and everyone still downloads the whole data, just changing the networking. At this point, we wouldn't even need sampling yet, and we could introduce it without it actually doing anything, just to test the behavior on mainnet. We could then fully introduce sampling while moving to 6/12 blobs and a custody requirement of 8, then 12/24 blobs and custody requirement of 4. From there, we can increase the subnet count to 64 etc...

How many SAMPLES_PER_SLOT to hit the security level we want?

I don't see why we would want more than 16, or even 16 - CUSTODY_REQUIREMENT.

fradamt avatar Apr 08 '24 10:04 fradamt

Is it worth also increasing the TARGET_NUMBER_OF_PEERS (currently 70), in addition to increasing the CUSTODY_REQUIREMENT?

With a target peer count of 70, and each peer subscribing to one subnet (out of 32), a healthy target peer count per subnet would be ~2 on average. This could impact the proposer's ability to disseminate data columns to all 32 subnets successfully, and could potentially lead to data loss, assuming the proposer isn't custodying all columns. We could make an exception and have the proposer custody all columns, but it feels cleaner to just make sure we disseminate the samples reliably.

Although if we increase CUSTODY_REQUIREMENT to 4 this would already significantly reduce the likelihood of having insufficient peers in a subnet.
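
A quick sanity check of these averages, including the 6/64 case mentioned in the reply below (assuming custody subnets are assigned uniformly at random across peers):

```python
TARGET_NUMBER_OF_PEERS = 70

def expected_peers_per_subnet(custody_requirement: int, subnet_count: int) -> float:
    """Expected number of peers subscribed to any given column subnet, assuming each
    peer custodies `custody_requirement` out of `subnet_count` subnets at random."""
    return TARGET_NUMBER_OF_PEERS * custody_requirement / subnet_count

print(expected_peers_per_subnet(1, 32))   # ~2.2: the concern above
print(expected_peers_per_subnet(4, 32))   # ~8.8: with CUSTODY_REQUIREMENT = 4
print(expected_peers_per_subnet(6, 64))   # ~6.6: the "~7 peers per subnet" case below
```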

jimmygchen avatar May 01 '24 12:05 jimmygchen

Is it worth also increasing the TARGET_NUMBER_OF_PEERS (currently 70), in addition to increasing the CUSTODY_REQUIREMENT?

With a target peer count of 70, and each peer subscribing to one subnet (out of 32), a healthy target peer count per subnet would be ~2 on average. This could impact the proposer's ability to disseminate data columns to all 32 subnets successfully, and could potentially lead to data loss, assuming the proposer isn't custodying all columns. We could make an exception and have the proposer custody all columns, but it feels cleaner to just make sure we disseminate the samples reliably.

Although if we increase CUSTODY_REQUIREMENT to 4 this would already significantly reduce the likelihood of having insufficient peers in a subnet.

We really shouldn't keep the CUSTODY_REQUIREMENT as is (even 4 is low) unless we go with a non-trailing fork-choice, so this shouldn't be as much of a problem in the short term. That said, if all clients agree that it's ok to do so, I think increasing the TARGET_NUMBER_OF_PEERS would be great, because even in the best case we'd have an average of ~7 peers per subnet (e.g. with CUSTODY_REQUIREMENT = 6 and 64 subnets). It also gives us more room to relax the custody ratio later.

fradamt avatar May 03 '24 08:05 fradamt

Something that I think should be added to the open questions is validator custody: should validators have their own custody assignment, at the very least when they're voting, if not even in every slot? This has two benefits:

  • If an unavailable block is finalized, validators can be asked (out of protocol) to provide the data they were supposed to custody, and socially slashed if they fail to do so after some deadline
  • There are two reasons to increase the CUSTODY_REQUIREMENT. One is to ensure that the average number of peers per subnet is sufficiently high, and another is to ensure that most validators won't vote for an unavailable block (the pre-sampling guarantees discussed here). Depending on TARGET_NUMBER_OF_PEERS, the former might require less custody than the latter, so the extra load can just be on validators, which need the extra custody for voting securely, and not on simple full nodes, for which it is unnecessary extra work.

Just as an example, we could set CUSTODY_REQUIREMENT to 4 and VALIDATOR_CUSTODY_REQUIREMENT to 2.

cc @adietrichs

fradamt avatar May 03 '24 09:05 fradamt

How many SAMPLES_PER_SLOT to hit the security level we want?

I have my LossyDAS for PeerDAS notebook here: https://colab.research.google.com/drive/18uUgT2i-m3CbzQ5TyP9XFKqTn1DImUJD

Of course it also covers the zero-losses-allowed case. The main question here, I think, is setting the security level we want to achieve. Any thoughts on that?
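
For reference, a small sketch of the kind of calculation involved (plain Python; assumes NUMBER_OF_COLUMNS = 128 and a worst-case adversary that publishes one column fewer than the reconstruction threshold; allowed_misses > 0 corresponds to the LossyDAS relaxation):

```python
from math import comb

NUMBER_OF_COLUMNS = 128                             # extended columns (assumed parameterization)
RECONSTRUCTION_THRESHOLD = NUMBER_OF_COLUMNS // 2   # >= 64 columns => data is reconstructable

def false_acceptance_probability(samples: int, allowed_misses: int = 0) -> float:
    """Probability that a sampler accepts a non-reconstructable blob: the adversary
    publishes 63 columns (one short of reconstruction) and the sampler draws `samples`
    distinct columns, tolerating up to `allowed_misses` failures."""
    available = RECONSTRUCTION_THRESHOLD - 1
    missing = NUMBER_OF_COLUMNS - available
    total = comb(NUMBER_OF_COLUMNS, samples)
    return sum(
        comb(available, samples - i) * comb(missing, i) / total
        for i in range(allowed_misses + 1)
    )

print(false_acceptance_probability(16))     # ~4e-6 with SAMPLES_PER_SLOT = 16, no misses allowed
print(false_acceptance_probability(20, 2))  # LossyDAS: tolerate up to 2 missed samples out of 20
```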

cskiraly avatar May 13 '24 07:05 cskiraly

I see the following in the spec: TARGET_NUMBER_OF_PEERS should be tuned upward in the event of failed sampling.

What are we trying to address with this? If it remains in the spec, I think there should also be a mechanism (or recommendations) for returning to the original values.

cskiraly avatar May 13 '24 07:05 cskiraly

Regarding TARGET_NUMBER_OF_PEERS: We need peers for two different things:

  • building the overlays, which is at the subnet level
  • sampling, which is at the column level

For the sampling, peer count is important, because the mechanism to sample fast from nodes that are not peers is not yet there, so I see this driving the TARGET_NUMBER_OF_PEERS requirements. For the subnets, instead, my assumption would be that you can change your peer set based on the subnets assigned. If rotation is not too fast (or if there is no rotation), this should be doable. In that case, what you need is to reach the target degree (plus some margin) on custody_size subnets.

I think TARGET_NUMBER_OF_PEERS should be tuned based on these two requirements, with sufficient safety margins.

cskiraly avatar May 13 '24 07:05 cskiraly

I am closing this issue because it seems stale. Please do not hesitate to reopen it if this is a mistake.

leolara avatar Jun 10 '25 09:06 leolara