teku icon indicating copy to clipboard operation
teku copied to clipboard

DataColumnSidecars Sampler should sample minimal 8 columns not 4

Open jeroost opened this issue 6 months ago • 1 comments

Per spec: At each slot, a node advertising custody_group_count downloads a minimum of sampling_size = max(SAMPLES_PER_SLOT, custody_group_count) custody groups, selected by groups = get_custody_groups(node_id, sampling_size), to which correspond the columns compute_columns_for_custody_group(group) for group in groups

PR Description

Fixed Issue(s)

fixes #9598

Documentation

  • [ ] I thought about documentation and added the doc-change-required label to this PR if updates are required.

Changelog

  • [ ] I thought about adding a changelog entry, and added one if I deemed necessary.

jeroost avatar Jun 24 '25 14:06 jeroost

Hello! Looks great, but it's a major change. Could you please add some tests?

zilm13 avatar Jun 26 '25 12:06 zilm13

Added required changes to get data through gossip subscription https://github.com/Consensys/teku/pull/9599/commits/d66e36d662a11302da8059a73a8ce1f71fdca2d1 Still not passing DasSyncAcceptanceTest sometimes, blocked as draft until it's resolved

zilm13 avatar Jun 30 '25 22:06 zilm13

There are a couple of issues with DasSyncAcceptanceTest and not sure if any are related. In general the non-supernode is 4-8 slots behind the supernode and does not catch up because of a variety of issues.

  1. Muxer exception io.libp2p.core.ConnectionClosedException: Channel with id 4eb2adfffe301501-00000001-00000001-9dd100318e6f0266-8ba6ccaf/387/true not opened at io.libp2p.etc.util.netty.mux.AbstractMuxHandler.childRead(AbstractMuxHandler.kt:61) ~[jvm-libp2p-1.2.2-RELEASE.jar:?] at io.libp2p.mux.mplex.MplexHandler.channelRead(MplexHandler.kt:32) ~[jvm-libp2p-1.2.2-RELEASE.jar:?] These seem to happen also in passing tests without this change. Except there they are not fatal and sync recovers. They are always followed with java.util.concurrent.CompletionException: tech.pegasys.teku.networking.eth2.rpc.core.RpcException$ExtraDataAppendedException: [Code 1] Extra data appended to end of message: extra data: ByteBuf{UnpooledSlicedByteBuf(ridx: 0, widx: 11096, cap: 11096/11096, unwrapped: UnpooledHeapByteBuf(ridx: 0, widx: 11116, cap: 11116/11116)), content: 0x008d12ebb6

  2. checkDataAvailability Sometimes up to 4 times the same block is checked for availability.

  3. checkDataAvailability from a locally produced block (all 128 columns available) sometimes does not give all of the columns, probably because they are still being inserted. checkDataAvailability(): got 49 (of 128) columns from custody (or received by Gossip) for block 21 (0x0ed0f801f99230e3c694121b16246c67e790da9fb0328ab694245b6ebbaff2ac), columns: (len: 49) [0..48] The result is it tries to retrieve the remaining columns from peers, but here there is only one peer and it would be better to retry custody after a small delay or to have a lock of custody write/read until all columns are inserted.

Still investigating further

jeroost avatar Jul 17 '25 08:07 jeroost