Sync may stall if SimpleSidecarRetriever's RPC calls fail
2025-11-28 14:47:30.988 DEBUG - ReqResp outbound data_column_sidecars_by_root, columns: 0/1 in 1900 ms, peer 0xf7628e30...c1ef83e9: request: 8 columns: 0x0025cc167bfbf651b62c605d9d1bcafab09502bdcd8b6c404d0ae8c1b01c47c6 colIdxs: (len: 8) [24,28,52,61,71,73,90,106], response: <empty>
2025-11-28 14:47:30.989 DEBUG - SimpleSidecarRetriever.Request failed for DataColumnSlotAndIdentifier[slot=1843631, blockRoot=0x0025cc167bfbf651b62c605d9d1bcafab09502bdcd8b6c404d0ae8c1b01c47c6, columnIndex=28] due to: DasColumnNotAvailableException
2025-11-28 14:47:30.989 DEBUG - SimpleSidecarRetriever.Request failed for DataColumnSlotAndIdentifier[slot=1843631, blockRoot=0x0025cc167bfbf651b62c605d9d1bcafab09502bdcd8b6c404d0ae8c1b01c47c6, columnIndex=24] due to: DasColumnNotAvailableException
2025-11-28 14:47:30.989 DEBUG - SimpleSidecarRetriever.Request failed for DataColumnSlotAndIdentifier[slot=1843631, blockRoot=0x0025cc167bfbf651b62c605d9d1bcafab09502bdcd8b6c404d0ae8c1b01c47c6, columnIndex=61] due to: DasColumnNotAvailableException
2025-11-28 14:47:30.989 DEBUG - SimpleSidecarRetriever.Request failed for DataColumnSlotAndIdentifier[slot=1843631, blockRoot=0x0025cc167bfbf651b62c605d9d1bcafab09502bdcd8b6c404d0ae8c1b01c47c6, columnIndex=52] due to: DasColumnNotAvailableException
2025-11-28 14:47:30.989 DEBUG - SimpleSidecarRetriever.Request failed for DataColumnSlotAndIdentifier[slot=1843631, blockRoot=0x0025cc167bfbf651b62c605d9d1bcafab09502bdcd8b6c404d0ae8c1b01c47c6, columnIndex=73] due to: DasColumnNotAvailableException
2025-11-28 14:47:30.989 DEBUG - SimpleSidecarRetriever.Request failed for DataColumnSlotAndIdentifier[slot=1843631, blockRoot=0x0025cc167bfbf651b62c605d9d1bcafab09502bdcd8b6c404d0ae8c1b01c47c6, columnIndex=71] due to: DasColumnNotAvailableException
2025-11-28 14:47:30.989 DEBUG - SimpleSidecarRetriever.Request failed for DataColumnSlotAndIdentifier[slot=1843631, blockRoot=0x0025cc167bfbf651b62c605d9d1bcafab09502bdcd8b6c404d0ae8c1b01c47c6, columnIndex=90] due to: DasColumnNotAvailableException
2025-11-28 14:47:30.989 DEBUG - SimpleSidecarRetriever.Request failed for DataColumnSlotAndIdentifier[slot=1843631, blockRoot=0x0025cc167bfbf651b62c605d9d1bcafab09502bdcd8b6c404d0ae8c1b01c47c6, columnIndex=106] due to: DasColumnNotAvailableException
2025-11-28 14:47:31.219 TRACE - SimpleSidecarRetriever.nextRound: completed: 312, errored: 8, total pending: 8, active pending: 0, new active: 0, number of custody peers: 128 cols: 0 peers
2025-11-28 14:47:32.224 TRACE - SimpleSidecarRetriever.nextRound: completed: 312, errored: 8, total pending: 8, active pending: 0, new active: 0, number of custody peers: 128 cols: 0 peers
2025-11-28 14:47:33.230 TRACE - SimpleSidecarRetriever.nextRound: completed: 312, errored: 8, total pending: 8, active pending: 0, new active: 0, number of custody peers: 128 cols: 0 peers
2025-11-28 14:47:34.236 TRACE - SimpleSidecarRetriever.nextRound: completed: 312, errored: 8, total pending: 8, active pending: 0, new active: 0, number of custody peers: 128 cols: 0 peers
2025-11-28 14:47:34.461 INFO - syncing, skipping backfilling
2025-11-28 14:47:35.241 TRACE - SimpleSidecarRetriever.nextRound: completed: 312, errored: 8, total pending: 8, active pending: 0, new active: 0, number of custody peers: 128 cols: 0 peers
2025-11-28 14:47:36.005 INFO - Syncing *** Slot: 1843688, Head slot: 1843630, Remaining slots: 58, Target chain: ae2aa0..1f1a (1843685) with 0 peers, Connected peers: 0
It seems related to https://github.com/Consensys/teku/blob/e9298a1d4d5160254cf418e61554af333db5401d/ethereum/statetransition/src/main/java/tech/pegasys/teku/statetransition/datacolumns/retriever/recovering/SidecarRetriever.java#L222
essentially the new recovering sidecar retriever is cancelling the requests, and the sampler never gets RPC responses back.
I think there is a fundamental bug in SidecarRetriever: if it decides not to recover, it must cancel only its own additional RPC calls, not the ones that originated the process (coming from the retrieve method); otherwise it cancels requests on behalf of the requester.
So it should track the origin of each request.
It should also be able to update the origin (e.g. it triggers the recovery requests and, while the process is ongoing, a retrieve call arrives for the same column).
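To make the idea concrete, here is a minimal sketch of that bookkeeping. These are not Teku's actual classes: PendingColumnRequests, Origin and the method names are hypothetical. It records who originated each pending column request, promotes a recovery-originated request to an external one when retrieve is called for the same column, and on abandonment cancels only the requests recovery itself created.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: track the origin of each pending column request so that
// abandoning recovery only cancels the retriever's own speculative requests.
final class PendingColumnRequests<K, V> {

  enum Origin { RECOVERY_ONLY, EXTERNAL }

  private static final class Pending<V> {
    final CompletableFuture<V> future;
    volatile Origin origin;
    Pending(final CompletableFuture<V> future, final Origin origin) {
      this.future = future;
      this.origin = origin;
    }
  }

  private final Map<K, Pending<V>> pending = new ConcurrentHashMap<>();

  // Called from retrieve(): the requester owns this result, so it is never cancelled on
  // its behalf. If recovery already started the same request, promote it so it survives
  // recovery being abandoned.
  CompletableFuture<V> requestExternal(final K key, final Supplier<CompletableFuture<V>> sendRpc) {
    final Pending<V> p =
        pending.computeIfAbsent(key, k -> new Pending<>(sendRpc.get(), Origin.EXTERNAL));
    p.origin = Origin.EXTERNAL;
    return p.future;
  }

  // Called when recovery speculatively requests an extra column needed only for reconstruction.
  CompletableFuture<V> requestForRecovery(final K key, final Supplier<CompletableFuture<V>> sendRpc) {
    return pending.computeIfAbsent(key, k -> new Pending<>(sendRpc.get(), Origin.RECOVERY_ONLY)).future;
  }

  // On "decide not to recover": cancel only the requests recovery itself originated.
  void abandonRecovery() {
    pending.values().removeIf(p -> {
      if (p.origin == Origin.RECOVERY_ONLY) {
        p.future.cancel(true);
        return true;
      }
      return false;
    });
  }
}
```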
The reason why the Sampler is not detecting/printing anything is due to:
https://github.com/Consensys/teku/blob/cb988a945980e45aa055cb37fa4ba2e3455fdf3d/ethereum/statetransition/src/main/java/tech/pegasys/teku/statetransition/datacolumns/DasSamplerBasic.java#L151-L152
which seems incorrect at first sight. The sampler is not a proxy; it is the originator of its own requests, and currently it is not cancelling anything, so it should treat the cancellation as an error.
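A minimal sketch of the behaviour being argued for, using plain CompletableFuture rather than Teku's own future type; trackSampledColumn, markFailed and markCompleted are hypothetical names. A sampler that originates its own requests records a cancelled request as a failed sample instead of silently swallowing the CancellationException:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;

// Hypothetical sketch of an originator-style completion handler for one sampled column.
static <T> void trackSampledColumn(
    final Object columnId,
    final CompletableFuture<T> request,
    final BiConsumer<Object, Throwable> markFailed,
    final BiConsumer<Object, T> markCompleted) {
  request.whenComplete((sidecar, error) -> {
    if (error != null) {
      // Covers RPC failures *and* cancellation: either way this sampler never got
      // its column back, so it surfaces an error rather than staying silent.
      markFailed.accept(columnId, error);
    } else {
      markCompleted.accept(columnId, sidecar);
    }
  });
}
```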
> It seems related to request.cancel() at SidecarRetriever.java#L222 (commit e9298a1): essentially the new recovering sidecar retriever is cancelling the requests, and the sampler never gets RPC responses back.
It waits until it's marked as a failed download (exceeds a timeout, 40s by default) and then cancels them here... should we wait longer than 40s for a reply to a single column?
It might be that something here isn't handling the timeout, or that the timeout is too short, but basically the retriever had been waiting until the requests timed out before it got to the point of cancelling...
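One possible shape for that (a sketch only; the helper and its parameters are hypothetical, not Teku's API): give each single-column request its own, shorter per-attempt timeout and retry it, so a slow peer turns into a retryable failure well before the overall ~40s failed-download deadline triggers cancellation.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: per-attempt timeout with a bounded number of retries for one column.
static <T> CompletableFuture<T> requestWithRetry(
    final Supplier<CompletableFuture<T>> sendRpc,
    final Duration perAttemptTimeout,
    final int attemptsLeft) {
  return sendRpc.get()
      .orTimeout(perAttemptTimeout.toMillis(), TimeUnit.MILLISECONDS)
      .exceptionallyCompose(error ->
          attemptsLeft > 1
              // Try again (ideally against a different custody peer) while attempts remain.
              ? requestWithRetry(sendRpc, perAttemptTimeout, attemptsLeft - 1)
              : CompletableFuture.failedFuture(error));
}
```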
> which seems incorrect at first sight. The sampler is not a proxy; it is the originator of its own requests, and currently it is not cancelling anything, so it should treat the cancellation as an error.
ignoreCancelException seems to be what a lot of this code is doing.
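For context, this is roughly what an "ignore the cancel exception" style handler looks like (a generic sketch, not necessarily the exact helper in the Teku code): cancellations are swallowed rather than propagated, which is consistent with the sampler never logging anything when its requests get cancelled.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Generic sketch of the pattern: real errors still propagate, but a cancellation
// completes the chain quietly, so nothing downstream is notified or logged.
static <T> CompletableFuture<T> ignoringCancellation(final CompletableFuture<T> future) {
  return future.exceptionally(error -> {
    final Throwable cause = error instanceof CompletionException ? error.getCause() : error;
    if (cause instanceof CancellationException) {
      return null; // swallow the cancellation; callers just see a missing result
    }
    throw new CompletionException(cause);
  });
}
```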
> I think there is a fundamental bug in SidecarRetriever: if it decides not to recover, it must cancel only its own additional RPC calls, not the ones that originated the process (coming from the retrieve method); otherwise it cancels requests on behalf of the requester.
On this, just looking: we only have our own calls, so it's true that we cancel the future we hand back as a tracking future, but we don't cancel external RPC calls; we strictly manage only our own (created in the SidecarRetriever.retrieve function).
Based on this, if something is not handling a cancellation of the maintaining future, it should probably be handling that. But I do need to track down what happens after the cancel: whether we free up the space to create another attempt or it is permanently failed, and on what timeframe (I think it's async and shouldn't need to be)...
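As a sketch of that follow-up (hypothetical names, plain CompletableFuture): release the pending entry whenever the tracking future settles, including on cancellation, so the retriever is free to schedule another attempt instead of treating the column as permanently failed.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: always release the pending slot when the tracking future settles.
final class PendingSlots<K> {
  private final ConcurrentHashMap<K, CompletableFuture<?>> pendingByColumn = new ConcurrentHashMap<>();

  <T> CompletableFuture<T> track(final K columnKey, final CompletableFuture<T> trackingFuture) {
    pendingByColumn.put(columnKey, trackingFuture);
    trackingFuture.whenComplete((result, error) -> {
      // Free the slot on success, failure, or cancellation so a new attempt can be created.
      pendingByColumn.remove(columnKey, trackingFuture);
      if (error instanceof CancellationException) {
        // A retry could be scheduled here immediately rather than waiting for the next round.
      }
    });
    return trackingFuture;
  }
}
```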