erigon
Add flags for fine-grained control of header sync
Two new flags:
--sync.headers.stepsize=N
When set, the headers stage will count headers as it inserts them, and will quit cleanly (moving on to the next stage) when it attempts to insert a header after having already successfully inserted N headers. This count is reset each time the headers stage starts.
The stepsize flag is helpful for development, to debug code changes that may generate bad blocks. Passing --sync.headers.stepsize=1 forces staged sync to act like non-staged sync, running all stages over a single block at a time. Execution that produces corrupt state will be caught immediately by IntermediateHashes for that same block, and only a single block will need to be unwound.
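To make the counting behaviour concrete, here is a minimal standalone sketch of the stepsize logic described above. It is not the code from this PR: `stepLimiter`, `runHeadersStage`, and the convention that a step size of 0 means "no limit" are all assumptions made for illustration.

```go
package main

import "fmt"

// stepLimiter is a hypothetical helper (not an Erigon type) that counts
// header insertions within one run of the headers stage.
type stepLimiter struct {
	stepSize uint64 // assumption for this sketch: 0 means "no limit"
	inserted uint64
}

// reset clears the count; it is called each time the headers stage starts.
func (l *stepLimiter) reset() { l.inserted = 0 }

// allow reports whether one more header may be inserted, and counts it if so.
func (l *stepLimiter) allow() bool {
	if l.stepSize != 0 && l.inserted >= l.stepSize {
		return false // N headers already inserted: quit cleanly
	}
	l.inserted++
	return true
}

// runHeadersStage simulates one run of the stage over pending header heights
// and returns the heights that were inserted before the limit was reached.
func runHeadersStage(l *stepLimiter, pending []uint64) []uint64 {
	l.reset()
	var inserted []uint64
	for _, height := range pending {
		if !l.allow() {
			break
		}
		inserted = append(inserted, height)
	}
	return inserted
}

func main() {
	l := &stepLimiter{stepSize: 2}
	fmt.Println(runHeadersStage(l, []uint64{10, 11, 12, 13})) // [10 11]
	fmt.Println(runHeadersStage(l, []uint64{12, 13}))         // [12 13]
}
```

Because the count is reset per run, each pass through the stage loop advances by at most N headers, which is what lets stepsize=1 approximate block-at-a-time sync.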
--sync.headers.stopat=N
When set, attempting to insert a header with a height above N will cause the headers stage to quit cleanly. Additionally, before starting to fetch, the headers stage will check whether the set of headers already in the DB are ≥N, and if so, the headers stage will immediately quit cleanly.
The stopat flag would be used to sync to an exact block, to then stop the node and produce an archive or backup. This has use-cases both for development (e.g. to stop+backup a node in a consistent state, one block before an untested upgrade will take place, so that the node can be restored from backup if the block corrupts the state) and production operations (e.g. to create "time-sharded" nodes that only contain data up to a given block.)
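The two stopat checks described above could be sketched roughly as follows. This is an illustration only, not the PR's implementation: the function names are hypothetical, and "the headers already in the DB are ≥N" is read here as "the highest header height in the DB is ≥N", which is an assumption.

```go
package main

import "fmt"

// stageShouldSkip is the pre-fetch check: if the DB already holds headers up
// to stopAt or beyond, the stage quits immediately.
// Assumption: highestInDB is the greatest header height already stored.
func stageShouldSkip(highestInDB, stopAt uint64) bool {
	return highestInDB >= stopAt
}

// mayInsertHeader is the per-header check: a header above stopAt makes the
// stage quit cleanly instead of inserting it.
func mayInsertHeader(height, stopAt uint64) bool {
	return height <= stopAt
}

// syncHeadersUntil simulates one run of the headers stage and returns the
// highest header height in the DB afterwards.
func syncHeadersUntil(highestInDB, stopAt uint64, incoming []uint64) uint64 {
	if stageShouldSkip(highestInDB, stopAt) {
		return highestInDB
	}
	for _, height := range incoming {
		if !mayInsertHeader(height, stopAt) {
			break
		}
		if height > highestInDB {
			highestInDB = height
		}
	}
	return highestInDB
}

func main() {
	fmt.Println(syncHeadersUntil(98, 100, []uint64{99, 100, 101, 102})) // 100
	fmt.Println(syncHeadersUntil(100, 100, []uint64{101}))              // 100
}
```

With both checks in place, a node restarted with the same stopat value does not advance past block N, which is the property the backup and "time-sharded" use-cases rely on.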
Things not yet implemented/working:
- [ ] limiting for POS sync (currently only implemented on POW side)
Bugs to fix, that I'm not sure how to fix — input would be appreciated:
- [ ] If the headers stage quits before starting due to stopat, all other stages run, and it comes back to the headers stage, the stage will do no work, and then all other stages will also have no work to do because of that, so the node will enter a busy loop running quickly through all the stages, accomplishing nothing. The headers stage should detect that it's the second time it's decided to do nothing, and, instead of just quitting itself, it should signal the node as a whole to shut down (or, perhaps, just shut down the sync/P2P aspect of the node, if embedded RPC is enabled. The whole node can shut down if RPC is external, or if running as cmd/sentry.)
- [ ] If the only headers acceptable to the node under these flags' limits aren't in the DB as canonical headers, but are already queued locally (due to e.g. a previous unwind), then the headers sync stage will get stuck waiting for new headers that never arrive. (This is a bug in Erigon staged sync generally, and applies even without this PR: a node holding perfectly-good headers in its queue [e.g. previously "future" headers] seemingly won't reprocess + potentially insert them, until triggered to do so by inserting at least one new header from the network. These limiting flags just make the behavior a lot more visible.)
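One possible shape for the "second time it's decided to do nothing" detection from the first bug above is sketched below. It is only a suggestion under stated assumptions: `idleDetector` and `observe` are hypothetical names, and how the shutdown signal would actually be delivered to the node (or only to its sync/P2P side) is left open.

```go
package main

import "fmt"

// idleDetector is a hypothetical helper that tracks how many consecutive
// runs of the headers stage did no work.
type idleDetector struct {
	idleRuns int
}

// observe records the outcome of one stage run. It returns true once the
// stage has done nothing on two consecutive runs, i.e. when the stage loop
// would otherwise start busy-looping.
func (d *idleDetector) observe(didWork bool) (shutdown bool) {
	if didWork {
		d.idleRuns = 0
		return false
	}
	d.idleRuns++
	return d.idleRuns >= 2
}

func main() {
	d := &idleDetector{}
	// Simulated stage-loop cycles: one productive run, then two idle runs.
	for cycle, didWork := range []bool{true, false, false} {
		if d.observe(didWork) {
			fmt.Println("requesting shutdown after cycle", cycle)
			return
		}
	}
}
```

Resetting the counter on any productive run means a single idle pass is tolerated, and only a repeat of it triggers the signal.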
Things that could be improved:
- [ ] Probably people who would use the stepsize flag don't really want to control the syncing of headers specifically, but rather want to control the step-size for staged sync as a whole. These are one-and-the-same if syncing new headers, but are not the same if many headers are already in the DB. A flag that limits maximum step-size for all stages, not just headers, would likely be more useful (but a lot more complex to implement.)
@tsutsu hi, thank you. Can you please add a couple of use-cases (why people need it) to the description? Does N=0 mean unlimited step or no sync?
@mandrigin @Giulio2002 hi. Maybe you can advise better naming for these CLI flags?
You could rename the stopat flag to --sync.headers.until= and just not add stepsize, as they technically do the same thing.
@tsutsu hi, help us to merge devel plz.
@Giulio2002 maybe --sync.until is better than --sync.headers.until? The headers/bodies separation is an implementation detail.
The Merge changed things. Now stage_header downloads headers in reverse order, which complicates such a feature. Maybe it can be part of stage_body.