solana
solana copied to clipboard
Sets default full snapshot archive interval to 50,000 slots
Problem
Right now full snapshots are ~45 GB each, and archiving/packaging takes a significant amount of time. If we can do this less often, that would be a win for everyone. Additionally, this gives more opportunity for a single full snapshot to be "useful". Meaning on restart it's more likely that an existing full snapshot will still be fresh, and only an incremental snapshot will be required to download. For comparison, incremental snapshots usually are in the 1 - 3 GB range.
Soon, the Incremental Accounts Hash feature gate will be enabled on mnb. This reduce the amount of work the accounts hash calculation must perform for incremental snapshots. Increasing the full snapshot interval will mean that we can do the cheaper incremental accounts hash more often (although not a ton more).
Summary of Changes
Change the default for --full-snapshot-interval-slots to 50,000.
are there any potential downsides?
what is the difference in peak incremental snapshot size at a full snapshot cadence of 50k from the 1-3GB at 25k?
For comparison, incremental snapshots usually are in the 1 - 3 GB range.
Similar concern as Trent - how big will incrementals be at the end of the 50k full interval? Not that we should be held hostage by poorly performing nodes, but I know some nodes were struggling to get incrementals done in time for that 100 slot interval. Wondering if this change will exacerbate that issue / wonder if we need to bump the incremental interval up in tandem
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (
564f1a9) 81.9% compared to head (6b339e0) 81.9%. Report is 547 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #34237 +/- ##
=========================================
- Coverage 81.9% 81.9% -0.1%
=========================================
Files 819 819
Lines 219380 219380
=========================================
- Hits 179734 179709 -25
- Misses 39646 39671 +25
what is the difference in peak incremental snapshot size at a full snapshot cadence of 50k from the 1-3GB at 25k?
I've started up a node to gather this information. I'll reply back once it has gone through a full snapshot interval and I can get all the incremental snapshot sizes.
The node has now gone through multiple full snapshot intervals. The size of incremental snapshots over this interval go from ~650 MB to ~6.5 GB.
Related, here's the Snapshot Packager Service.
Something to notice, yes, by making the full snapshot interval longer, the incremental snapshots do start taking longer to package up. Just before halfway through the full snapshot interval, this node now takes more time to package up an incremental snapshot than the 100 slots default interval.
The SPS implements a priority queue, so if there are multiple outstanding incremental snapshots to package, only the latest will be packaged and the older ones will be dropped. So we aren't doing extra work here in SPS. We are doing a bit extra work in AccountsBackgroundService and AccountsHashVerifier, by snapshotting a bank and then calculating the accounts hash even though this will not end up getting packaged up.
Based on this, I'd be inclined to bump the default incremental snapshot interval to 200 slots. Wdyt?
The SPS implements a priority queue, so if there are multiple outstanding incremental snapshots to package, only the latest will be packaged and the older ones will be dropped.
is this worded correctly? if we're always preferring the newest request, it sounds like we can get into a situation where we never complete any request. that is, it would make more sense to prioritize a request that is presently being serviced and drop any requests that arrive in the meantime.
So we aren't doing extra work here in SPS. We are doing a bit extra work in AccountsBackgroundService and AccountsHashVerifier, by snapshotting a bank and then calculating the accounts hash even though this will not end up getting packaged up.
what would it take to cancel work all the way up the chain?
Based on this, I'd be inclined to bump the default incremental snapshot interval to 200 slots. Wdyt?
what does it look like in practice?
what problem are we trying to solve again here? all that's been cited is an amorphous "win" afaict
This repository is no longer in use. Please re-open this pull request in the agave repo: https://github.com/anza-xyz/agave