
not deleting appendvec files so we don’t have to waste time untarring a snapshot

xiangzhu70 opened this issue 3 years ago

Problem

Part of the current validator startup is to clear the accounts directory, deleting the appendvec files, and then untar the snapshot archive to recreate the appendvec files. Untarring the large appendvec files takes a long time. If the appendvec files already exist and are the same as the ones in the snapshot, untarring them can be skipped to save some startup time.

Proposed Solution

Do not delete the appendvec files, and skip untarring them from the snapshot.

xiangzhu70 avatar Sep 14 '22 21:09 xiangzhu70

Moving my points from Slack to here for more visibility.

Assuming you have a snapshot, and the append vecs should be consistent with that snapshot:

  1. Unpack the snapshot only to get the bank fields (and version probably) but NOT the appendvecs themselves. I'd do this serially since that should be a single file at the beginning of the snapshot.
  2. That will give you a list of appendvecs that were used to generate that snapshot, as well as other fields, including the accounts hash.
  3. Load appendvecs from the file system:
     a. make sure they are in the set from the snapshot file;
     b. make sure you have ALL the appendvecs from the set you got from the snapshot file.
  4. Generate the index and continue with normal startup, assuming the checks in 3 mean that you have all the accounts.
     a. As part of normal startup from a snapshot, we calculate the accounts hash and compare it to what was in the snapshot file.

I think 4a is sufficient to confirm that you have loaded the accounts from the snapshot (nothing more, nothing less). Technically there is a chance of a hash collision, but it's incredibly unlikely given the other checks we make.
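For concreteness, here is a minimal sketch of the set checks in step 3, using only std types. The function name and its arguments are hypothetical stand-ins, not the real solana-runtime API; the real check would run over whatever list of appendvec names the unpacked bank fields provide.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical stand-in for step 3: compare the appendvec files on
// disk against the set the snapshot's bank fields reference.
fn verify_appendvec_set(
    accounts_dir: &Path,
    expected: &HashSet<String>, // appendvec names from the unpacked snapshot fields
) -> io::Result<bool> {
    let mut on_disk = HashSet::new();
    for entry in fs::read_dir(accounts_dir)? {
        on_disk.insert(entry?.file_name().to_string_lossy().into_owned());
    }
    // 3a: every file on disk is in the snapshot's set;
    // 3b: every appendvec the snapshot references is on disk.
    Ok(on_disk.is_subset(expected) && expected.is_subset(&on_disk))
}
```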

apfitzge avatar Sep 14 '22 21:09 apfitzge

See PR https://github.com/solana-labs/solana/pull/27854 for the initial version.

Tested the following: if the accounts/ directory is empty, it untars the accounts files from the snapshot as before; if the accounts files exist, it skips untarring them. It goes through bank.verify_snapshot_bank() and passes the checks in the same way. I manually modified one of the files to remove a few bytes. The first try failed with "incorrect layout/length/data". It then removed all the accounts files, made the call again, and succeeded.

I don't know how to fake the file data so that it fails in bank.verify_snapshot_bank(), but manually forcing it to return false caused a panic in bank_from_snapshot_archives(). Maybe I will move the retry logic into bank_from_snapshot_archives(), so the retry can be done immediately after the hash failure.

xiangzhu70 avatar Sep 17 '22 02:09 xiangzhu70

In bank_from_snapshot_archives, it panics if !bank.verify_snapshot_bank() && limit_load_slot_count_from_snapshot.is_none()

What is this limit_load_slot_count_from_snapshot? When bank.verify_snapshot_bank() returns false, something is already wrong, so why does limit_load_slot_count_from_snapshot prevent the panic?

I am adding the retry logic as: first try with the existing accounts files; if verify_bank() fails, delete the files and retry. How should limit_load_slot_count_from_snapshot affect the retry condition?
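The retry shape I have in mind looks roughly like the sketch below; load_bank_with_retry and its closure are hypothetical, wrapping whatever build-and-verify call we settle on.

```rust
use std::fs;
use std::path::Path;

// Hypothetical retry wrapper: the first attempt reuses the appendvec
// files already on disk; on verification failure, wipe accounts/ and
// rebuild by untarring from the snapshot archive as before.
fn load_bank_with_retry<B, E>(
    accounts_dir: &Path,
    mut build_and_verify: impl FnMut() -> Result<B, E>,
) -> Result<B, E> {
    match build_and_verify() {
        Ok(bank) => Ok(bank),
        Err(_) => {
            // Stale or corrupt files: clear them and retry once.
            let _ = fs::remove_dir_all(accounts_dir);
            let _ = fs::create_dir_all(accounts_dir);
            build_and_verify()
        }
    }
}
```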

xiangzhu70 avatar Sep 19 '22 17:09 xiangzhu70

With https://github.com/solana-labs/solana/pull/27854 I see a few problems:

No file is skipped, because after the unpacking, in rebuilder.process_complete_slot(slot), the appendvec is remapped. Given that the root is frozen (fixed), why would the appendvecs be remapped to different files? For now, I disabled remapping so that the files can be skipped.

With skipping effective, the saving is about 1/3 of the unpacking time: on a 32GB snapshot, the 115.8s time is reduced to 77.2s. I tried a single-threaded standalone app and confirmed that the remaining time is what it takes to run the decompression of all the files, even when no unpacking is done. To avoid spending time on decompression, we can abort the decompression right after the snapshot/ files (the metadata about the snapshot) are unpacked, skipping the rest of the accounts files. There is a version file at the end of the snapshot, which should be moved into snapshot/. The current design populates the storage hashmap while unpacking the accounts files; with that skipped, the storage hashmap has to be populated from the metadata in snapshot/.

So, the 3 changes below will be made (a sketch of change 2's effect follows the list):

  1. Disable remapping after unpacking from a snapshot;
  2. Move the version file in the snapshot from the end to the early part, into snapshot/;
  3. Populate the storage hashmap from the metadata in snapshot/.
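To illustrate change 2's payoff, here is a rough sketch of aborting decompression early, assuming a zstd-compressed tar whose metadata entries come first; the tar and zstd crates stand in for the real unpack path, and error handling is elided.

```rust
use std::fs::File;
use tar::Archive;

// Sketch: stop decompressing once the metadata entries (the snapshot
// fields and the version file) have been read, skipping the large
// accounts/ entries entirely. Assumes the metadata is ordered first.
fn read_snapshot_metadata(archive_path: &str) -> std::io::Result<()> {
    let file = File::open(archive_path)?;
    let decoder = zstd::stream::read::Decoder::new(file)?;
    let mut archive = Archive::new(decoder);
    for entry in archive.entries()? {
        let entry = entry?;
        if entry.path()?.starts_with("accounts") {
            break; // metadata done; skip decompressing the appendvecs
        }
        // ... unpack the snapshot metadata and version file here ...
    }
    Ok(())
}
```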

xiangzhu70 avatar Sep 26 '22 18:09 xiangzhu70

On point 1:

  • We remap because a full and an incremental snapshot could come from different sources, which can lead to a collision in the AppendVecIds.
  • We could potentially just disable the remapping for the full snapshot, and leave it in place for remote incremental snapshots (the sketch below shows the collision that remapping avoids).
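To make the collision concrete, a toy sketch with hypothetical names: two snapshots that each number their appendvecs from 0 will collide on id when merged, which is what reassigning fresh ids avoids.

```rust
use std::collections::HashMap;

type AppendVecId = usize;

// Toy illustration: merge full + incremental storages, reassigning
// ids from one counter so ids that collide across sources stay unique.
fn remap_ids(
    full: HashMap<AppendVecId, String>,        // id -> appendvec path
    incremental: HashMap<AppendVecId, String>, // ids may collide with `full`
) -> HashMap<AppendVecId, String> {
    let mut next_id: AppendVecId = 0;
    let mut merged = HashMap::new();
    for (_old_id, path) in full.into_iter().chain(incremental) {
        merged.insert(next_id, path);
        next_id += 1;
    }
    merged
}
```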

On point 2:

  • What do you mean by "with skipping effective"?
  • Agree that this (not decompressing the tar after the snapshot and version) is the ideal case.

The three changes:

  1. Is this only for snapshots loaded from disk without unpacking? The purpose of remapping is to avoid AppendVecId collisions from different snapshot sources (full & incremental could come from different nodes).
  2. has already been completed: #27192
  3. 👍

apfitzge avatar Sep 26 '22 18:09 apfitzge

Had a video meeting with @brooksprumo, @HaoranYi, and @apfitzge. Thank you all for the great help!

As Brooks suggested, we will first try constructing the bank from the file-system directories (snapshot/ and accounts/) instead of from the highest snapshot archive file. Only when it fails to pass bank.verify_snapshot_hash will we fall back to using the highest snapshot archive.

A small issue with building the bank from the directories is that if the previous process crashed or stopped in the middle of flush/clean/shrink or of calculating the hash, the files on the file system may not be in a good state. The bank.verify_snapshot_hash check will validate the state, and we will properly fall back to using the snapshot archive in case the files in the snapshot and accounts directories are not in a good state.

xiangzhu70 avatar Sep 27 '22 21:09 xiangzhu70

Does bank.verify_snapshot_hash take into account all the accounts in the account package? If so, then we might need to modify the hash verification from on-disk account appendvecs to exclude any accounts that are after the bank snapshot slot.
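If that exclusion is needed, it could be as simple as filtering the storages by slot before hashing; a sketch with stand-in types (the real storage map lives in solana-runtime):

```rust
use std::collections::HashMap;

type Slot = u64;

// Stand-in sketch: when hashing from on-disk appendvecs, only count
// storages at or below the bank's snapshot slot; anything newer was
// written after the snapshot and must be excluded.
fn storages_for_hash<T>(storages: &HashMap<Slot, T>, snapshot_slot: Slot) -> Vec<&T> {
    storages
        .iter()
        .filter(|(slot, _)| **slot <= snapshot_slot)
        .map(|(_, storage)| storage)
        .collect()
}
```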

HaoranYi avatar Sep 27 '22 23:09 HaoranYi

Here is the current state of the files. If we build the bank from these files, only snapshot/600/600.pre and accounts/* should be used, correct? @brooksprumo

```
xiangzhu@Xiangs-MacBook-Pro-2 bootstrap-validator % ls snapshot.ledger-tool
xiangzhu@Xiangs-MacBook-Pro-2 bootstrap-validator % ls snapshot/
100  200  300  400  500  600
xiangzhu@Xiangs-MacBook-Pro-2 bootstrap-validator % ls accounts/
0.2  100.0  200.3  224.5  300.4  400.6  480.8  500.7  600.9  96.1
xiangzhu@Xiangs-MacBook-Pro-2 bootstrap-validator % ls accounts.ledger-tool
accounts
xiangzhu@Xiangs-MacBook-Pro-2 bootstrap-validator % ls accounts.ledger-tool/accounts
0.2  480.8  500.7  600.9
```

xiangzhu70 avatar Sep 28 '22 17:09 xiangzhu70

Hi team. Here is an updated plan.

The existing flow of constructing a bank from the snapshot archive is as below

bank_forks_from_snapshot → bank_from_latest_snapshot_archives → bank_from_snapshot_archives

In bank_from_snapshot_archives, the two major parts are as below (part 1's pipeline is sketched after the list):

  1. verify_and_unarchive_snapshots:
     a. generate a message channel pair (file_sender, file_receiver);
     b. in multiple unpack threads, decompress and unpack the files and send the file names to the receiver thread;
     c. in the receiver thread (SnapshotStorageRebuilder::rebuild_storage), construct appendvecs following the file list in accounts/, remap them, and fill them into the storage.
  2. rebuild_bank_from_snapshots (with storage passed in as part of StorageAndNextAppendVecId).
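For reference, part 1 boils down to the pipeline shape below; the closures are stand-ins over std channels, not the real code:

```rust
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

// Rough shape of verify_and_unarchive_snapshots: unpack threads
// stream extracted file paths over a channel to a single rebuilder
// thread, which fills the storage map.
fn unpack_pipeline(archive_shards: Vec<PathBuf>) {
    // 1a: the (file_sender, file_receiver) channel pair.
    let (file_sender, file_receiver) = mpsc::channel::<PathBuf>();

    // 1b: one unpack thread per shard, sending each extracted file name.
    let unpackers: Vec<_> = archive_shards
        .into_iter()
        .map(|shard| {
            let tx = file_sender.clone();
            thread::spawn(move || {
                // ... decompress/unpack `shard`, then for each file:
                tx.send(shard).unwrap();
            })
        })
        .collect();
    drop(file_sender); // close the channel once all unpackers finish

    // 1c: the rebuilder consumes file names (SnapshotStorageRebuilder
    // in the real code) and constructs/remaps appendvecs into storage.
    let rebuilder = thread::spawn(move || {
        for _file in file_receiver {
            // ... open the appendvec, remap, insert into storage ...
        }
    });

    unpackers.into_iter().for_each(|t| t.join().unwrap());
    rebuilder.join().unwrap();
}
```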

To construct a bank from the snapshot/ directory, I was originally thinking of letting rebuild_bank_from_snapshots not use an externally passed-in storage, and instead adding a low-level bank_from_stream function to directly read the snapshot appendvec entries and construct the storage internally. But now I think it will probably be better to reuse most of the SnapshotStorageRebuilder functions and flow, so the storage-building code stays the same.

So, it will be like

bank_forks_from_snapshot → bank_from_latest_snapshot_file → bank_from_snapshot_file

In bank_from_snapshot_file, the 2nd part, rebuild_bank_from_snapshots, will be the same. In the first part, the storage will be built similarly to the archive way:

  1. SnapshotStorageRebuilder::new() will still be used and its storage field will be initialized in the same way.
  2. There will be no more threads. It will be like the one thread in 1.c, which does one pass over all appendvec files.
  3. Instead of following the file list in accounts/, follow the appendvecs in snapshot/ and compare them with the file list in accounts/. The ones not referenced in snapshot/ will be deleted. The ones referenced in snapshot/ but missing from accounts/ will cause the process to abort and fall back to the original snapshot-archive way (step 3 is sketched after this list).
  4. No remapping before adding into the storage.
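Step 3's reconciliation, sketched with std types only; the names are hypothetical, not the real SnapshotStorageRebuilder API:

```rust
use std::collections::HashSet;
use std::fs;
use std::path::Path;

// Hypothetical sketch: extras on disk are deleted, while a file
// referenced by snapshot/ but missing from accounts/ aborts the fast
// path so we fall back to the archive.
fn reconcile_accounts_dir(
    accounts_dir: &Path,
    referenced: &HashSet<String>, // appendvec names listed in snapshot/
) -> std::io::Result<bool> {
    let mut on_disk = HashSet::new();
    for entry in fs::read_dir(accounts_dir)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        if referenced.contains(&name) {
            on_disk.insert(name);
        } else {
            // Not referenced by snapshot/: stale, safe to delete.
            fs::remove_file(entry.path())?;
        }
    }
    // A missing referenced file => abort and use the archive instead.
    Ok(referenced.is_subset(&on_disk))
}
```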

At this point, I assume the snapshot directory represents a full snapshot, but I think the same process could possibly support incremental snapshot directories.

The above is to construct the bank. After that, bank.verify_snapshot_hash() will be called to verify the bank hash. Jeff suggested adding appendvec-level hash computation and checking; I think that will be done there.

Please let me know if there is any suggestion. Thanks!

xiangzhu70 avatar Oct 10 '22 21:10 xiangzhu70

There will be no more threads. It will be like the one thread in 1.c, which does one pass over all appendvec files.

Is this to avoid the remapping, or for speed? It seems like multiple threads would be faster; verifying each appendvec is an independent task.

apfitzge avatar Oct 11 '22 00:10 apfitzge

I thought the multiple threads were only on the unpacking side. I checked again: yes, the rebuilding is also multi-threaded. OK, I can make it the same. But looking at the code, I haven't seen which part is intensive enough to require parallel execution. As far as I can see, it is mainly calling AppendVec::new_from_file and adding the result into the storage.

xiangzhu70 avatar Oct 11 '22 00:10 xiangzhu70

I thought the multiple threads were only on the unpacking side. I checked again: yes, the rebuilding is also multi-threaded. OK, I can make it the same. But looking at the code, I haven't seen which part is intensive enough to require parallel execution. As far as I can see, it is mainly calling AppendVec::new_from_file and adding the result into the storage.

yeah, the new_from_file(...) will scan the entire appendvec. It's not doing much work, but it's all completely independent of any other appendvec, so we can pretty easily parallelize 😄
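Since each file is independent, the scan maps directly onto a worker pool; a stand-in sketch using rayon, with std::fs::read substituting for the real new_from_file:

```rust
use rayon::prelude::*;
use std::path::PathBuf;

// Each appendvec scan is independent of every other, so the per-file
// open/verify step parallelizes with a plain rayon map.
fn open_all(paths: &[PathBuf]) -> Vec<std::io::Result<Vec<u8>>> {
    paths
        .par_iter()
        .map(|path| std::fs::read(path)) // stand-in for AppendVec::new_from_file
        .collect()
}
```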

apfitzge avatar Oct 11 '22 01:10 apfitzge

Oh, I see now that new_from_file calls sanitize_layout_and_length, which does scan through the file. OK, that justifies the need for parallelization.
Maybe per-appendvec hash computation can also be done here?

xiangzhu70 avatar Oct 11 '22 01:10 xiangzhu70

Maybe per-appendvec hash computation can also be done here?

Yeah, it seems like a pretty natural place to do it if we're adding per-file hash verification. Just return a (Self, Hash) pair; otherwise we'd have to scan through the file again.
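One possible shape for that pair, hashing in the same pass that validates; the wrapper type is a stand-in and SHA-256 via the sha2 crate is an assumption (the real code would fold this into new_from_file's sanitize pass):

```rust
use sha2::{Digest, Sha256};
use std::io::Read;
use std::path::Path;

struct AppendVecFile {
    bytes: Vec<u8>, // stand-in for the real mmap-backed AppendVec
}

// Open, validate, and hash in one pass, returning the (Self, Hash)
// pair so callers never scan the file a second time.
fn new_from_file_with_hash(path: &Path) -> std::io::Result<(AppendVecFile, [u8; 32])> {
    let mut bytes = Vec::new();
    std::fs::File::open(path)?.read_to_end(&mut bytes)?;
    // ... a sanitize_layout_and_length-style validation would walk `bytes` here ...
    let hash = Sha256::digest(&bytes).into();
    Ok((AppendVecFile { bytes }, hash))
}
```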

A note for the future: when you're working with individual appendvec files and testing the hashing, you may find store-tool useful (runtime/store-tool). It would be great to add the hash as an output there too as you do this 😄

apfitzge avatar Oct 20 '22 15:10 apfitzge