nearcore icon indicating copy to clipboard operation
nearcore copied to clipboard

RocksDB: rocksdb/rust-rocksdb forks, single sharded LRU cache, explicit cache for ProcessedBlockHeights, no checksum validation

Open EdvardD opened this issue 1 year ago • 1 comments

This PR was build on top of rust-rocksdb v0.19.0 which were in master at the time. Now it's reverted back to v0.18.0. The same can be done with v0.18.0 but will require to recollect all stats so probably it worth to be merged when v0.19.0 is returned back.

Average read time for europe-west4, n2-standard-8 machine:

testnet archival under 'sweat'-load:

All columns: 493us -> 344us (-30%)
State: 1.44ms -> 1.2ms (-17%)
ChunkHashesByHeight: 475us -> 211us (-55%)
HeaderHashesByHeight: 2.47ms -> 781us (-68%)
BlockPerHeight: 399us-> 53us (-86%)
BlockRefCount: 2.21ms -> 820us (-62%)
ProcessedBlockHeights: 92.8ms -> 13us (-99.9%)

mainnet archival under normal load for 18 hours:

All columns: 409us -> 197us (-53%)
State: 1.90ms -> 1.64ms (-13%)
ChunkHashesByHeight: 253us -> 225us (-11%)
HeaderHashesByHeight: 1.68ms -> 744us (-55%)
BlockPerHeight: 66us -> 63us (neutral)
BlockRefCount: 1.4ms -> 767us (-45%)
ProcessedBlockHeights: 105ms -> 17us (-99.9%)

testnet rpc under 'sweat'-load for 24 hours:

All columns: 284us -> 230us (-19%)
State: 920us -> 816ms (-11%)
ChunkHashesByHeight: 30us -> 29us (neutral)
HeaderHashesByHeight: 40us -> 39us (neutral)
BlockPerHeight: 23us-> 23us (neutral)
BlockRefCount: 42us -> 41us (neutral)
ProcessedBlockHeights: 30us -> 27us (neutral)

EdvardD avatar Aug 16 '22 15:08 EdvardD

This looks seriously impressive, thanks @EdvardD !

matklad avatar Aug 16 '22 16:08 matklad

re disabling CRC (change in table/block_based/reader_common.cc):

 Status VerifyBlockChecksum(ChecksumType type, const char* data,
                            size_t block_size, const std::string& file_name,
                            uint64_t offset) {
+  return Status::OK();
   PERF_TIMER_GUARD(block_checksum_time);

Did I miss any discussion about this? Upon a first blush, this seems like a major change with non-trivial implications to many things, among them potentially execution determinism (if we can’t detect data having been corrupted, nodes might start computing different outcomes… maybe that’s fine? I don’t know.)

Do you know what the contribution to the perf improvement is from the num_shards change vs the CRC disable?

nagisa avatar Sep 01 '22 08:09 nagisa

Thanks for the review here @nagisa. With Edvard's departure, we will have to figure out who can now pick this work up.

Do you know what the contribution to the perf improvement is from the num_shards change vs the CRC disable?

This seems like a good question. It appears to me that the above experiments combined both the changes together.

Did I miss any discussion about this? Upon a first blush, this seems like a major change with non-trivial implications to many things, among them potentially execution determinism (if we can’t detect data having been corrupted, nodes might start computing different outcomes… maybe that’s fine? I don’t know.)

I think there was some high level discussion in one of the weekly storage meetings. And the initial conclusion was that it should not matter but I think it is worth it to have a discussion in more public forum (zulip) and have the conclusions be documented.

akhi3030 avatar Sep 01 '22 09:09 akhi3030

yeah, skipping CRC seems perilous, esp in the light of https://github.com/near/nearcore/issues/7515

matklad avatar Sep 01 '22 11:09 matklad

Another way to view it is that disk corruption is just one possible corruption. Really we need a system which doesn’t penalise validators for a cosmic rays bit flipping data in their RAM. Having said that, it’s not necessarily that I disagree with the concerns.

mina86 avatar Sep 01 '22 14:09 mina86

Don't think we will working on this PR anymore.

akhi3030 avatar Nov 29 '22 13:11 akhi3030