nearcore
RocksDB: rocksdb/rust-rocksdb forks, single sharded LRU cache, explicit cache for ProcessedBlockHeights, no checksum validation
This PR was built on top of rust-rocksdb v0.19.0, which was in master at the time; master has since been reverted to v0.18.0. The same change can be made against v0.18.0, but that would require recollecting all the stats, so it is probably worth merging this once v0.19.0 is brought back.
Average read time for a europe-west4, n2-standard-8 machine:
testnet archival under 'sweat' load:
All columns: 493us -> 344us (-30%)
State: 1.44ms -> 1.2ms (-17%)
ChunkHashesByHeight: 475us -> 211us (-55%)
HeaderHashesByHeight: 2.47ms -> 781us (-68%)
BlockPerHeight: 399us -> 53us (-86%)
BlockRefCount: 2.21ms -> 820us (-62%)
ProcessedBlockHeights: 92.8ms -> 13us (-99.9%)
mainnet archival under normal load for 18 hours:
All columns: 409us -> 197us (-53%)
State: 1.90ms -> 1.64ms (-13%)
ChunkHashesByHeight: 253us -> 225us (-11%)
HeaderHashesByHeight: 1.68ms -> 744us (-55%)
BlockPerHeight: 66us -> 63us (neutral)
BlockRefCount: 1.4ms -> 767us (-45%)
ProcessedBlockHeights: 105ms -> 17us (-99.9%)
testnet rpc under 'sweat' load for 24 hours:
All columns: 284us -> 230us (-19%)
State: 920us -> 816us (-11%)
ChunkHashesByHeight: 30us -> 29us (neutral)
HeaderHashesByHeight: 40us -> 39us (neutral)
BlockPerHeight: 23us -> 23us (neutral)
BlockRefCount: 42us -> 41us (neutral)
ProcessedBlockHeights: 30us -> 27us (neutral)
This looks seriously impressive, thanks @EdvardD !
re disabling CRC (change in table/block_based/reader_common.cc):
Status VerifyBlockChecksum(ChecksumType type, const char* data,
                           size_t block_size, const std::string& file_name,
                           uint64_t offset) {
+  return Status::OK();
   PERF_TIMER_GUARD(block_checksum_time);
Did I miss any discussion about this? At first blush, this seems like a major change with non-trivial implications for many things, among them potentially execution determinism (if we can't detect that data has been corrupted, nodes might start computing different outcomes… maybe that's fine? I don't know.)
Do you know what the contribution to the perf improvement is from the num_shards change vs the CRC disable?
Thanks for the review here @nagisa. With Edvard's departure, we will have to figure out who can now pick this work up.
> Do you know what the contribution to the perf improvement is from the num_shards change vs the CRC disable?
That is a good question. It looks to me like the experiments above combined both changes.
> Did I miss any discussion about this? At first blush, this seems like a major change with non-trivial implications for many things, among them potentially execution determinism (if we can't detect that data has been corrupted, nodes might start computing different outcomes… maybe that's fine? I don't know.)
I think there was some high-level discussion in one of the weekly storage meetings, and the initial conclusion was that it should not matter. Still, I think it is worth having a discussion in a more public forum (Zulip) and documenting the conclusions.
Yeah, skipping CRC seems perilous, especially in light of https://github.com/near/nearcore/issues/7515.
Another way to view it is that disk corruption is just one possible kind of corruption. Really, we need a system that doesn't penalise validators for a cosmic ray flipping bits in their RAM. Having said that, it's not that I necessarily disagree with the concerns.
I don't think we will be working on this PR anymore.