sled
memory optimization opportunities, impositions for 1.0 data format
Fuzzy snapshots may be written in a streaming fashion to reduce memory overhead. When sled is configured with a small node cache size of 1mb and 10 million items are bulk-inserted sequentially (value size 7 bytes, key size 128 bytes), the pagetable snapshotting operation causes memory spikes that are significant relative to overall usage. Each snapshot is about 15mb on disk and is kicked off every 1 million link operations. Fuzzy snapshots could be generated far less spikily by streaming them to disk while walking the running pagetable.
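As a rough illustration of the streaming idea, here is a minimal sketch (the `PageEntry` record and field layout are hypothetical, not sled's actual snapshot format): each pagetable entry is serialized straight into a buffered writer as it is visited, so peak overhead is one record plus the write buffer rather than the whole materialized snapshot.

```rust
use std::io::{self, BufWriter, Write};

// Hypothetical per-page snapshot record; illustrative only.
struct PageEntry {
    pid: u64,
    lsn: i64,
    lid: u64,
}

// Stream each entry straight to the writer instead of building the
// whole snapshot in memory first.
fn stream_snapshot<W: Write>(
    entries: impl Iterator<Item = PageEntry>,
    out: W,
) -> io::Result<u64> {
    let mut w = BufWriter::new(out);
    let mut written = 0u64;
    for e in entries {
        w.write_all(&e.pid.to_le_bytes())?;
        w.write_all(&e.lsn.to_le_bytes())?;
        w.write_all(&e.lid.to_le_bytes())?;
        written += 1;
    }
    w.flush()?;
    Ok(written)
}

fn main() -> io::Result<()> {
    let entries = (0..3u64).map(|i| PageEntry { pid: i, lsn: i as i64, lid: i * 8 });
    let mut buf = Vec::new();
    let n = stream_snapshot(entries, &mut buf)?;
    assert_eq!(n, 3);
    // 3 records x (8 + 8 + 8) bytes each
    assert_eq!(buf.len(), 3 * 24);
    Ok(())
}
```

In a real implementation the writer would be the snapshot file itself, so the 82mb of `PageState` structs and the 18mb serialization buffer noted below never accumulate.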
--------------------------------------------------------------------------------
Command: ./target/release/stress2 --set-prop=100000000 --entries=100000000 --sequential --val-len=7 --key-len=128
--threads=1 --total-ops=10000000 --flush-every=0
Massif arguments: (none)
ms_print arguments: massif.out.2909473
--------------------------------------------------------------------------------
[ms_print memory graph elided: peak of 293.3 MB at roughly 252.8 Gi instructions, with a periodic sawtooth pattern spiking at each snapshot]
Breakdown at the final massif snapshot:
- 82mb from PageState structs created by the fuzzy snapshot thread - can be avoided by streaming
- 66mb from paged-out Page state on the pagetable - lots of room for optimization; it may only need to be lsn & lid (with a bit set to distinguish log from heap)
- 57mb of CacheInfo structs - can probably be completely avoided
- 18mb for the buffer that the snapshot will be serialized into - can be avoided by streaming
- 16mb for Lru entries (resizing the internal entries vec probably accounts for most of the LRU overhead)
- 14mb for BTreeSets storing pids on inactive segments (and segments being cleaned) - could be an ART for much better compression; maybe the radix_trie crate is good off the shelf
- 8mb for the pagetable child slabs
- 4mb for the pagetable parent slab
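To make the paged-out Page state item concrete, here is one way a location could collapse into a single tagged u64 (the bit layout is an assumption for illustration, not sled's actual on-disk or in-memory format): the top bit distinguishes log from heap, and the low 63 bits hold the offset.

```rust
// Sketch: pack a paged-out page's location into one u64.
// Top bit = heap tag, low 63 bits = offset. Layout is hypothetical.
const HEAP_TAG: u64 = 1 << 63;

#[derive(Debug, PartialEq)]
enum Location {
    Log(u64),
    Heap(u64),
}

fn swizzle(loc: &Location) -> u64 {
    match *loc {
        Location::Log(offset) => {
            assert!(offset < HEAP_TAG);
            offset
        }
        Location::Heap(offset) => {
            assert!(offset < HEAP_TAG);
            offset | HEAP_TAG
        }
    }
}

fn unswizzle(word: u64) -> Location {
    if word & HEAP_TAG == 0 {
        Location::Log(word)
    } else {
        Location::Heap(word & !HEAP_TAG)
    }
}

fn main() {
    for loc in [Location::Log(42), Location::Heap(7)] {
        // Round-trips losslessly while occupying only 8 bytes per page.
        assert_eq!(unswizzle(swizzle(&loc)), loc);
    }
}
```

Eight bytes per paged-out page would replace the current multi-field Page state, which is where most of the 66mb above could be reclaimed.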
Of these, the ones that annoy me the most are the Page & CacheInfo blobs. We can be far more efficient here. This only requires adding a back-ref to nodes in links, which is cheap and can be done quickly before 1.0; after that, this area can be optimized more aggressively post-1.0. 1.0 was delayed because of the fixed-stride node compression work, which has provided a great benefit for sequential workloads but put us a bit behind on the blocking storage changes.
so, before 1.0:
- [x] easy: batch lru vec resizes much more aggressively
- [x] rewrite fragmented nodes on page-out
- [x] add version to snapshots
- [ ] compress paged-out pagestate info to be a swizzled u64
- [x] refactor LRU to avoid slots for every page
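For the first checklist item, a minimal sketch of what "batching vec resizes aggressively" can look like (the `Lru` struct and `CHUNK` constant here are illustrative, not sled's actual LRU code): grow the backing vec in coarse chunks so that one-at-a-time inserts don't each pay for a reallocation.

```rust
// Grow the LRU's backing storage in coarse chunks to amortize
// reallocation spikes. CHUNK and the entry type are illustrative.
const CHUNK: usize = 4096;

struct Lru {
    entries: Vec<Option<u64>>, // slot -> pid, None if vacant
}

impl Lru {
    fn new() -> Self {
        Lru { entries: Vec::new() }
    }

    // Ensure `slot` is addressable, resizing in CHUNK-sized batches
    // so the next ~CHUNK inserts cost no further allocation.
    fn ensure_slot(&mut self, slot: usize) {
        if slot >= self.entries.len() {
            let new_len = (slot / CHUNK + 1) * CHUNK;
            self.entries.resize(new_len, None);
        }
    }

    fn insert(&mut self, slot: usize, pid: u64) {
        self.ensure_slot(slot);
        self.entries[slot] = Some(pid);
    }
}

fn main() {
    let mut lru = Lru::new();
    lru.insert(0, 10);
    lru.insert(5000, 20);
    // Length grew in two coarse steps rather than once per insert.
    assert_eq!(lru.entries.len(), 2 * CHUNK);
    assert_eq!(lru.entries[5000], Some(20));
}
```

The trade-off is transient over-allocation of up to one chunk, which is small next to the per-insert resize churn it removes; the last checklist item (avoiding a slot per page entirely) would reduce this structure further.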