
memory optimization opportunities, impositions for 1.0 data format

Open · spacejam opened this issue on Jan 22, 2021 · 0 comments

Fuzzy snapshots may be written in a streaming fashion to reduce memory overhead. When the node cache size is configured to a small 1mb and 10 million items are bulk-inserted sequentially into the db with 7-byte values and 128-byte keys, the pagetable snapshotting operation causes memory spikes that are significant relative to overall usage. Each snapshot, kicked off every 1 million link operations, is about 15mb when written to disk. Fuzzy snapshots can be generated in a much less spiky way by streaming them to disk while they are built from the running pagetable.
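A minimal sketch of the streaming idea, assuming a simplified stand-in for the per-page snapshot record (the `PageState` fields and encoding below are illustrative only, not sled's actual types or on-disk format): instead of accumulating all page states and then serializing one large buffer, each record is encoded and pushed through a buffered writer as it is produced, so peak memory stays around one record plus the writer's buffer.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};

/// Illustrative stand-in for the per-page snapshot record;
/// sled's real PageState carries more than this.
struct PageState {
    pid: u64,
    lsn: i64,
    lid: u64,
}

impl PageState {
    /// Fixed-width little-endian encoding, purely for the sketch.
    fn write_to<W: Write>(&self, w: &mut W) -> io::Result<()> {
        w.write_all(&self.pid.to_le_bytes())?;
        w.write_all(&self.lsn.to_le_bytes())?;
        w.write_all(&self.lid.to_le_bytes())
    }
}

/// Stream records straight into the snapshot file as they are produced,
/// instead of collecting them in memory and serializing one big buffer.
/// Peak memory is roughly one record plus the BufWriter's buffer.
fn write_snapshot<I>(path: &str, pages: I) -> io::Result<()>
where
    I: IntoIterator<Item = PageState>,
{
    let mut out = BufWriter::new(File::create(path)?);
    for page in pages {
        page.write_to(&mut out)?;
    }
    out.flush()
}
```

Writing to a temporary file and renaming it into place on completion would preserve all-or-nothing snapshot semantics. For reference, the massif profile from the stress run described above follows.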

--------------------------------------------------------------------------------
Command:            ./target/release/stress2 --set-prop=100000000 --entries=100000000 --sequential --val-len=7 --key-len=128 --threads=1 --total-ops=10000000 --flush-every=0
Massif arguments:   (none)
ms_print arguments: massif.out.2909473
--------------------------------------------------------------------------------


    MB
293.3^                                                                       #
     |                                                                       #
     |                                                               @@      #
     |                                                               @       #
     |                                                       @@      @       #
     |                                                       @       @       #
     |                                                       @       @       #
     |                                                       @       @     @:#
     |                                                       @      :@ ::::@:#
     |                                                       @ ::::@:@ : ::@:#
     |                                                  :::@:@ : ::@:@ : ::@:#
     |                               @            ::::::: :@:@ : ::@:@ : ::@:#
     |                               @     ::::::@:: :::: :@:@ : ::@:@ : ::@:#
     |                       @@      @   :::: :: @:: :::: :@:@ : ::@:@ : ::@:#
     |                       @       @:::@ :: :: @:: :::: :@:@ : ::@:@ : ::@:#
     |                       @ @@::::@:: @ :: :: @:: :::: :@:@ : ::@:@ : ::@:#
     |                 :::@@:@ @ ::  @:: @ :: :: @:: :::: :@:@ : ::@:@ : ::@:#
     |              @::: :@ :@ @ ::  @:: @ :: :: @:: :::: :@:@ : ::@:@ : ::@:#
     |       :::@@@@@: : :@ :@ @ ::  @:: @ :: :: @:: :::: :@:@ : ::@:@ : ::@:#
     |  :::::: :@@  @: : :@ :@ @ ::  @:: @ :: :: @:: :::: :@:@ : ::@:@ : ::@:#
   0 +----------------------------------------------------------------------->Gi
     0                                                                   252.8

Breakdown at the final massif snapshot:

  • 82mb from PageState structs created by the fuzzy snapshot thread - can be avoided by streaming
  • 66mb from paged-out Page state on the pagetable - lots of room for optimization; this probably only needs to be lsn & lid, with a bit set to distinguish log from heap (see the packed-u64 sketch after this list)
  • 57mb of CacheInfo structs - can probably be completely avoided
  • 18mb for the buffer that will be used to write the snapshot in serialize - can be avoided by streaming
  • 16mb for Lru Entries (resizing the internal entries vec probably accounts for most of the LRU overhead)
  • 14mb for BTreeSets storing pids on inactive segments (and segments being cleaned) - an ART would compress these far better; the radix_trie crate may be a good off-the-shelf fit
  • 8mb for the pagetable child slabs
  • 4mb for the pagetable parent slab
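For the paged-out Page state and CacheInfo items above, a rough sketch of the "swizzled u64" direction (the field layout and names are assumptions for illustration, not the actual format): the location of a paged-out page can be packed into a single word, with one bit distinguishing a log offset from a heap slot.

```rust
/// Hypothetical packed form of a paged-out page's location: the top bit
/// says whether the page lives in heap storage or in the log, and the
/// remaining 63 bits hold the log offset / heap slot.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct PackedLocation(u64);

const HEAP_BIT: u64 = 1 << 63;

impl PackedLocation {
    fn in_log(offset: u64) -> PackedLocation {
        assert_eq!(offset & HEAP_BIT, 0, "offset must fit in 63 bits");
        PackedLocation(offset)
    }

    fn in_heap(slot: u64) -> PackedLocation {
        assert_eq!(slot & HEAP_BIT, 0, "slot must fit in 63 bits");
        PackedLocation(slot | HEAP_BIT)
    }

    fn is_heap(self) -> bool {
        self.0 & HEAP_BIT != 0
    }

    fn value(self) -> u64 {
        self.0 & !HEAP_BIT
    }
}
```

With something along these lines, a paged-out page needs only this word (plus its lsn, if that still has to be kept separately) rather than full PageState and CacheInfo allocations.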

Of these, the ones that annoy me the most are the Page & CacheInfo blobs. We can be far more efficient here. It only requires adding a back-ref to nodes in links, which is cheap and can be done quickly before 1.0; the representation can then be optimized more aggressively post-1.0. 1.0 was delayed by the fixed-stride node compression work, which has provided a great benefit for sequential workloads but put us a bit behind on the blocking storage changes.

so, before 1.0:

  • [x] easy: batch lru vec resizes much more aggressively (see the sketch after this checklist)
  • [x] rewrite fragmented nodes on page-out
  • [x] add version to snapshots
  • [ ] compress paged-out pagestate info to be a swizzled u64
  • [x] refactor LRU to avoid slots for every page
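On the LRU resize batching item, a generic sketch of the idea (the `Entry` type and chunk size are placeholders, not sled's LRU internals): grow the entries vec to chunk boundaries so that most calls become no-ops and the vec is extended and default-filled once per chunk of page ids rather than once per newly seen page.

```rust
/// Placeholder for an LRU bookkeeping entry; sled's real entry holds more.
#[derive(Clone, Default)]
struct Entry {
    size: u64,
}

/// Chunk size for batched growth; a placeholder value.
const RESIZE_BATCH: usize = 1 << 16;

/// Ensure `entries[idx]` exists, growing in coarse chunks rather than
/// one slot at a time.
fn ensure_slot(entries: &mut Vec<Entry>, idx: usize) {
    if idx >= entries.len() {
        // round the required length up to the next chunk boundary
        let new_len = (idx / RESIZE_BATCH + 1) * RESIZE_BATCH;
        entries.resize(new_len, Entry::default());
    }
}
```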
