
memory optimization opportunities, impositions for 1.0 data format

Open · spacejam opened this issue on Jan 22, 2021 · 0 comments

Fuzzy snapshots may be written in a streaming fashion to reduce memory overhead. When the node cache size is configured to a small 1mb and 10 million items are bulk-inserted sequentially into the db with 7-byte values and 128-byte keys, the pagetable snapshotting operation causes memory spikes that are significant relative to overall usage. Each snapshot, kicked off every 1 million link operations, is about 15mb when written to disk. Fuzzy snapshots can be generated in a much less spiky way by streaming them to disk while they are built from the running pagetable.
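A minimal sketch of the streaming idea, assuming a simplified stand-in for the per-page snapshot record (the `PageState` fields and encoding below are illustrative only, not sled's actual types or on-disk format): instead of accumulating all page states and then serializing one large buffer, each record is encoded and pushed through a buffered writer as it is produced, so peak memory stays around one record plus the writer's buffer.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};

/// Illustrative stand-in for the per-page snapshot record;
/// sled's real PageState carries more than this.
struct PageState {
    pid: u64,
    lsn: i64,
    lid: u64,
}

impl PageState {
    /// Fixed-width little-endian encoding, purely for the sketch.
    fn write_to<W: Write>(&self, w: &mut W) -> io::Result<()> {
        w.write_all(&self.pid.to_le_bytes())?;
        w.write_all(&self.lsn.to_le_bytes())?;
        w.write_all(&self.lid.to_le_bytes())
    }
}

/// Stream records straight into the snapshot file as they are produced,
/// instead of collecting them in memory and serializing one big buffer.
/// Peak memory is roughly one record plus the BufWriter's buffer.
fn write_snapshot<I>(path: &str, pages: I) -> io::Result<()>
where
    I: IntoIterator<Item = PageState>,
{
    let mut out = BufWriter::new(File::create(path)?);
    for page in pages {
        page.write_to(&mut out)?;
    }
    out.flush()
}
```

Writing to a temporary file and renaming it into place on completion would preserve all-or-nothing snapshot semantics. For reference, the massif profile from the stress run described above follows.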

--------------------------------------------------------------------------------
Command:            ./target/release/stress2 --set-prop=100000000 --entries=100000000 --sequential --val-len=7 --key-len=128 --threads=1 --total-ops=10000000 --flush-every=0
Massif arguments:   (none)
ms_print arguments: massif.out.2909473
--------------------------------------------------------------------------------


    MB
293.3^                                                                       #
     |                                                                       #
     |                                                               @@      #
     |                                                               @       #
     |                                                       @@      @       #
     |                                                       @       @       #
     |                                                       @       @       #
     |                                                       @       @     @:#
     |                                                       @      :@ ::::@:#
     |                                                       @ ::::@:@ : ::@:#
     |                                                  :::@:@ : ::@:@ : ::@:#
     |                               @            ::::::: :@:@ : ::@:@ : ::@:#
     |                               @     ::::::@:: :::: :@:@ : ::@:@ : ::@:#
     |                       @@      @   :::: :: @:: :::: :@:@ : ::@:@ : ::@:#
     |                       @       @:::@ :: :: @:: :::: :@:@ : ::@:@ : ::@:#
     |                       @ @@::::@:: @ :: :: @:: :::: :@:@ : ::@:@ : ::@:#
     |                 :::@@:@ @ ::  @:: @ :: :: @:: :::: :@:@ : ::@:@ : ::@:#
     |              @::: :@ :@ @ ::  @:: @ :: :: @:: :::: :@:@ : ::@:@ : ::@:#
     |       :::@@@@@: : :@ :@ @ ::  @:: @ :: :: @:: :::: :@:@ : ::@:@ : ::@:#
     |  :::::: :@@  @: : :@ :@ @ ::  @:: @ :: :: @:: :::: :@:@ : ::@:@ : ::@:#
   0 +----------------------------------------------------------------------->Gi
     0                                                                   252.8

Breakdown at the final massif snapshot:

  • 82mb from PageState structs created by the fuzzy snapshot thread - can be avoided by streaming
  • 66mb from paged-out Page state on the pagetable - lots of room for optimization; this probably only needs to be lsn & lid, with a bit set to distinguish log from heap (see the packed-u64 sketch after this list)
  • 57mb of CacheInfo structs - can probably be completely avoided
  • 18mb for the buffer that will be used to write the snapshot in serialize - can be avoided by streaming
  • 16mb for Lru Entries (resizing the internal entries vec probably accounts for most of the LRU overhead)
  • 14mb for BTreeSets storing pids on inactive segments (and segments being cleaned) - an ART would compress these far better; the radix_trie crate may be a good off-the-shelf fit
  • 8mb for the pagetable child slabs
  • 4mb for the pagetable parent slab
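For the paged-out Page state and CacheInfo items above, a rough sketch of the "swizzled u64" direction (the field layout and names are assumptions for illustration, not the actual format): the location of a paged-out page can be packed into a single word, with one bit distinguishing a log offset from a heap slot.

```rust
/// Hypothetical packed form of a paged-out page's location: the top bit
/// says whether the page lives in heap storage or in the log, and the
/// remaining 63 bits hold the log offset / heap slot.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct PackedLocation(u64);

const HEAP_BIT: u64 = 1 << 63;

impl PackedLocation {
    fn in_log(offset: u64) -> PackedLocation {
        assert_eq!(offset & HEAP_BIT, 0, "offset must fit in 63 bits");
        PackedLocation(offset)
    }

    fn in_heap(slot: u64) -> PackedLocation {
        assert_eq!(slot & HEAP_BIT, 0, "slot must fit in 63 bits");
        PackedLocation(slot | HEAP_BIT)
    }

    fn is_heap(self) -> bool {
        self.0 & HEAP_BIT != 0
    }

    fn value(self) -> u64 {
        self.0 & !HEAP_BIT
    }
}
```

With something along these lines, a paged-out page needs only this word (plus its lsn, if that still has to be kept separately) rather than full PageState and CacheInfo allocations.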

Of these, the ones that annoy me the most are the Page & CacheInfo blobs. We can be far more efficient here. It only requires adding a back-ref to nodes in links, which is cheap and can be done quickly before 1.0; the representation can then be optimized more aggressively post-1.0. 1.0 was delayed by the fixed-stride node compression work, which has provided a great benefit for sequential workloads but put us a bit behind on the blocking storage changes.

so, before 1.0:

  • [x] easy: batch lru vec resizes much more aggressively (see the sketch after this checklist)
  • [x] rewrite fragmented nodes on page-out
  • [x] add version to snapshots
  • [ ] compress paged-out pagestate info to be a swizzled u64
  • [x] refactor LRU to avoid slots for every page
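On the LRU resize batching item, a generic sketch of the idea (the `Entry` type and chunk size are placeholders, not sled's LRU internals): grow the entries vec to chunk boundaries so that most calls become no-ops and the vec is extended and default-filled once per chunk of page ids rather than once per newly seen page.

```rust
/// Placeholder for an LRU bookkeeping entry; sled's real entry holds more.
#[derive(Clone, Default)]
struct Entry {
    size: u64,
}

/// Chunk size for batched growth; a placeholder value.
const RESIZE_BATCH: usize = 1 << 16;

/// Ensure `entries[idx]` exists, growing in coarse chunks rather than
/// one slot at a time.
fn ensure_slot(entries: &mut Vec<Entry>, idx: usize) {
    if idx >= entries.len() {
        // round the required length up to the next chunk boundary
        let new_len = (idx / RESIZE_BATCH + 1) * RESIZE_BATCH;
        entries.resize(new_len, Entry::default());
    }
}
```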
