Research Spike: LumberjackDB

cryptoquick opened this issue 4 years ago • 3 comments

Task summary

An Append-Only ChainStore could reduce how much data is stored within the RocksDB key-value store.

The idea is to append all data to a log file, similar to a Write-Ahead Log (WAL), but intended to be a persistent data store. That's why I call it an Append-Only Immutable Log (AOIL), and since it's used to back a ChainStore, it'd be an Append-Only ChainStore, or AOCS.

The AOILs would need to be split into separate files, both because we want more compact 32-bit offsets and because not all file systems support files many terabytes in size. We would also need to be careful that no record begins at an offset past 2^32 bytes: 4,294,967,296, or 4GB. The last record could extend beyond that point, making the AOIL file slightly longer than 4GB, as long as its starting offset does not exceed u32::MAX.

We'd use byte tuples, which we can call a Loc, to indicate places within the files, like this:

let (offset, limit, file): (u32, u32, u16) = loc;

This allows us to store up to (and maybe slightly over) 256 TiB: 2^16 files * 2^32 offsets, using Loc tuples at a constant 10 bytes in size.
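
A minimal sketch of packing a Loc into its constant 10-byte form and back (the function names are hypothetical; assumes little-endian encoding of the tuple fields):

type Loc = (u32, u32, u16); // (offset, limit, file)

fn pack_loc(loc: Loc) -> [u8; 10] {
    let (offset, limit, file) = loc;
    let mut buf = [0u8; 10];
    buf[0..4].copy_from_slice(&offset.to_le_bytes());
    buf[4..8].copy_from_slice(&limit.to_le_bytes());
    buf[8..10].copy_from_slice(&file.to_le_bytes());
    buf
}

fn unpack_loc(buf: [u8; 10]) -> Loc {
    let offset = u32::from_le_bytes(buf[0..4].try_into().unwrap());
    let limit = u32::from_le_bytes(buf[4..8].try_into().unwrap());
    let file = u16::from_le_bytes(buf[8..10].try_into().unwrap());
    (offset, limit, file)
}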

We could use separate AOIL files for keys and values. With them separated, the keys could also be omitted from the K/V store entirely, though this might have serious scaling concerns. It would still be possible if we needed it, as long as the files are scanned separately. Each key file could hold up to 2^32 / 10 entries: 429,496,729 keys.

I'm not sure how state pruning would work, but it'd likely require generating a new AOIL alongside the existing one. The Append-Only ChainStore's primary use would be optimizing our present use-case of snapshot imports, in addition to a full-node implementation with reduced write-amplification penalties.

Ideally this would offer an interface that's basically a drop-in replacement for a K/V store.
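
A hypothetical sketch of what that interface could look like (this is not Forest's actual blockstore trait):

// Hypothetical drop-in K/V interface for the AOCS.
pub trait KvStore {
    type Error;

    /// Look up the value stored under `key`, if any.
    fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>, Self::Error>;

    /// Append `value` to the AOIL and index its Loc under `key`.
    fn put(&mut self, key: &[u8], value: &[u8]) -> Result<(), Self::Error>;
}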

Memory-mapping might also provide some advantages here, but I'm unsure how well the address space would scale to the terabytes.
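
For instance, reading a record at a Loc through a memory map could look like this (a sketch using the memmap2 crate; the path handling is illustrative):

use std::fs::File;
use memmap2::MmapOptions;

// Illustrative: read the record at (offset, limit) from one AOIL file via mmap.
fn read_record(path: &str, offset: u32, limit: u32) -> std::io::Result<Vec<u8>> {
    let file = File::open(path)?;
    // Safety: the AOIL is append-only, so previously written bytes are never mutated.
    let mmap = unsafe { MmapOptions::new().map(&file)? };
    let start = offset as usize;
    Ok(mmap[start..start + limit as usize].to_vec())
}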

Sled might still be worth exploring for this use-case, since the index should be much smaller. Sled's write amplification begins to seriously impede its performance past 10GB.

cryptoquick avatar Jun 02 '21 13:06 cryptoquick

A similar design to take ideas from: https://riak.com/assets/bitcask-intro.pdf

creativcoder avatar Jun 02 '21 15:06 creativcoder

Eric mentioned the changes to the blockstore would be made here: https://github.com/ChainSafe/forest/tree/main/ipld/blockstore/src

cryptoquick avatar Jun 02 '21 16:06 cryptoquick

More thoughts on LumberjackDB:

LumberjackDB is a blockchain-optimized K/V DB with the goal of making archive nodes as efficient as possible to run.

  • Archive nodes could use many terabytes of storage
  • Sled KV-store with plaintext keys and values pointing to fixed-size byte offsets into a number of lumber.log files (each log file could be many terabytes in size and could be configured to be kept on separate disks)
    • Sled chosen for potential parallelism improvements, but RocksDB could also be investigated
    • Key: [u8]
    • Value: type LOC = (disk: u8, sector: u8, offset: u32, limit: u32), a constant 10 bytes (see the sketch after this list)
      • Can be split over multiple disks, assumed to be multiple terabytes in size
      • Each sector is only 4GB in size, and there can only be 256 sectors per file
      • A value can be no larger than the sector size, 4GB (do you have an idea of how large any record value could be, @ec2?)
      • Each log file will have a maximum size of 1TB (256 sectors * 4GB)
  • In-memory BTreeMap cache with LFU (Least Frequently Used) eviction (might also consider a more sophisticated data structure than BTreeMap, thoughts, @creativcoder?)
  • In-memory WAL (Write-Ahead Log) per sector -- in-memory because if there is power loss, it's a blockchain, so data loss is not generally a concern
  • Zstd dictionary compression, with each dictionary trained on the in-memory WAL (see the sketch after this list)
    • Each Zstd dictionary also kept in memory, in addition to being written to each lumber.log as the WAL sector is flushed from memory once it reaches a certain size (4GB)
    • Since compression is used, many log files might be much smaller than 1TB, and as such, when new logs are created, a disk storage space check will be needed
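
A minimal sketch of how the LOC index and the dictionary-compressed sector flush might fit together, using the sled and zstd crates (the struct and function names are hypothetical):

// Hypothetical 10-byte location record, matching the LOC tuple above.
struct Loc {
    disk: u8,
    sector: u8,
    offset: u32,
    limit: u32,
}

impl Loc {
    fn to_bytes(&self) -> [u8; 10] {
        let mut buf = [0u8; 10];
        buf[0] = self.disk;
        buf[1] = self.sector;
        buf[2..6].copy_from_slice(&self.offset.to_le_bytes());
        buf[6..10].copy_from_slice(&self.limit.to_le_bytes());
        buf
    }
}

// Index a key (e.g. a CID) against its location in the lumber.log files.
fn index_record(db: &sled::Db, key: &[u8], loc: &Loc) -> sled::Result<()> {
    db.insert(key, loc.to_bytes().to_vec())?;
    Ok(())
}

// Train a Zstd dictionary on the records buffered in the in-memory WAL,
// then compress each record with it as the sector is flushed.
fn flush_sector(wal_records: &[Vec<u8>]) -> std::io::Result<Vec<Vec<u8>>> {
    let dict = zstd::dict::from_samples(wal_records, 112_640)?; // ~110KB dictionary
    let mut compressor = zstd::bulk::Compressor::with_dictionary(3, &dict)?;
    wal_records.iter().map(|record| compressor.compress(record)).collect()
}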

Example internal records schema:

  • b"FLUSHED_RECORD_COUNT" = 21 bazillion
  • b"CURRENT_LOG_FILE" = 13
  • b"CURRENT_SECTOR" = 240
  • b"DICT_COUNT" = 500
  • b"DICT_0" = LOC
  • ...
  • b"DICT_499" = LOC

Example external records schema:

  • b"someCID" = LOC
  • ...

Internal records are kept in their own separate database to prevent collisions.
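
With sled, one way to get that separation is a dedicated tree (namespace) per record class, for example:

// Illustrative: keep internal bookkeeping records in their own sled tree
// so they can never collide with external keys such as CIDs.
fn open_stores(path: &str) -> sled::Result<(sled::Db, sled::Tree)> {
    let db = sled::open(path)?; // default tree holds the external records
    let internal = db.open_tree("internal")?; // holds b"CURRENT_SECTOR", etc.
    Ok((db, internal))
}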

Example config:

[lumberjack]
enabled = true
mem_cache_gb = 4.0
[[lumberjack.disk]]
path = "/run/media/disk01"
size_tb = 7.68
[[lumberjack.disk]]
path = "/run/media/disk02"
size_tb = 7.68

Each disk must always be defined in the same order; otherwise, the byte offsets won't match. When the daemon starts, each disk is checked for log files and size, and each dictionary is loaded into memory.
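
Loading that config might look like this (a sketch using serde and the toml crate; the struct names are assumptions, but the fields mirror the example above):

use serde::Deserialize;

#[derive(Deserialize)]
struct Config {
    lumberjack: LumberjackConfig,
}

#[derive(Deserialize)]
struct LumberjackConfig {
    enabled: bool,
    mem_cache_gb: f64,
    // Order matters: byte offsets are resolved against this disk ordering.
    disk: Vec<DiskConfig>,
}

#[derive(Deserialize)]
struct DiskConfig {
    path: String,
    size_tb: f64,
}

fn load_config(raw: &str) -> Result<Config, toml::de::Error> {
    toml::from_str(raw)
}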

Thoughts @creativcoder @ec2 ? Also, CC @goller, since he taught me everything I know about databases

cryptoquick avatar Jul 04 '21 14:07 cryptoquick

Closing as too speculative.

lemmih avatar Oct 25 '22 08:10 lemmih