
Triedent Refactor


This pull request starts the code review process; it is not yet ready to be merged because it has not been tested with psibase integration.

The primary changes are in the following areas:

  1. removing the cache_allocator, object_db, ring_allocator, region_allocator, and gc
  2. adding a new block_allocator, id_allocator and seg_allocator

It maintains the existing database API, so it shouldn't require any major changes to the rest of psibase.

Motivation

The ring buffer system was a fixed-size cache which required a lot of pinned memory. Under heavy load, especially once data no longer fit in RAM, the old system would have the write thread waiting on the background thread, which in turn was waiting on the read threads. Transaction rates fell very low and the majority of the time was spent waiting on mutexes. There was no good way to know how to size the ring buffers, which meant that the region allocator did most of the heavy lifting.

The old system was fragile, requiring sessions to unlock on certain allocations and invalidating cached reads. Aside from the pinning of Hot/Warm, there was no good way to tell the OS how to page. To make matters worse, the hot rings were filled with mostly dead data caused by the churn of allocating and freeing. It took a long time for the ring allocator to get around to reusing that RAM, wasting scarce pinned pages.

Results

The code in this branch can sustain 2M random reads from 4 threads while doing 200k random writes on a database that is 272GB with 22GB of IDs holding 338M records. The vast majority of segments end up being 99.9% full, with limited wasted space. At the end of the insertion of 272GB there were only 6GB of segments ready to be reused, and a large part of that was each of the 6 threads' personal 128MB write segments. Overall, less than 5% of space was wasted. Future updates could easily trim the database down in size if there were too many empty segments. This was on an M3 MacBook Pro with 128GB of RAM.

After creating that large database, I was able to perform 3.8M sequential inserts per second from a single thread, followed by 6M sequential queries per second. I could update sequential keys at 5M keys per second. Single-threaded random inserts achieved over 350k per second.

Block Allocator

Allocates data in chunks of 128MB (configurable at compile time). Chunks have independent mmap address ranges, so new chunks can be allocated without having to remap the entire file. Responsible for converting a "location" in a logical range into a segment/offset and resolving the pointer.
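The location-to-pointer mapping can be pictured roughly as follows. This is a minimal sketch assuming a simple divide/modulo split over fixed 128MB blocks; the class layout and member names are illustrative, not the actual triedent implementation.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch only: not the actual triedent block_allocator.
class block_allocator
{
  public:
   // 128MB chunk size, configurable at compile time in the real code.
   static constexpr uint64_t block_size = 128ull * 1024 * 1024;

   // Convert a logical "location" into a pointer inside one mmap'ed chunk.
   void* resolve(uint64_t location) const
   {
      uint64_t block  = location / block_size;  // which independently mmap'ed chunk
      uint64_t offset = location % block_size;  // offset within that chunk
      return static_cast<char*>(_blocks[block]) + offset;
   }

  private:
   // Base address of each chunk; growing the file appends a new mapping
   // instead of remapping the whole file.
   std::vector<void*> _blocks;
};
```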

ID Allocator

Uses the block allocator to reserve space for a growing ID database and mlocks the blocks it provides. Responsible for allocating new IDs in a thread-safe manner and for recycling unused IDs using a linked list similar to the old version's.
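A minimal sketch of the allocate/recycle idea, assuming the free list is threaded through the ID table itself as in the old version. All names here are hypothetical, and the real code may order operations differently (for example, to deal with ABA, which this sketch ignores).

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch only; ignores ABA and error handling for brevity.
struct id_allocator_sketch
{
   std::atomic<uint64_t>* table;         // ID table backed by mlock'ed blocks
   std::atomic<uint64_t>  free_head{0};  // head of the recycled-ID linked list (0 = empty)
   std::atomic<uint64_t>  next_unused{1};

   uint64_t alloc()
   {
      // Prefer popping a recycled ID from the free list.
      uint64_t head = free_head.load(std::memory_order_acquire);
      while (head != 0)
      {
         uint64_t next = table[head].load(std::memory_order_relaxed);
         if (free_head.compare_exchange_weak(head, next, std::memory_order_acq_rel))
            return head;
      }
      // Free list empty: grow the ID range.
      return next_unused.fetch_add(1, std::memory_order_relaxed);
   }

   void release(uint64_t id)
   {
      // Push the ID back onto the free list, linking through the table slot.
      uint64_t head = free_head.load(std::memory_order_relaxed);
      do
      {
         table[id].store(head, std::memory_order_relaxed);
      } while (!free_head.compare_exchange_weak(head, id, std::memory_order_release));
   }
};
```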

Seg Allocator

This is the workhorse that builds on the block allocator and ID allocator to allocate large segments when any thread needs a new place to write. The segments do not use mlock; instead, madvise is used to tune paging based on whether the segment is being used for allocation or being compacted, and it can factor in other things such as object density.
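One plausible shape for that madvise hinting, purely as an illustration; the actual flags and heuristics used by the seg_allocator are not spelled out in this description.

```cpp
#include <cstddef>
#include <sys/mman.h>

// Illustrative only: hint the kernel about a segment's paging behavior
// depending on whether it is currently an active write target or has just
// been compacted. The real heuristics may also weigh object density.
void hint_segment(void* base, std::size_t seg_size, bool active_write_target)
{
   if (active_write_target)
      madvise(base, seg_size, MADV_WILLNEED);  // keep the segment resident while allocating into it
   else
      madvise(base, seg_size, MADV_DONTNEED);  // compacted away; its pages can be reclaimed
}
```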

The seg_allocator implements sessions, which allow a thread to request a read_lock that prevents the allocator from reusing a segment. Requests to access data can only be made via the read_lock, which returns an object_ref.
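A stand-in sketch of that access pattern. The real session, read_lock, and object_ref types belong to the seg_allocator; the trivial versions below only illustrate the RAII shape described above, and their signatures are assumptions.

```cpp
#include <cstdint>

// Stand-in types: the real object_ref / read_lock / session belong to the
// seg_allocator; these trivial versions only illustrate the access pattern.
struct object_ref
{
   void* data = nullptr;  // real version resolves an ID to its current location
};

struct read_lock
{
   // While a read_lock is alive, the compactor must not reuse any segment
   // the session might be reading from.
   object_ref get(uint64_t /*id*/) { return {}; }
};

struct session
{
   read_lock lock() { return {}; }  // all data access goes through the returned read_lock
};

void example(session& s, uint64_t id)
{
   auto rl  = s.lock();
   auto ref = rl.get(id);  // object_ref for the requested ID
   // ... read through ref; once rl is destroyed, segments may be recycled ...
}
```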

Testing

The code was mostly tested via programs/tdb.cpp, and it was built with ThreadSanitizer to remove all detectable data races.

Design

[design diagram attached to the PR]

bytemaster · Dec 09, 2023