redb icon indicating copy to clipboard operation
redb copied to clipboard

RFC: File format

Open cberner opened this issue 3 years ago • 3 comments
trafficstars

The file format is documented here and I'm working toward stabilizing it. Please comment with any suggestions for improving it!

cberner avatar Jul 31 '22 00:07 cberner

Definitely ping me when it's documented/started to be documented. I won't have any comments on the btree stuff, but I might have thoughts on general file-formaty concerns.

casey avatar Aug 02 '22 18:08 casey

When a write transaction frees a page it is pushed into a queue, and only reused after all read transactions that could reference it have completed.

  1. I was wondering why it is a list, not a tree. Wouldn't a tree help the database to be smaller? The fact that redb must use consecutive pages, restricts the number of pages that one can use in the free list. If it were an ordered tree, it would be faster to find consecutive free pages to use?

  2. I was also wondering about the way key-value is stored, is the length prepended to the key and values?

  3. Would it be possible to only define the size of the key once by database and therefore store fixed-size types without length overhead? What is the layout of data pages?

  4. How is the BTree designed?

Kerollmops avatar Oct 03 '22 20:10 Kerollmops

1. I was wondering why it is a list, not a tree. Wouldn't a tree help the database to be smaller? The fact that redb must use consecutive pages, restricts the number of pages that one can use in the free list. If it were an ordered tree, it would be faster to find consecutive free pages to use?

The queue is implemented with a b-tree, so yes you're totally right that a tree is better and that's how it works! :) That tree only tracks pending pages that can be freed in the future -- not the currently free pages. Allocations for new data use a different b-tree implementation of a buddy allocator which is briefly described in this part of the design doc

2. I was also wondering about the way key-value is stored, is the length prepended to the key and values?

It's stored in a separate array of offsets, but yes it's effectively prepended. The format of the pages is described here.

3. Would it be possible to only define the size of the key once by database and therefore store fixed-size types without length overhead? What is the layout of data pages?

Yes! That's already implemented for fixed size types (see the link above)

4. How is the BTree designed?

The format of the leaf & internal nodes is described in the link above. Lemme know if you have more questions about

cberner avatar Oct 04 '22 01:10 cberner

Not a comment on the file format itself (looks great and fantastic documentation!), however, I have a general question about file format road map: are you thinking long-term support for direct storage to block devices? I'm working on a SSD/NVMe storage subsystem using the new NVMe Zoned Namespace spec and am evaluating KV implementations that could target direct access to zoned block storage devices. Redb looks well suited to a traditional block device implementation as well as zoned storage, and happy to contribute ideas/code if that's desired!

kpwebb avatar Nov 14 '22 15:11 kpwebb

I haven't given it serious thought yet and don't know very much about NVMe Zoned Namespaces. I'd be open to considering it though, if you want to write up a proposal for how it'd work. One thing I'm averse to is adding dependencies, especially complex ones. Does zoned storage require linking with any big libraries?

cberner avatar Nov 14 '22 22:11 cberner

@cberner super cool! I'll start investigating and will put together a PR. There shouldn't be anything significant in the way of dependencies. I'm building a Rust-native library for ZNS interaction that's very lightweight -- it's all through IOCTLs and normal IO operations. At the core they're just block devices divided into 1GB zones that only allow sequential writes the lib just keeps track of block status and triggers garbage collection when a block needs to be cleared. The way the blocks work should align nicely with your approach to paging.

However ZNS aside, the bigger question is do you have a point of view on working with block devices? I'll probably start there with a PR. Just getting a block device model in place is 90% of what's needed to support zns.

kpwebb avatar Nov 14 '22 22:11 kpwebb

I don't have any experience working with block devices, but as long as the implementation is portable, reasonably simple, and also provides a significant performance improvement then it seems interesting to explore. Do you have a sense of how much faster it could reasonably be expected to perform?

cberner avatar Nov 15 '22 18:11 cberner

Oh, and how does it interface with mmap? redb relies heavily on mmap

cberner avatar Nov 15 '22 18:11 cberner

Cool! mmap should be no problem for reads, but need to look at how writes are handled. The big difference with ZNS is that it only allows sequential/append writes within a given zone (zones are then garbage collected in a single operation and reused). I think that should work with how you're doing things here, but may require from tweaks in the page allocation code to order things correctly.

Regardless, I'll start with some benchmarking code. Block devices (ZNS or otherwise) should offer several benefits over file system data storage (speed as well as latency determinism) but from what I've read about RocksDB/Aerospike etc tuning this stuff is highly application/data specific so having a framework for measurement is crucial. I'll put together a PR for that first and use it as a tool to evaluate other potential changes.

Thanks!

kpwebb avatar Nov 16 '22 14:11 kpwebb