firewood
firewood copied to clipboard
Research: Don't serialize or deserialize nodes
To read data, we currently read nodes (starting at the root) and for each node we construct a BranchNode object. Then we only read one child address out of that. It would be more efficient to read the node serialized and have a method that figures out from the serialized object what the child address is at a given offset.
One issue with this today is that the node is compacted very tightly. It might make sense to relax this compaction some so that we don't have to pay the cost of serializing. For example, the branch nodes could have the list of children as an array at the end rather than packing empty items together.
It might also be good for the actual data for the node to be in a Cow which could facilitate using memory mapped files for nodes stored on disk and a heap arena for nodes stored in memory.
Suppose we stored node metadata and node values (and perhaps even node paths) as separate objects in the object store. We can then define a fixed allocation for all nodes. This does add a separate read in order to fetch the overflowing path or actual value; but that might be a good thing as we will no longer be reading and deserializing the value unnecessarily. We could then keep track of duplicate values by their hash and then reuse the storage for values with duplicate hashes. Because nodes will have a fixed size, we can quickly index to the specific 8 bytes for the address we are looking for. With branch factor of 16, it does not appear worth while to handle a dynamic vector for pointers to children but if we increase the branch factor then the extra overhead would be worthwhile.
const BRANCH_FACTORS: [u32; 8] = [
1 << 1,
1 << 2,
1 << 3,
1 << 4,
1 << 5,
1 << 6,
1 << 7,
1 << 8,
];
fn main() {
for branch_factor in BRANCH_FACTORS {
let bitmap_size = branch_factor.div_ceil(8);
let array_size = branch_factor * 8;
let total_size = bitmap_size.min(8) // pad to alignment
+ 8 // for inline path (<=63 bits)
// or overflow to 63-bit pointer
// which is sufficient for the majority of platforms
// as only 48-bits are typically used for addressing
+ 8 // for value pointer
+ array_size // for children pointers
+ 32 // for branch hash
;
println!("branch factor = {branch_factor}");
println!("bitmap size = {bitmap_size}");
println!("array size = {array_size}");
println!("branch total = {total_size}");
println!();
}
}
branch factor = 2 bitmap size = 1 array size = 16 branch total = 65
branch factor = 4 bitmap size = 1 array size = 32 branch total = 81
branch factor = 8 bitmap size = 1 array size = 64 branch total = 113
branch factor = 16 bitmap size = 2 array size = 128 branch total = 178
branch factor = 32 bitmap size = 4 array size = 256 branch total = 308
branch factor = 64 bitmap size = 8 array size = 512 branch total = 568
branch factor = 128 bitmap size = 16 array size = 1024 branch total = 1080
branch factor = 256 bitmap size = 32 array size = 2048 branch total = 2104