tantivy document memory experiment
Some tests of the memory consumption of TantivyDocument.
Experiment
Parse each line of a dataset and store all resulting documents in a Vec.
hdfs: 3 fields (timestamp, body, severity), raw dataset 22MB
gh: all fields in a single JSON field (dynamic mode), raw dataset 2.3MB
Note: In hdfs, the root-level fields are stored as a Field id instead of a string.
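The Peak Memory figures under Results need some way to observe heap usage while the documents are held in the Vec. These notes do not show how that was done; one common approach is to wrap the system allocator and track the high-water mark. The following is a minimal sketch under that assumption. `PeakAlloc`, `peak_bytes` and `reset_peak` are hypothetical names, and a `Vec<String>` stands in for the parsed documents.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical peak-tracking allocator; not taken from the original experiment.
struct PeakAlloc;

static CURRENT: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for PeakAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            // Count the new block and raise the high-water mark if needed.
            let now = CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: PeakAlloc = PeakAlloc;

/// Highest number of live heap bytes seen since the last reset.
fn peak_bytes() -> usize {
    PEAK.load(Ordering::Relaxed)
}

/// Restart peak tracking from the current live byte count.
fn reset_peak() {
    PEAK.store(CURRENT.load(Ordering::Relaxed), Ordering::Relaxed);
}

fn main() {
    reset_peak();
    let base = peak_bytes();
    // Stand-in for "parse every line and keep all documents in a Vec".
    let docs: Vec<String> = (0..1_000).map(|i| format!("line {}", i)).collect();
    println!("Peak Memory {} : {:?}", peak_bytes() - base, "demo Vec<String>");
    drop(docs);
}
```

Resetting the peak before each variant keeps one variant's measurement from leaking into the next.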
Variant1: TantivyDocumentMedVec
Replace the Vec in OwnedValue with 32-bit versions of Vec, and drop the Facet and PreTokStr variants.
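To illustrate why a 32-bit vector is smaller, the sketch below compares the header of a standard `Vec` with a hypothetical header that keeps the pointer but stores length and capacity as `u32`. `Vec32Header` is an illustrative stand-in only; the actual layout of `mediumvec::Vec32` may differ. The sizes in the comments assume a 64-bit target.

```rust
use std::mem::size_of;

// Hypothetical stand-in for a 32-bit vector header: pointer + u32 length + u32 capacity.
// Not the real mediumvec::Vec32 definition.
#[allow(dead_code)]
struct Vec32Header {
    ptr: *mut u8,
    len: u32,
    cap: u32,
}

fn main() {
    // On a 64-bit target, Vec holds a pointer, a usize length and a usize capacity.
    println!("Vec<u8>:     {} bytes", size_of::<Vec<u8>>());
    // Two u32 fields fit into the space of one usize, which saves 8 bytes per vector.
    println!("Vec32Header: {} bytes", size_of::<Vec32Header>());
}
```

The saving applies to every vector-bearing value in a document, which is why it adds up across a whole dataset.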
Variant2: DocContainerRef
All nodes store their data in 2 shared vecs and only reference a position there.
#[derive(Default)]
struct OwnedValueRefContainer {
    nodes: mediumvec::Vec32<ValueContainerRef>,
    node_data: mediumvec::Vec32<u8>,
}
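A rough sketch of the idea, assuming a node is a type tag plus an offset into the shared byte buffer. `NodeRef`, `RefContainer` and their methods are hypothetical and much simpler than the real `ValueContainerRef`; a standard `Vec` stands in for `mediumvec::Vec32` so the example has no dependencies.

```rust
// Hypothetical node: a type tag plus a position (and length) in the shared byte buffer.
#[derive(Clone, Copy, Debug, PartialEq)]
enum NodeRef {
    Str { start: u32, len: u32 },
    U64 { start: u32 },
}

#[derive(Default)]
struct RefContainer {
    nodes: Vec<NodeRef>, // the original uses mediumvec::Vec32 here
    node_data: Vec<u8>,  // all payload bytes of the document, back to back
}

impl RefContainer {
    /// Append a string payload and return the index of its node.
    fn push_str(&mut self, s: &str) -> usize {
        let start = self.node_data.len() as u32;
        self.node_data.extend_from_slice(s.as_bytes());
        self.nodes.push(NodeRef::Str { start, len: s.len() as u32 });
        self.nodes.len() - 1
    }

    /// Append a u64 payload (little endian) and return the index of its node.
    fn push_u64(&mut self, v: u64) -> usize {
        let start = self.node_data.len() as u32;
        self.node_data.extend_from_slice(&v.to_le_bytes());
        self.nodes.push(NodeRef::U64 { start });
        self.nodes.len() - 1
    }

    fn get_str(&self, idx: usize) -> Option<&str> {
        match *self.nodes.get(idx)? {
            NodeRef::Str { start, len } => {
                let s = start as usize;
                let bytes = self.node_data.get(s..s + len as usize)?;
                std::str::from_utf8(bytes).ok()
            }
            _ => None,
        }
    }

    fn get_u64(&self, idx: usize) -> Option<u64> {
        match *self.nodes.get(idx)? {
            NodeRef::U64 { start } => {
                let s = start as usize;
                let mut buf = [0u8; 8];
                buf.copy_from_slice(self.node_data.get(s..s + 8)?);
                Some(u64::from_le_bytes(buf))
            }
            _ => None,
        }
    }
}

fn main() {
    let mut doc = RefContainer::default();
    let body = doc.push_str("Receiving block blk_1");
    let ts = doc.push_u64(1_700_000_000);
    println!("{:?} {:?}", doc.get_str(body), doc.get_u64(ts));
}
```

With this layout a document needs only two heap allocations regardless of how many values it holds, at the cost of an extra indirection and bounds checks on every read.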
Results
cargo run --example doc_mem
[examples/doc_mem.rs:21:5] std::mem::size_of::<TantivyDocument>() = 24
[examples/doc_mem.rs:22:5] std::mem::size_of::<DocContainerRef>() = 48
[examples/doc_mem.rs:23:5] std::mem::size_of::<OwnedValue>() = 48
[examples/doc_mem.rs:24:5] std::mem::size_of::<OwnedValueMedVec>() = 24
[examples/doc_mem.rs:25:5] std::mem::size_of::<ValueContainerRef>() = 12
[examples/doc_mem.rs:26:5] std::mem::size_of::<mediumvec::vec32::Vec32<u8>>() = 16
Peak Memory 42308307 : "hdfs TantivyDocument"
Peak Memory 28708435 : "hdfs TantivyDocumentMedVec "
Peak Memory 27555817 : "hdfs DocContainerRef "
Peak Memory 6555583 : "gh TantivyDocument"
Peak Memory 4668215 : "gh TantivyDocumentMedVec "
Peak Memory 3533176 : "gh DocContainerRef "
Conclusion
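There should be some easy gains from using 32-bit vecs, which take only 16 bytes instead of 24 bytes. In the runs above, TantivyDocumentMedVec reduced peak memory by about 32% on hdfs and about 29% on gh.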
DocContainerRef could provide additional gains, but adds some complexity.
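The relative gains can be checked with a few lines of arithmetic. The sketch below uses the Peak Memory values from the first result set; `reduction` is a helper introduced here for illustration only.

```rust
/// Fraction of peak memory saved relative to the baseline.
fn reduction(baseline: u64, variant: u64) -> f64 {
    1.0 - variant as f64 / baseline as f64
}

fn main() {
    // (label, TantivyDocument peak, variant peak), copied from the Results section.
    let rows = [
        ("hdfs TantivyDocumentMedVec", 42_308_307u64, 28_708_435u64),
        ("hdfs DocContainerRef", 42_308_307, 27_555_817),
        ("gh TantivyDocumentMedVec", 6_555_583, 4_668_215),
        ("gh DocContainerRef", 6_555_583, 3_533_176),
    ];
    for (name, baseline, variant) in rows.iter() {
        println!(
            "{:<28} {:.1}% less than TantivyDocument",
            name,
            reduction(*baseline, *variant) * 100.0
        );
    }
}
```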
https://github.com/quickwit-oss/quickwit/issues/4890
Peak Memory 42308307 : "hdfs TantivyDocument"
Peak Memory 28708435 : "hdfs TantivyDocumentMedVec"
Peak Memory 25155841 : "hdfs DocContainerRef"
Peak Memory 25456237 : "hdfs CompactDoc" // Current version in PR https://github.com/quickwit-oss/tantivy/pull/2402
Peak Memory 27857662 : "hdfs RkyvDoc" // zero deserialization rkyv
Peak Memory 21055858 : "hdfs PostcardDoc" // postcard serialized
Peak Memory 20106059 : "hdfs ZstdDoc" // postcard + Zstd
Peak Memory 22555843 : "hdfs BinarySerializable"
Peak Memory 25309370 : "hdfs JsonSerialized"
Peak Memory 6555583 : "gh TantivyDocument"
Peak Memory 4668215 : "gh TantivyDocumentMedVec"
Peak Memory 2735326 : "gh DocContainerRef"
Peak Memory 2543967 : "gh CompactDoc"
Peak Memory 3274042 : "gh RkyvDoc"
Peak Memory 2197615 : "gh PostcardDoc"
Peak Memory 862839 : "gh ZstdDoc"
Peak Memory 2325673 : "gh BinarySerialized"
Peak Memory 2508695 : "gh JsonSerialized"