
tantivy document memory experiment

Open PSeitz opened this issue 1 year ago • 1 comments

Some tests regarding the memory consumption of TantivyDocument.

Experiment

Parse the data set lines and store all Documents in a Vec.
hdfs: 3 fields (timestamp, body, severity); raw dataset 22MB.
gh: all fields in a JSON field (dynamic mode); raw dataset 2.3MB.

Note: The root-level fields in hdfs are stored as Field ids instead of strings.

Variant1: TantivyDocumentMedVec

Replace the Vec in OwnedValue with 32-bit versions of the Vec and drop the Facet and PreTokStr variants.
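To illustrate where the savings of a 32-bit vec come from, here is a minimal sketch of the header layout. `Vec32Header` and `header_sizes` are hypothetical names for illustration only, not the mediumvec implementation; the sketch assumes a 64-bit target, where a pointer is 8 bytes.

```rust
use std::mem::size_of;
use std::ptr::NonNull;

/// Hypothetical header of a vec that keeps length and capacity as u32.
/// This is an illustration only, not the actual mediumvec layout.
#[allow(dead_code)]
struct Vec32Header<T> {
    ptr: NonNull<T>, // 8 bytes on a 64-bit target
    len: u32,        // 4 bytes instead of 8
    cap: u32,        // 4 bytes instead of 8
}

/// Returns (size of std Vec<u8>, size of the 32-bit header) in bytes.
fn header_sizes() -> (usize, usize) {
    (size_of::<Vec<u8>>(), size_of::<Vec32Header<u8>>())
}
```

On a 64-bit target the std `Vec` header is three machine words (24 bytes), while the sketch above fits in 16 bytes, at the cost of limiting length and capacity to `u32::MAX` elements.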

Variant2: DocContainerRef

All nodes store their data in two vecs and only reference positions within them.

#[derive(Default)]
struct OwnedValueRefContainer {
    nodes: mediumvec::Vec32<ValueContainerRef>,
    node_data: mediumvec::Vec32<u8>,
}
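The idea of nodes referencing a shared byte buffer could be sketched as follows. This is a simplified illustration under assumptions: `NodeRef`, `RefContainer`, and their methods are hypothetical names, plain `Vec` stands in for `mediumvec::Vec32`, and only strings and u64 values are covered.

```rust
/// Hypothetical node: a kind tag plus a position in the shared `node_data` buffer.
#[derive(Clone, Copy, Debug, PartialEq)]
enum NodeRef {
    U64 { offset: u32 },
    Str { offset: u32, len: u32 },
}

/// Simplified stand-in for the container: all payload bytes live in one buffer.
#[derive(Default)]
struct RefContainer {
    nodes: Vec<NodeRef>,
    node_data: Vec<u8>,
}

impl RefContainer {
    /// Appends the string bytes to `node_data` and records their range.
    fn push_str(&mut self, s: &str) -> usize {
        let offset = self.node_data.len() as u32;
        self.node_data.extend_from_slice(s.as_bytes());
        self.nodes.push(NodeRef::Str { offset, len: s.len() as u32 });
        self.nodes.len() - 1
    }

    /// Appends the value as 8 little-endian bytes and records the offset.
    fn push_u64(&mut self, v: u64) -> usize {
        let offset = self.node_data.len() as u32;
        self.node_data.extend_from_slice(&v.to_le_bytes());
        self.nodes.push(NodeRef::U64 { offset });
        self.nodes.len() - 1
    }

    /// Resolves a string node back to a `&str` borrowed from `node_data`.
    fn get_str(&self, idx: usize) -> Option<&str> {
        match self.nodes.get(idx)? {
            NodeRef::Str { offset, len } => {
                let start = *offset as usize;
                let end = start + *len as usize;
                std::str::from_utf8(self.node_data.get(start..end)?).ok()
            }
            _ => None,
        }
    }

    /// Resolves a u64 node by decoding 8 bytes at its offset.
    fn get_u64(&self, idx: usize) -> Option<u64> {
        match self.nodes.get(idx)? {
            NodeRef::U64 { offset } => {
                let start = *offset as usize;
                let bytes = self.node_data.get(start..start + 8)?;
                let mut buf = [0u8; 8];
                buf.copy_from_slice(bytes);
                Some(u64::from_le_bytes(buf))
            }
            _ => None,
        }
    }
}
```

With this layout a document needs only two heap allocations regardless of how many values it holds, which is one plausible reason for the lower peak memory, though the sketch makes no claim about the actual DocContainerRef internals.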

Results

cargo run --example doc_mem
[examples/doc_mem.rs:21:5] std::mem::size_of::<TantivyDocument>() = 24
[examples/doc_mem.rs:22:5] std::mem::size_of::<DocContainerRef>() = 48
[examples/doc_mem.rs:23:5] std::mem::size_of::<OwnedValue>() = 48
[examples/doc_mem.rs:24:5] std::mem::size_of::<OwnedValueMedVec>() = 24
[examples/doc_mem.rs:25:5] std::mem::size_of::<ValueContainerRef>() = 12
[examples/doc_mem.rs:26:5] std::mem::size_of::<mediumvec::vec32::Vec32<u8>>() = 16
Peak Memory 42308307 : "hdfs TantivyDocument"
Peak Memory 28708435 : "hdfs TantivyDocumentMedVec "
Peak Memory 27555817 : "hdfs DocContainerRef "
Peak Memory 6555583 : "gh TantivyDocument"
Peak Memory 4668215 : "gh TantivyDocumentMedVec "
Peak Memory 3533176 : "gh DocContainerRef "
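The issue does not state how the Peak Memory figures were collected. One common way to obtain such numbers, shown here purely as an assumption, is a global allocator wrapper that tracks the high-water mark of live heap bytes. `PeakAlloc`, `reset_peak`, and `peak_bytes` are hypothetical names.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical allocator wrapper that records the peak of live heap bytes.
struct PeakAlloc;

static CURRENT: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for PeakAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Delegate to the system allocator, then update the counters.
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let now = CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: PeakAlloc = PeakAlloc;

/// Restarts peak tracking from the currently live byte count.
fn reset_peak() {
    PEAK.store(CURRENT.load(Ordering::Relaxed), Ordering::Relaxed);
}

/// Returns the highest live byte count seen since the last reset.
fn peak_bytes() -> usize {
    PEAK.load(Ordering::Relaxed)
}
```

Calling `reset_peak()` before loading a dataset and `peak_bytes()` afterwards would give one figure per variant, comparable in spirit to the lines above.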

Conclusion

There should be some easy gains from using 32-bit vecs, which use only 16 bytes instead of 24 bytes. DocContainerRef could provide additional gains, but adds some complexity.
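To put the gains in relative terms, the peak memory numbers from the results can be turned into percentage reductions. The helper below is a small illustrative sketch; `reduction_percent` is a hypothetical name.

```rust
/// Percentage by which `variant` is smaller than `baseline`.
fn reduction_percent(baseline: u64, variant: u64) -> f64 {
    (1.0 - variant as f64 / baseline as f64) * 100.0
}

/// Reductions for the TantivyDocumentMedVec variant, using the figures reported above.
fn medvec_reductions() -> (f64, f64) {
    let hdfs = reduction_percent(42_308_307, 28_708_435);
    let gh = reduction_percent(6_555_583, 4_668_215);
    (hdfs, gh)
}
```

For the reported figures this comes out at roughly 32% for hdfs and roughly 29% for gh when switching from TantivyDocument to TantivyDocumentMedVec.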

https://github.com/quickwit-oss/quickwit/issues/4890

PSeitz • Apr 23 '24 08:04

Peak Memory 42308307 : "hdfs TantivyDocument"
Peak Memory 28708435 : "hdfs TantivyDocumentMedVec"
Peak Memory 25155841 : "hdfs DocContainerRef"
Peak Memory 25456237 : "hdfs CompactDoc" // Current version in PR https://github.com/quickwit-oss/tantivy/pull/2402
Peak Memory 27857662 : "hdfs RkyvDoc"         // zero deserialization rkyv
Peak Memory 21055858 : "hdfs PostcardDoc" // postcard serialized
Peak Memory 20106059 : "hdfs ZstdDoc"         // postcard + Zstd
Peak Memory 22555843 : "hdfs BinarySerializable"
Peak Memory 25309370 : "hdfs JsonSerialized"
Peak Memory 6555583 : "gh TantivyDocument"
Peak Memory 4668215 : "gh TantivyDocumentMedVec"
Peak Memory 2735326 : "gh DocContainerRef"
Peak Memory 2543967 : "gh CompactDoc"
Peak Memory 3274042 : "gh RkyvDoc"
Peak Memory 2197615 : "gh PostcardDoc"
Peak Memory 862839 : "gh ZstdDoc"
Peak Memory 2325673 : "gh BinarySerialized"
Peak Memory 2508695 : "gh JsonSerialized"

PSeitz • May 20 '24 02:05