Results 25 comments of Pierre Barre

@PSeitz Would a reproduction be useful? I've been thinking about generating a 1B-doc segment from a minimal repo to see how things go.

> @Barre is there a rationale to having such gigantic segments? We recommend around 10 million docs per segment.

Thanks for the feedback on segment sizes! In my case, 10M would...

> I don't think this is true.

I was specifically thinking about the FST, which may become more efficient the more entries it contains.

> This is odd. The default...

Here's how I open my index:

```rust
let mut index = IndexBuilder::new()
    .schema(schema.clone())
    .settings(IndexSettings {
        docstore_compression: tantivy::store::Compressor::Lz4,
        docstore_compress_dedicated_thread: true,
        ..default::Default::default()
    })
    .open_or_create(directory)?;

let index_writer_options = IndexWriterOptions::builder()
    .num_merge_threads(num_cpus::get_physical())
    .num_worker_threads(num_cpus::get_physical())
    .memory_budget_per_thread(1_000_000_000)
    .build();
...
```

```rust
#[derive(Clone)]
pub struct Indexer {
    pub id: Field,
    pub text_indexing: Field,
    pub schema: Schema,
    pub index: Index,
    pub index_reader: IndexReader,
}

impl Indexer {
    pub fn new() -> anyhow::Result...
```

> To get segments that large, you should have overridden the default merge policy, or merged index on your own. You don't have code doing this?

I didn't do anything...

Just an update with logs I got today to show that this happens without doing anything weird with the merge policies:

```
tantivy::indexer::segment_updater] Starting merge - [Seg("990299f8"), Seg("6c119178"), Seg("e56906ea"), Seg("8e9b2d3f"),...
```

> The `LogMergePolicy` only checks if a single segment does not exceed 10 million docs, but not that a group does not exceed the global threshold of 1 billion.
>
> ...
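The gap described in the quote above can be illustrated with a small, self-contained sketch. It does not use tantivy's API or reproduce its implementation: `SegmentInfo`, `eligible_segments`, and `cap_merge_groups` are hypothetical names, and the thresholds are scaled down for readability. It only shows the difference between a per-segment check and a check on the total size of a merge group.

```rust
/// Hypothetical stand-in for segment metadata; only the doc count matters here.
#[derive(Debug, Clone, PartialEq)]
struct SegmentInfo {
    id: &'static str,
    num_docs: u64,
}

/// Per-segment check: keep only segments small enough to be merge candidates.
/// This says nothing about how large the merged result will be.
fn eligible_segments(segments: &[SegmentInfo], max_docs_per_segment: u64) -> Vec<SegmentInfo> {
    segments
        .iter()
        .filter(|s| s.num_docs <= max_docs_per_segment)
        .cloned()
        .collect()
}

/// Group-level check (assumed behavior, not tantivy's): greedily split the
/// candidates into groups whose combined doc count stays within
/// `max_docs_per_merge`.
fn cap_merge_groups(candidates: &[SegmentInfo], max_docs_per_merge: u64) -> Vec<Vec<SegmentInfo>> {
    let mut groups: Vec<Vec<SegmentInfo>> = Vec::new();
    let mut current: Vec<SegmentInfo> = Vec::new();
    let mut current_docs: u64 = 0;

    for seg in candidates {
        // Close the current group if adding this segment would exceed the cap.
        if !current.is_empty() && current_docs + seg.num_docs > max_docs_per_merge {
            groups.push(std::mem::take(&mut current));
            current_docs = 0;
        }
        current_docs += seg.num_docs;
        current.push(seg.clone());
    }
    if !current.is_empty() {
        groups.push(current);
    }

    // A merge needs at least two segments, so drop singleton groups.
    groups.retain(|g| g.len() >= 2);
    groups
}

fn main() {
    let segments = vec![
        SegmentInfo { id: "a", num_docs: 9_000_000 },
        SegmentInfo { id: "b", num_docs: 8_000_000 },
        SegmentInfo { id: "c", num_docs: 7_000_000 },
        SegmentInfo { id: "d", num_docs: 12_000_000 },
        SegmentInfo { id: "e", num_docs: 6_000_000 },
    ];

    // Scaled-down thresholds chosen for the example only.
    let eligible = eligible_segments(&segments, 10_000_000);
    for group in cap_merge_groups(&eligible, 20_000_000) {
        let ids: Vec<&str> = group.iter().map(|s| s.id).collect();
        let total: u64 = group.iter().map(|s| s.num_docs).sum();
        println!("merge {:?} -> {} docs", ids, total);
    }
}
```

With this sample data, the per-segment check alone would let `a`, `b`, `c`, and `e` merge into one 30M-doc segment. The group-level cap splits them into two merges of 17M and 13M docs, and `d` is skipped because it already exceeds the per-segment limit.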

If your use case doesn't require a roughly 1:1 mapping between filesystem files and S3 objects, I published https://github.com/Barre/ZeroFS, which should perform much better!

It does not have any other source, as this project aims to be more of a CT monitor than a pure data finder. I'd say that the main benefits would...