Pierre Barre comments

Results: 25 comments by Pierre Barre

Attempt to multiply with overflow

@PSeitz Would a reproduction be useful? I've been thinking about generating a 1B-doc segment from a minimal repo to see how things go.
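The issue title matches the message Rust prints when an integer multiplication overflows in a build with overflow checks enabled. As a generic illustration only (this is not tantivy's code path; the function names and the 8-bytes-per-document figure are assumptions made up for the example), the sketch below shows how a `u32` product that is fine at 10M documents no longer fits at 1B documents, and how `checked_mul` or widening to `u64` avoids the panic.

```rust
/// Hypothetical helper: size in bytes of `num_docs` fixed-width entries.
/// Returns `None` instead of panicking when the product does not fit in a `u32`.
fn bytes_needed_u32(num_docs: u32, bytes_per_doc: u32) -> Option<u32> {
    num_docs.checked_mul(bytes_per_doc)
}

/// Same computation, widened to `u64` before multiplying.
/// The product of two `u32` values always fits in a `u64`.
fn bytes_needed_u64(num_docs: u32, bytes_per_doc: u32) -> u64 {
    u64::from(num_docs) * u64::from(bytes_per_doc)
}
```

With these assumed numbers, `bytes_needed_u32(10_000_000, 8)` succeeds, while `bytes_needed_u32(1_000_000_000, 8)` returns `None` because 8 billion exceeds `u32::MAX`.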

Attempt to multiply with overflow

> @Barre is there a rationale to having such gigantic segments? We recommend around 10 million docs per segment.

Thanks for the feedback on segment sizes! In my case, 10M would...

Attempt to multiply with overflow

> I don't think this is true.

I was specifically thinking about the FST, which may become more efficient the more entries it contains.

> This is odd. The default...

Attempt to multiply with overflow

Here's how I open my index:

```rust
let mut index = IndexBuilder::new()
    .schema(schema.clone())
    .settings(IndexSettings {
        docstore_compression: tantivy::store::Compressor::Lz4,
        docstore_compress_dedicated_thread: true,
        ..default::Default::default()
    })
    .open_or_create(directory)?;

let index_writer_options = IndexWriterOptions::builder()
    .num_merge_threads(num_cpus::get_physical())
    .num_worker_threads(num_cpus::get_physical())
    .memory_budget_per_thread(1_000_000_000)
    .build();
...
```

Attempt to multiply with overflow

```rust
#[derive(Clone)]
pub struct Indexer {
    pub id: Field,
    pub text_indexing: Field,
    pub schema: Schema,
    pub index: Index,
    pub index_reader: IndexReader,
}

impl Indexer {
    pub fn new() -> anyhow::Result...
```

Attempt to multiply with overflow

> To get segments that large, you should have overridden the default merge policy, or merged index on your own. You don't have code doing this?

I didn't do anything...

Attempt to multiply with overflow

Just an update with logs I got today to show that this happens without doing anything weird with the merge policies:

```
tantivy::indexer::segment_updater] Starting merge - [Seg("990299f8"), Seg("6c119178"), Seg("e56906ea"), Seg("8e9b2d3f"),...
```

Attempt to multiply with overflow

> The `LogMergePolicy` only checks if a single segment does not exceed 10 Mio docs, but not that a group does not exceed the global threshold of 1 billion.

> ...
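The gap described in the quote can be sketched in a few lines. This is a hedged, standalone illustration and not tantivy's `LogMergePolicy` implementation: the constant and function names are hypothetical, and the two limits are simply the figures mentioned above (10 million docs per segment, 1 billion docs per group).

```rust
// Hypothetical limits, taken from the figures in the quoted comment.
const MAX_DOCS_PER_SEGMENT: u64 = 10_000_000;
const MAX_DOCS_PER_GROUP: u64 = 1_000_000_000;

/// Per-segment check only: every candidate is below the single-segment cap.
fn each_segment_ok(doc_counts: &[u64]) -> bool {
    doc_counts.iter().all(|&n| n <= MAX_DOCS_PER_SEGMENT)
}

/// Per-segment check plus a check on the combined size of the merge group.
fn group_ok(doc_counts: &[u64]) -> bool {
    each_segment_ok(doc_counts)
        && doc_counts.iter().sum::<u64>() <= MAX_DOCS_PER_GROUP
}
```

For example, 150 segments of 9 million docs each pass `each_segment_ok` but fail `group_ok`, since together they hold 1.35 billion docs.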

Proposal to Maintain Goofys in Different Fork

If your use case doesn't require a roughly 1:1 mapping between files and S3 objects, I published https://github.com/Barre/ZeroFS, which should also perform much better!

Add merklemap?

It does not have any other source, as this project aims to be more of a CT monitor than a pure data finder. I'd say that the main benefits would...