tantivy
Missing search results with in memory index
I've been working on a small command-line app to merge BibTeX files and have been using tantivy to index paper titles. I noticed some strange behavior where, from the Python extension, if I:
- Create an in-memory index
- Call `writer.add_document` in a for loop
- Call `writer.commit()` once, outside the for loop

then the search results I'd expect to be exact matches (for duplicate detection) do not show up. I reran the same code several times, and every so often I do get the correct results, which suggested to me that on commit some index entries were getting lost. If I move `writer.commit` inside the for loop, the issue goes away. Likewise, if I use a file-based index, the issue also goes away.
Reading the docs, it looks like even with an in-memory index, calling `writer.commit()` after the for loop rather than inside it should work. Is there something I'm missing, or might this be a bug?
If it's not a known issue already and you can't reproduce it, once I finish my CLI I can share the relevant code.
Thanks for the report. Can you provide something that reproduces the behavior in Rust? I'm not very familiar with Python.
Do you reload your IndexReader after committing? It may have a view on the old segments.
Hello @PSeitz, I've run into a similar problem. Here is a code snippet that reproduces the behavior:
```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);
    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .try_into()?;
    let mut index_writer = index.writer(10_000_000)?;
    index_writer.add_document(doc!(
        title => "The Old Man and the Sea"
    ))?;
    index_writer.commit()?;
    let query_parser = QueryParser::for_index(&index, vec![title]);
    let query = query_parser.parse_query("sea")?;
    let searcher = reader.searcher();
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    // Fails intermittently: the OnCommit reload runs asynchronously, so the
    // searcher may still have a view on the pre-commit (empty) index.
    assert!(!top_docs.is_empty());
    Ok(())
}
```
I guess the problem is that `ReloadPolicy::OnCommit` takes a few milliseconds to reload. I solved it by changing the reload policy to `Manual` and calling `reader.reload()` after `index_writer.commit()`.
Yes. You can reload the index manually though.
@PSeitz You added the good first issue label? Do you think we should do something here?
Shouldn't the reload be triggered synchronously?
Hi, I would like to take this as a good first issue and work on it. Thanks in advance
@GentBinaku Thanks. As a starting point, the callback in the `RamDirectory` is https://github.com/quickwit-oss/tantivy/blob/main/src/directory/ram_directory.rs#L234, which updates the `IndexReader`. Ideally the update would be done before returning from `commit()`. If that's not possible, we should document it.
Hi, I dug into the issue, and here is what I found:
- The behavior is present even if you use `MmapDirectory`; it's hidden by the test config (the line `const POLLING_INTERVAL: Duration = Duration::from_millis(if cfg!(test) { 1 } else { 500 });` in `file_watcher.rs`), which runs the reload thread every 1 ms.
- The fix on the RAM directory could be easy: in the implementation of `atomic_write`, add a `wait()` on `watch_router.broadcast()`, aligning it with the one implemented in `file_watcher.rs` (`let _ = callbacks.broadcast().wait();`).
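A sketch of what that alignment could look like. This is hypothetical: the function body and field names below are illustrative rather than copied from `ram_directory.rs`; only the `broadcast().wait()` pattern comes from the `file_watcher.rs` line quoted above.

```rust
// Hypothetical sketch of RamDirectory::atomic_write, not the actual source.
fn atomic_write(&self, path: &Path, data: &[u8]) -> io::Result<()> {
    self.fs.write().unwrap().write(path.to_path_buf(), data);
    if path == *META_FILEPATH {
        // Before: watch_router.broadcast() fired the reload callbacks without
        // waiting, so commit() could return before the IndexReader reloaded.
        // After: block until every callback has run, as FileWatcher does:
        let _ = self.fs.write().unwrap().watch_router.broadcast().wait();
    }
    Ok(())
}
```

The key point is that the write to `meta.json` and the callback completion become one synchronous step, so `commit()` cannot return with stale readers.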
As of today, the standard behavior is to reload the index asynchronously on commit, with a delay that depends on the directory implementation. Is it worth making the reload synchronous?
Doing so would require some changes (such as modifying or dropping `FileWatcher`), and I cannot evaluate the consequences.
While analyzing the code, I noticed that the directory-watching logic is used only by the `ReloadPolicy` feature. If there is no plan to add more events, it could be simplified, leading to cleaner code. I'd like to give it a try if you decide it's worth pursuing.
P.S. I've started digging in the code base for a few hours so I may have missed something.