tantivy
Missing search results with in memory index
I've been working on a small command-line app to merge BibTeX files and have been using tantivy to index paper titles. I noticed some strange behavior where, from the Python extension, if I:
- Create an in-memory index
- Call `writer.add_document` in a for loop
- Call `writer.commit()` once, outside the for loop

then the search results I'd expect to be exact matches (for duplicate detection) do not show up. I reran the same code several times, and every so often I do get the correct results, which suggested to me that on commit some index entries were getting lost. If I move `writer.commit` inside the for loop, the issue goes away. Likewise, if I use a file-based index, the issue also goes away.
Reading the docs, it looks like even with an in-memory index, calling `writer.commit()` after the for loop rather than inside it should work. Is there something I'm missing, or might this be a bug?
If it's not a known issue already and you can't reproduce it, once I finish my CLI I can share the relevant code.
Thanks for the report. Can you provide something that reproduces the behavior in Rust? I'm not very familiar with Python.
Do you reload your IndexReader after committing? It may have a view on the old segments.
Hello @PSeitz, I've run into a similar problem. Here is a code snippet that reproduces the behavior:
```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);
    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .try_into()?;
    let mut index_writer = index.writer(10_000_000)?;
    index_writer.add_document(doc!(
        title => "The Old Man and the Sea"
    ))?;
    index_writer.commit()?;
    let query_parser = QueryParser::for_index(&index, vec![title]);
    let query = query_parser.parse_query("sea")?;
    let searcher = reader.searcher();
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    // Fails intermittently: the OnCommit reload runs asynchronously, so the
    // searcher may still have a view on the pre-commit (empty) index.
    assert!(!top_docs.is_empty());
    Ok(())
}
```
I guess the problem is that `ReloadPolicy::OnCommit` takes a few milliseconds to reload. I solved it by changing the reload policy to `Manual` and calling `reader.reload()` after `index_writer.commit()`.
Yes. You can reload the index manually though.
@PSeitz You added the good first issue label? Do you think we should do something here?
Shouldn't the reload be triggered synchronously?
Hi, I would like to take this as a good first issue and work on it. Thanks in advance
@GentBinaku Thanks. As a starting point, the callback in the `RamDirectory` is https://github.com/quickwit-oss/tantivy/blob/main/src/directory/ram_directory.rs#L234, which updates the `IndexReader`. Ideally the update would be done before returning from `commit()`. If that's not possible, we should document it.
Hi, I dug into the issue, and here is what I found:
- The behavior is present even if you use `MmapDirectory`; it's hidden by the test config (the line `const POLLING_INTERVAL: Duration = Duration::from_millis(if cfg!(test) { 1 } else { 500 });` in `file_watcher.rs`), which runs the reload thread every 1 ms.
- The fix on the RAM directory could be easy: in the implementation of `atomic_write`, add a `wait()` on `watch_router.broadcast()`, aligning it with the one implemented in `file_watcher.rs` (`let _ = callbacks.broadcast().wait();`).
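A sketch of what that alignment could look like. This is hypothetical: the function body and field names below are illustrative rather than copied from `ram_directory.rs`; only the `broadcast().wait()` pattern comes from the `file_watcher.rs` line quoted above.

```rust
// Hypothetical sketch of RamDirectory::atomic_write, not the actual source.
fn atomic_write(&self, path: &Path, data: &[u8]) -> io::Result<()> {
    self.fs.write().unwrap().write(path.to_path_buf(), data);
    if path == *META_FILEPATH {
        // Before: watch_router.broadcast() fired the reload callbacks without
        // waiting, so commit() could return before the IndexReader reloaded.
        // After: block until every callback has run, as FileWatcher does:
        let _ = self.fs.write().unwrap().watch_router.broadcast().wait();
    }
    Ok(())
}
```

The key point is that the write to `meta.json` and the callback completion become one synchronous step, so `commit()` cannot return with stale readers.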
As of today, the standard behavior is to reload the index asynchronously on commit, with a delay that depends on the directory implementation. Is it worth making the reload synchronous?
Doing so would require some changes (such as modifying or dropping `FileWatcher`), and I cannot evaluate the consequences.
While analyzing the code, I noticed that the directory-watching logic is used only by the `ReloadPolicy` feature. If there is no plan to add more events, it could be simplified, leading to cleaner code. I'd like to give it a try if you decide it's worth pursuing.
P.S. I've started digging in the code base for a few hours so I may have missed something.