tantivy-py
`writer.delete_documents` on tokenized fields behaves unintuitively
Add three documents with doc_id = test-1, test-2, test-3, then:

```python
writer.delete_documents(field_name="doc_id", field_value="test-1")
writer.commit()
writer.wait_merging_threads()
index.reload()
```
test-1 can still be found through search...
If you have time, we would appreciate a working code snippet that is easy to run to reproduce the issue. See http://www.sscce.org/
I only started using Tantivy-py this weekend so I might be doing something wrong, but this doesn't behave as expected from reading the API:
```python
import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("doc", stored=True)
schema = schema_builder.build()

index = tantivy.Index(schema)
writer = index.writer()

docs = ["1", "2", "3"]
for d in docs:
    writer.add_document(tantivy.Document(doc=d))
writer.commit()
index.reload()

searcher = index.searcher()

print("Before delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)

print("Deleting 1 and 2")
print(writer.delete_documents("doc", "1"))
print(writer.delete_documents("doc", "2"))
writer.commit()
index.reload()

print("After delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)
```
Output is:

```
Before delete:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Deleting 1 and 2
8
9
After delete:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
```
Refreshing the searcher gives the expected behaviour:

```python
print("Deleting 1 and 2")
print(writer.delete_documents("doc", "1"))
print(writer.delete_documents("doc", "2"))
writer.commit()
index.reload()
searcher = index.searcher()  # <-- refresh the searcher
print("After delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)
```
gives:

```
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Deleting 1 and 2
8
9
After delete:
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [(0.28768212, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
```
Maybe just user error, then.
Both I and the original reporter were bitten by term queries not matching values containing characters like `-` and `_`, so the documents weren't actually deleted.
If you change the values 1, 2, 3 to 1, 2_2, 3-3 in the example above, only the first delete works as expected; searching with a phrase query still finds the other documents after the attempted delete.
This illustrates it:

```python
import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("doc", stored=True)
schema = schema_builder.build()

index = tantivy.Index(schema)
writer = index.writer()

docs = ["11", "1-2", "1_3", "1.4"]
for d in docs:
    writer.add_document(tantivy.Document(doc=d))
writer.commit()
index.reload()

searcher = index.searcher()
print("Before delete:")
for d in docs:
    query = index.parse_query("doc:{}".format(d), ["doc"])
    top_docs = searcher.search(query)
    print(top_docs)

print("Deleting everything:")
for d in docs:
    print(writer.delete_documents("doc", d))
writer.commit()
index.reload()

searcher = index.searcher()
print("After delete:")
for d in docs:
    query = index.parse_query("doc:{}".format(d), ["doc"])
    top_docs = searcher.search(query)
    print(top_docs)
```
which gives:

```
Before delete:
SearchResult(hits: [(1.4599355, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 3, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
Deleting everything:
10
11
12
13
After delete:
SearchResult(hits: [], count: 0)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
```
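The pattern in the output above is consistent with a default tokenizer that splits on non-alphanumeric characters: only `"11"` survives as a single indexed term, so only its delete can match. A rough stdlib-only approximation (this is an illustration, not tantivy's actual tokenizer, whose exact rules may differ):

```python
import re

def simple_tokenize(text):
    # Split on runs of non-alphanumeric characters and lowercase,
    # roughly what a default "simple" tokenizer does.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

# "11" is indexed as a single term, so deleting by the term "11" can match.
print(simple_tokenize("11"))   # ['11']

# The others are split, so the terms "1-2", "1_3", "1.4" never exist
# in the index and deleting by those values is a no-op.
print(simple_tokenize("1-2"))  # ['1', '2']
print(simple_tokenize("1_3"))  # ['1', '3']
print(simple_tokenize("1.4"))  # ['1', '4']
```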
Just a quick follow-up: changing the tokenizer to `raw` gives the expected behavior

```python
schema_builder.add_text_field("doc", stored=True, tokenizer_name="raw")
```

and everything is deleted:

```
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
```
@Fudge Thanks for looking at this ❤️
The first thing I am most interested to know is whether the deletion behaviour in tantivy-py behaves differently than the upstream tantivy crate. This might be tricky for you to investigate if you're not used to Rust.
I haven't looked into this yet but I've been following your investigation. I wonder whether we can:

- inside `delete_documents()`,
- if the target field is text,
- and the field has a tokenizer,
- apply the tokenizer to the given values,
- and then continue to call the underlying delete function.

Does this sound like it would fix the issue?
I've been using delete_documents in tantivy always on int fields (like a doc_id field) and that always works. So yeah sounds like the tokenizer is the issue here.
I don't think applying the tokenizer to the value would give the expected behavior in this case, as trying to delete `version-1.1.0` would also delete `version-1.10` and `version-11.0`, for example.
delete_documents() on tokenized text fields is not intuitive, and should come with a warning. :-)
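The over-deletion hazard described above can be sketched with a rough tokenizer approximation (the `simple_tokenize` helper below is an illustration, not tantivy's tokenizer): tokenizing the value to delete yields generic terms that the other version strings also contain.

```python
import re

def simple_tokenize(text):
    # Rough stand-in for a default tokenizer: split on non-alphanumerics.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

docs = ["version-1.1.0", "version-1.10", "version-11.0"]

# Tokenizing the value to delete yields generic terms...
delete_terms = set(simple_tokenize("version-1.1.0"))
print(delete_terms)  # terms 'version', '1', '0' (set order may vary)

# ...and deleting every document containing any of those terms
# would wipe out all three versions, not just version-1.1.0.
for d in docs:
    hit = bool(delete_terms & set(simple_tokenize(d)))
    print(d, "would be deleted:", hit)
```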
Even for integer fields, `delete_documents` does not work by default. It turns out that for all non-text fields there is an `indexed` flag which needs to be enabled for deletion to work:

```python
schema_builder.add_integer_field("doc_id", stored=True, indexed=True)
```
> there is an indexed flag which needs to be enabled for deletion to work.
Is this also how tantivy works? If so, we're not going to change the behaviour although we could certainly add documentation to warn about it.
The behaviour you show for non-indexed fields is different to what this issue is about though, which has to do with how the field tokenizer affects matching during delete. I'll edit the issue title to make that clear. The non-indexed behaviour should either be a separate issue, or if this happens also with upstream tantivy, an issue there. I suspect they will mark it as a documentation issue though.
You are right for non-text fields this is simply a documentation issue.
I have merged #456. This change introduces `delete_documents_by_query()`, which does do tokenization, and `delete_documents_by_term()`, which does not. The latter is in fact the old `delete_documents()` method under a new name. I have put a deprecation notice on the `delete_documents()` method and we'll remove that interface in a future version.
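The semantic difference between the two deletion styles can be sketched with a toy in-memory model (this is only an illustration of the matching semantics; `ToyIndex`, its methods, and the tokenizer below are made up for this sketch and are not tantivy's implementation):

```python
import re

def default_tokenize(text):
    # Rough stand-in for the default tokenizer: split on non-alphanumerics.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

class ToyIndex:
    """Toy one-field inverted index, modelling only the matching semantics."""

    def __init__(self, docs, tokenizer=default_tokenize):
        self.tokenizer = tokenizer
        self.docs = {d: tokenizer(d) for d in docs}

    def delete_by_term(self, term):
        # Term deletes match the given value verbatim against indexed tokens.
        self.docs = {d: t for d, t in self.docs.items() if term not in t}

    def delete_by_query(self, text):
        # Query deletes tokenize the value first, then match the tokens.
        wanted = self.tokenizer(text)
        self.docs = {d: t for d, t in self.docs.items()
                     if not all(w in t for w in wanted)}

idx = ToyIndex(["1-2", "11"])

# No indexed token "1-2" exists under the default tokenizer,
# so a term delete silently deletes nothing -- the original bug.
idx.delete_by_term("1-2")
print(sorted(idx.docs))  # ['1-2', '11']

# A query delete tokenizes "1-2" to ['1', '2'] and matches the document.
idx.delete_by_query("1-2")
print(sorted(idx.docs))  # ['11']
```

With the `raw` tokenizer (one token per document), the term delete would have matched directly, which is consistent with the earlier workaround in this thread.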