tantivy-py
`writer.delete_documents` on tokenized fields behaves unintuitively
Add three documents with doc_id = test-1, test-2, test-3, then:

```python
writer.delete_documents(field_name="doc_id", field_value="test-1")
writer.commit()
writer.wait_merging_threads()
index.reload()
```
test-1 can still be found through search...
If you have time, we would appreciate a working code snippet that is easy to run to reproduce the issue. See http://www.sscce.org/
I only started using Tantivy-py this weekend so I might be doing something wrong, but this doesn't behave as expected from reading the API:
```python
import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("doc", stored=True)
schema = schema_builder.build()

index = tantivy.Index(schema)
writer = index.writer()

docs = ["1", "2", "3"]
for d in docs:
    writer.add_document(tantivy.Document(doc=d))
writer.commit()
index.reload()

searcher = index.searcher()

print("Before delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)

print("Deleting 1 and 2")
print(writer.delete_documents("doc", "1"))
print(writer.delete_documents("doc", "2"))
writer.commit()
index.reload()

print("After delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)
```
Output is:

```
Before delete:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Deleting 1 and 2
8
9
After delete:
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
```
Refreshing the searcher gives the expected behaviour:

```python
print("Deleting 1 and 2")
print(writer.delete_documents("doc", "1"))
print(writer.delete_documents("doc", "2"))
writer.commit()
index.reload()
searcher = index.searcher()  # <-- refresh the searcher
print("After delete:")
for d in docs:
    query = tantivy.Query.term_query(schema, "doc", d)
    top_docs = searcher.search(query)
    print(top_docs)
```
gives:

```
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(0.9808292, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
Deleting 1 and 2
8
9
After delete:
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [(0.28768212, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
```
Maybe just user error, then.
Both I and the original reporter were bitten by term queries not matching values containing characters like `-` and `_`, so the documents weren't actually deleted.
If you change the values 1, 2, 3 to 1, 2_2, 3-3 in the example above, only the first delete works as expected; searching with a phrase query still finds the other documents after the attempted delete.
This illustrates it:

```python
import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("doc", stored=True)
schema = schema_builder.build()

index = tantivy.Index(schema)
writer = index.writer()

docs = ["11", "1-2", "1_3", "1.4"]
for d in docs:
    writer.add_document(tantivy.Document(doc=d))
writer.commit()
index.reload()

searcher = index.searcher()
print("Before delete:")
for d in docs:
    query = index.parse_query("doc:{}".format(d), ["doc"])
    top_docs = searcher.search(query)
    print(top_docs)

print("Deleting everything:")
for d in docs:
    print(writer.delete_documents("doc", d))
writer.commit()
index.reload()

searcher = index.searcher()
print("After delete:")
for d in docs:
    query = index.parse_query("doc:{}".format(d), ["doc"])
    top_docs = searcher.search(query)
    print(top_docs)
```
which gives:

```
Before delete:
SearchResult(hits: [(1.4599355, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 3, doc: 0 })], count: 1)
SearchResult(hits: [(1.474477, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
Deleting everything:
10
11
12
13
After delete:
SearchResult(hits: [], count: 0)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 0, doc: 0 })], count: 1)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 2, doc: 0 })], count: 1)
SearchResult(hits: [(1.1143606, DocAddress { segment_ord: 1, doc: 0 })], count: 1)
```
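The pattern in the output above is consistent with a default tokenizer that splits on non-alphanumeric characters: only `"11"` survives as a single indexed term, so only its delete can match. A rough stdlib-only approximation (this is an illustration, not tantivy's actual tokenizer, whose exact rules may differ):

```python
import re

def simple_tokenize(text):
    # Split on runs of non-alphanumeric characters and lowercase,
    # roughly what a default "simple" tokenizer does.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

# "11" is indexed as a single term, so deleting by the term "11" can match.
print(simple_tokenize("11"))   # ['11']

# The others are split, so the terms "1-2", "1_3", "1.4" never exist
# in the index and deleting by those values is a no-op.
print(simple_tokenize("1-2"))  # ['1', '2']
print(simple_tokenize("1_3"))  # ['1', '3']
print(simple_tokenize("1.4"))  # ['1', '4']
```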
Just a quick follow-up: changing the tokenizer to `raw` gives the expected behavior

```python
schema_builder.add_text_field("doc", stored=True, tokenizer_name="raw")
```

and everything is deleted:

```
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
SearchResult(hits: [], count: 0)
```
@Fudge Thanks for looking at this ❤️
The first thing I am most interested to know is whether the deletion behaviour in tantivy-py behaves differently than the upstream tantivy crate. This might be tricky for you to investigate if you're not used to Rust.
I haven't looked into this yet but I've been following your investigation. I wonder whether we can:

- inside `delete_documents()`,
- if the target field is text,
- and the field has a tokenizer,
- apply the tokenizer to the given values,
- and then continue to call the underlying delete function.

Does this sound like it would fix the issue?
I've been using delete_documents in tantivy always on int fields (like a doc_id field) and that always works. So yeah sounds like the tokenizer is the issue here.
I don't think applying the tokenizer to the value would give the expected behavior in this case, as trying to delete `version-1.1.0` would also delete `version-1.10` and `version-11.0`, for example.
delete_documents() on tokenized text fields is not intuitive, and should come with a warning. :-)
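The over-deletion hazard described above can be sketched with a rough tokenizer approximation (the `simple_tokenize` helper below is an illustration, not tantivy's tokenizer): tokenizing the value to delete yields generic terms that the other version strings also contain.

```python
import re

def simple_tokenize(text):
    # Rough stand-in for a default tokenizer: split on non-alphanumerics.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

docs = ["version-1.1.0", "version-1.10", "version-11.0"]

# Tokenizing the value to delete yields generic terms...
delete_terms = set(simple_tokenize("version-1.1.0"))
print(delete_terms)  # terms 'version', '1', '0' (set order may vary)

# ...and deleting every document containing any of those terms
# would wipe out all three versions, not just version-1.1.0.
for d in docs:
    hit = bool(delete_terms & set(simple_tokenize(d)))
    print(d, "would be deleted:", hit)
```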
Even for integer fields, `delete_documents` does not work by default. It turns out that for all non-text fields there is an `indexed` flag which needs to be enabled for deletion to work:

```python
schema_builder.add_integer_field("doc_id", stored=True, indexed=True)
```
> there is an indexed flag which needs to be enabled for deletion to work.
Is this also how tantivy works? If so, we're not going to change the behaviour although we could certainly add documentation to warn about it.
The behaviour you show for non-indexed fields is different to what this issue is about though, which has to do with how the field tokenizer affects matching during delete. I'll edit the issue title to make that clear. The non-indexed behaviour should either be a separate issue, or if this happens also with upstream tantivy, an issue there. I suspect they will mark it as a documentation issue though.
You are right for non-text fields this is simply a documentation issue.
I have merged #456. This change introduces `delete_documents_by_query()`, which does do tokenization, and `delete_documents_by_term()`, which does not. The latter is in fact the old `delete_documents()` method under a new name. I have put a deprecation notice on the `delete_documents()` method and we'll remove that interface in a future version.
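The semantic difference between the two deletion styles can be sketched with a toy in-memory model (this is only an illustration of the matching semantics; `ToyIndex`, its methods, and the tokenizer below are made up for this sketch and are not tantivy's implementation):

```python
import re

def default_tokenize(text):
    # Rough stand-in for the default tokenizer: split on non-alphanumerics.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

class ToyIndex:
    """Toy one-field inverted index, modelling only the matching semantics."""

    def __init__(self, docs, tokenizer=default_tokenize):
        self.tokenizer = tokenizer
        self.docs = {d: tokenizer(d) for d in docs}

    def delete_by_term(self, term):
        # Term deletes match the given value verbatim against indexed tokens.
        self.docs = {d: t for d, t in self.docs.items() if term not in t}

    def delete_by_query(self, text):
        # Query deletes tokenize the value first, then match the tokens.
        wanted = self.tokenizer(text)
        self.docs = {d: t for d, t in self.docs.items()
                     if not all(w in t for w in wanted)}

idx = ToyIndex(["1-2", "11"])

# No indexed token "1-2" exists under the default tokenizer,
# so a term delete silently deletes nothing -- the original bug.
idx.delete_by_term("1-2")
print(sorted(idx.docs))  # ['1-2', '11']

# A query delete tokenizes "1-2" to ['1', '2'] and matches the document.
idx.delete_by_query("1-2")
print(sorted(idx.docs))  # ['11']
```

With the `raw` tokenizer (one token per document), the term delete would have matched directly, which is consistent with the earlier workaround in this thread.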