tantivy
NOT Operator with Phrase Query Returns Empty Results
When using Tantivy (v0.22) to index and query email data, the NOT operator with a phrase query returns an empty result set, even when matching documents exist.
let mut builder = Schema::builder();
let account = builder.add_text_field("account", STRING | STORED | FAST);
let mailbox = builder.add_text_field("mailbox", STRING | STORED | FAST);
let subject = builder.add_text_field("subject", custom()); // ngram3 tokenizer

fn custom() -> TextOptions {
    TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer("ngram3")
                .set_index_option(IndexRecordOption::WithFreqsAndPositions),
        )
        .set_stored()
}
Steps to Reproduce:

1. Index documents with fields account, mailbox, and subject.
2. Run query: account:asdasd AND mailbox:iuhiuhihu AND subject:"苹果手机". This returns the correct documents with the phrase "苹果手机" in subject.
3. Run query: account:asdasd AND mailbox:iuhiuhihu AND NOT subject:"苹果手机". This returns empty results, even though documents matching account:asdasd AND mailbox:iuhiuhihu exist without "苹果手机" in subject.
Expected Behavior:
The second query should return documents where account:asdasd and mailbox:iuhiuhihu match, and subject does not contain "苹果手机".
Actual Behavior:
Empty result set.
Additional Observations:
The base query account:asdasd AND mailbox:iuhiuhihu works as expected. NOT subject:苹果 (a single term) also returns an empty set.

Environment:
Tantivy: 0.22
Rust: 1.84.1
How can I query for documents where subject does not contain the phrase "苹果手机"? Willing to provide sample data or logs if needed.
Hello @inboxsphere
The problem does not come from the handling of the NOT operator; it comes from a bad interaction with the ngram tokenizer.
The ngram tokenizer is used both at indexing time and at query time, so the phrase itself gets split into ngrams:

subject:"苹果手机" -> subject:ngram1 OR subject:ngram2 OR ...
I suspect you created your ngram tokenizer as NgramTokenizer::all_grams(1, 3)?
If you change the tokenizer, it might work as intended. There are several Chinese tokenizers available for tantivy.
For instance, you could use lindera (the crate is named lindera-tantivy) with a Chinese tokenizer.
Another possible approach would be to write a tokenizer that just emits each kanji (CJK character) as its own token, and use phrase queries.
@fulmicoton I’m currently using NgramTokenizer::new(3, 3, false) for the subject field. My use case involves emails in various mainstream languages worldwide (e.g., English, Chinese, Japanese, etc.), often mixed within a single email (like Chinese-English combos). I can’t specify a tokenizer per email. Is there a tokenizer that can effectively handle such multilingual and mixed-language scenarios?