NOT Operator with Phrase Query Returns Empty Results

Open rustmailer opened this issue 9 months ago • 2 comments

When using Tantivy (v0.22) to index and query email data, the NOT operator with a phrase query returns an empty result set, even when matching documents exist.

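use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions, FAST, STORED, STRING};

// Options for the subject field: indexed with the "ngram3" tokenizer
// (frequencies and positions recorded) and stored.
fn custom() -> TextOptions {
    let indexing = TextFieldIndexing::default()
        .set_tokenizer("ngram3")
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    TextOptions::default()
        .set_indexing_options(indexing)
        .set_stored()
}

let mut builder = Schema::builder();
let account = builder.add_text_field("account", STRING | STORED | FAST);
let mailbox = builder.add_text_field("mailbox", STRING | STORED | FAST);
let subject = builder.add_text_field("subject", custom()); // ngram3 tokenizer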

Steps to Reproduce:

1. Index documents with the fields account, mailbox, and subject.
2. Run the query: account:asdasd AND mailbox:iuhiuhihu AND subject:"苹果手机"
   Returns the correct documents, with the phrase "苹果手机" in subject.
3. Run the query: account:asdasd AND mailbox:iuhiuhihu AND NOT subject:"苹果手机"
   Returns empty results, even though documents matching account:asdasd AND mailbox:iuhiuhihu exist without "苹果手机" in subject.

Expected Behavior:

The second query should return documents where account:asdasd and mailbox:iuhiuhihu match, and subject does not contain "苹果手机".
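
The expected semantics can be written down as plain set operations on document ids. This is only a sketch of what the query is meant to compute, not of how Tantivy evaluates it; the function name and the use of BTreeSet are assumptions made for illustration.

```rust
use std::collections::BTreeSet;

/// Doc ids matching `a AND b AND NOT c`, given the doc ids that match
/// each clause on its own. Illustrative helper, not a Tantivy API.
fn and_and_not(
    a: &BTreeSet<u32>,
    b: &BTreeSet<u32>,
    c: &BTreeSet<u32>,
) -> BTreeSet<u32> {
    a.intersection(b)                 // docs matching both positive clauses
        .filter(|id| !c.contains(*id)) // drop docs matching the negated clause
        .copied()
        .collect()
}
```

If the negated clause matches no document at all, the result should equal the result of the base query, which is why an empty result set is surprising here.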

Actual Behavior:

Empty result set.

Additional Observations:

- Base query account:asdasd AND mailbox:iuhiuhihu works as expected.
- NOT subject:苹果 (single term) also returns an empty set.

Environment:

- Tantivy: 0.22
- Rust: 1.84.1

How can I query for documents where subject does not contain the phrase "苹果手机"? Willing to provide sample data or logs if needed.

rustmailer, Feb 21 '25 08:02

Hello @inboxsphere

The problem does not come from the handling of the NOT operator, but from a bad interaction with the ngram tokenizer.

The ngram tokenizer is used both at indexing time and at query time, so the query text is itself split into ngrams:

subject:"苹果手机" -> subject:ngram1 OR subject:ngram2 OR ...

I suspect you created your ngram tokenizer as NgramTokenizer::all_grams(1, 3)?
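
To see which grams a query string breaks into, here is a small standalone sketch. It is not Tantivy's NgramTokenizer; it assumes grams are counted in Unicode characters and emitted by start position, then by length. The function name char_ngrams is hypothetical.

```rust
/// Returns every character n-gram of `text` whose length is in `min..=max`,
/// ordered by start position, then by length.
/// Illustrative only; not the Tantivy implementation.
fn char_ngrams(text: &str, min: usize, max: usize) -> Vec<String> {
    // Work on chars rather than bytes so multi-byte CJK text is handled.
    let chars: Vec<char> = text.chars().collect();
    let mut grams = Vec::new();
    for start in 0..chars.len() {
        for len in min..=max {
            if start + len > chars.len() {
                break; // longer grams would run past the end of the text
            }
            grams.push(chars[start..start + len].iter().collect());
        }
    }
    grams
}
```

With trigrams only, char_ngrams("苹果手机", 3, 3) gives ["苹果手", "果手机"], while the two-character term 苹果 gives no trigram at all. With lengths 1 to 3, the same four-character phrase gives nine grams.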

If you change the tokenizer, it might work as intended. There are several Chinese tokenizers available for tantivy. For instance, you could use lindera (the crate is named lindera-tantivy) with a Chinese tokenizer.

Another possible approach would be to write a tokenizer that emits each CJK character (kanji/hanzi) as its own token, and then use phrase queries.
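
A minimal sketch of the splitting logic such a tokenizer could use, under stated assumptions: the Unicode ranges in is_cjk are a rough, non-exhaustive choice, both function names are hypothetical, and wiring this into Tantivy's Tokenizer trait is not shown.

```rust
/// Rough check for CJK ideographs, hiragana and katakana.
/// The ranges are an illustrative assumption, not an exhaustive list.
fn is_cjk(c: char) -> bool {
    matches!(c,
        '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{3400}'..='\u{4DBF}' // CJK Unified Ideographs Extension A
        | '\u{3040}'..='\u{30FF}' // Hiragana and Katakana
    )
}

/// Splits `text` so that each CJK character is its own token and each run
/// of other alphanumeric characters becomes one lowercased token.
/// Everything else (spaces, punctuation) only separates tokens.
fn mixed_tokens(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut word = String::new();
    for c in text.chars() {
        if is_cjk(c) {
            // Flush any pending non-CJK word before the CJK character.
            if !word.is_empty() {
                tokens.push(std::mem::take(&mut word));
            }
            tokens.push(c.to_string());
        } else if c.is_alphanumeric() {
            word.extend(c.to_lowercase());
        } else if !word.is_empty() {
            tokens.push(std::mem::take(&mut word));
        }
    }
    if !word.is_empty() {
        tokens.push(word);
    }
    tokens
}
```

For a mixed subject such as "新款苹果手机 iPhone 16", this yields one token per Chinese character followed by "iphone" and "16", so a phrase query on "苹果手机" would look for four adjacent single-character tokens.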

fulmicoton, Feb 21 '25 09:02

@fulmicoton I’m currently using NgramTokenizer::new(3, 3, false) for the subject field. My use case involves emails in various mainstream languages worldwide (e.g., English, Chinese, Japanese, etc.), often mixed within a single email (like Chinese-English combos). I can’t specify a tokenizer per email. Is there a tokenizer that can effectively handle such multilingual and mixed-language scenarios?

rustmailer, Feb 21 '25 09:02