Natural language queries exhibit unexpected behavior when processing Chinese text.
Describe the bug
Currently, the natural language query feature works well for English text. For example, given a query like "(Who is Obama) OR (good boy)", Tantivy parses it into a BooleanQuery in which each subquery is composed of TermQuery clauses:
BooleanQuery {
    subqueries: [
        (Should, BooleanQuery {
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "who"))),
                (Should, TermQuery(Term(field=1, type=Str, "is"))),
                (Should, TermQuery(Term(field=1, type=Str, "obama")))
            ]
        }),
        (Should, BooleanQuery {
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "good"))),
                (Should, TermQuery(Term(field=1, type=Str, "boy")))
            ]
        })
    ]
}
This looks reasonable. With Chinese text, however, the behavior is unexpected. For example, when parsing the query "(Who is Obama) OR 伊文斯隐瞒秘密", Tantivy interprets the Chinese part as a PhraseQuery:
BooleanQuery {
    subqueries: [
        (Should, BooleanQuery {
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "who"))),
                (Should, TermQuery(Term(field=1, type=Str, "is"))),
                (Should, TermQuery(Term(field=1, type=Str, "obama")))
            ]
        }),
        (Should, PhraseQuery {
            field: Field(1),
            phrase_terms: [
                (0, Term(field=1, type=Str, "伊文")),
                (1, Term(field=1, type=Str, "伊文斯")),
                (2, Term(field=1, type=Str, "隐瞒")),
                (3, Term(field=1, type=Str, "秘密"))
            ],
            slop: 0
        })
    ]
}
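One plausible explanation is that the English words are separated by whitespace, so the parser sees each one as its own chunk, while the Chinese text arrives as a single chunk that the tokenizer then splits into several tokens. The sketch below is a simplified, hypothetical model of that decision, not Tantivy's actual code; `Literal` and `literal_for_chunk` are names invented here for illustration, and the rule it encodes (several tokens from one chunk become a phrase) is an assumption based on the output above.

```rust
/// Simplified stand-in for what a query parser might emit for one
/// whitespace-delimited chunk of the query string (hypothetical type).
#[derive(Debug, PartialEq)]
enum Literal {
    /// The tokenizer produced no tokens for this chunk.
    Empty,
    /// Exactly one token: a plain term query.
    Term(String),
    /// Several tokens from a single chunk: positions are kept and the
    /// tokens are treated as a phrase.
    Phrase(Vec<(usize, String)>),
}

/// Assumed rule: the number of tokens produced from ONE chunk decides
/// whether the chunk becomes a term or a phrase.
fn literal_for_chunk(tokens: &[&str]) -> Literal {
    match tokens {
        [] => Literal::Empty,
        [only] => Literal::Term(only.to_string()),
        many => Literal::Phrase(
            many.iter()
                .enumerate()
                .map(|(position, token)| (position, token.to_string()))
                .collect(),
        ),
    }
}
```

Under this model, `literal_for_chunk(&["obama"])` yields a `Term`, whereas `literal_for_chunk(&["伊文", "伊文斯", "隐瞒", "秘密"])` yields a `Phrase` with positions 0 to 3, which matches the shape of the output above.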
This differs from what we expect. When parsing Chinese, we expect the parser to combine the individual tokens with Should as well, as shown in the expected output below:
BooleanQuery {
    subqueries: [
        (Should, BooleanQuery {
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "who"))),
                (Should, TermQuery(Term(field=1, type=Str, "is"))),
                (Should, TermQuery(Term(field=1, type=Str, "obama")))
            ]
        }),
        (Should, BooleanQuery {
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "伊文"))),
                (Should, TermQuery(Term(field=1, type=Str, "伊文斯"))),
                (Should, TermQuery(Term(field=1, type=Str, "隐瞒"))),
                (Should, TermQuery(Term(field=1, type=Str, "秘密")))
            ]
        })
    ]
}
Which version of tantivy are you using?
Our tantivy-search is based on Tantivy 0.21.1.
To Reproduce
Out of the box, Tantivy may not support Chinese tokenization: with the default tokenizer, "伊文斯隐瞒秘密" is treated as a single token. We have integrated the Cang-jie and ICU tokenizers into tantivy-search, both of which tokenize Chinese text properly.
To reproduce the abnormal parsing behavior of natural language queries for Chinese, you may need to first integrate a simple Cang-jie tokenizer into Tantivy. Then, use the following code to recreate the scenario:
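// `parser` is assumed to be a tantivy QueryParser over a text field that uses the Cang-jie tokenizer.
let sentence = "(Who is Obama) OR 伊文斯隐瞒秘密";
let text_query: Box<dyn Query> = parser.parse_query(sentence).unwrap();
println!("{:?}", text_query);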
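As a possible stopgap until the parser behaves as expected, the query string could be pre-tokenized before it is handed to `parse_query`, so that every Chinese token reaches the parser as its own whitespace-separated chunk. The sketch below is only an illustration under assumptions: `pre_tokenize_query` is a hypothetical helper, `tokenize` stands in for whatever Chinese tokenizer is in use (for example Cang-jie), and it assumes the parser's default operator is OR and that re-tokenizing an already split token leaves it unchanged.

```rust
/// Hypothetical helper: rewrite a query so that a chunk which tokenizes
/// into several tokens is replaced by a parenthesized, space-separated
/// group of those tokens. Operators and chunks that already contain
/// grouping, quoting, or field syntax are passed through untouched.
fn pre_tokenize_query<F>(query: &str, tokenize: F) -> String
where
    F: Fn(&str) -> Vec<String>,
{
    query
        .split_whitespace()
        .map(|chunk| {
            let is_operator = matches!(chunk, "OR" | "AND" | "NOT");
            let has_syntax = chunk.contains(|c: char| "()\":".contains(c));
            if is_operator || has_syntax {
                return chunk.to_string();
            }
            let tokens = tokenize(chunk);
            if tokens.len() > 1 {
                // Each token becomes a separate literal inside one group.
                format!("({})", tokens.join(" "))
            } else {
                chunk.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

With a tokenizer that splits "伊文斯隐瞒秘密" into 伊文, 伊文斯, 隐瞒, and 秘密, the query from this report would be rewritten to "(Who is Obama) OR (伊文 伊文斯 隐瞒 秘密)". Note that this simple whitespace split does not handle quoted phrases containing spaces, so it is a sketch rather than a complete solution.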