What is valid Tantivy Syntax?
I've consulted https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html, and it lists some valid search options, but it doesn't explain what the "syntax boundaries" are.
I don't know how to avoid `SyntaxError`s when searching with:

```python
{
    'occur': 'must',
    'normal': {
        'ctx': f'{fieldname}:{query}',
    },
}
```
- `invalid query: SyntaxError("comment:https://website.org")`
- `invalid query: SyntaxError("comment:\nwords")`
- Syntax errors when the query begins with `-`. Beginning with `+` is ok, but `+ 2` does not search records containing `+ 2`; instead it looks for records with `2`.
- I cannot escape characters like `+` or `-` with `\`.
- All non-alphanumeric characters seem to be ignored in my searches.

So stripping non-alphanumerics from the front and end of strings isn't gonna work. I don't know what to do with user input.
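For reference, here is a minimal sketch that feeds these kinds of strings to tantivy's own `QueryParser`, outside of lnx (the `comment` field and the in-RAM index are just scaffolding; the exact Ok/Err outcomes depend on the tantivy version):

```rust
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, TEXT};
use tantivy::Index;

fn main() {
    // Minimal in-RAM index with one text field, just enough to build a QueryParser.
    let mut builder = Schema::builder();
    let comment = builder.add_text_field("comment", TEXT);
    let index = Index::create_in_ram(builder.build());
    let parser = QueryParser::for_index(&index, vec![comment]);

    // Try each problematic input and print whether it parses.
    for q in ["comment:https://website.org", "comment:\nwords", "-word", "+ 2", r"\+word"] {
        match parser.parse_query(q) {
            Ok(parsed) => println!("{q:?} parsed as {parsed:?}"),
            Err(err) => println!("{q:?} -> invalid query: {err}"),
        }
    }
}
```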
More cases which help demonstrate that a solution isn't as trivial as adding `"` around user search term(s).
Fail:

- `devices usb-c adapter"`
- `"https://wiki.installgentoo.com`
- `following:`
- `>>>`
- `">>>"`
- `\>\>\>` # this gives "invalid query: AllButQueryForbidden"
- `- ">>>"` # ^ same error
- `"devices usb-c adapter" ""`
Success:

- `devices usb-c adapter" "`
- `"https://wiki.installgentoo.com"`
- `"following:"`
- `2021-08-22`
- `"devices usb-c adapter" " "`
Success but no results (when matching documents exist):

- `".\n"`
- `...`
- `\.\.\.`
- `"..."`
@trinity-1686a Didn't we have some function that escapes user terms and could be used as documentation?
To be clear, that's an LNX issue, and LNX relies on tantivy 0.18, right? The query parser was entirely rewritten over the last year and a half, with some newly added features such as better behavior around escape sequences overall, and an opt-in lenient parser which attempts to recover from invalid syntax.
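For illustration, the lenient parser looks roughly like this (a sketch, assuming a recent tantivy, 0.20+, where `QueryParser::parse_query_lenient` is available):

```rust
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, TEXT};
use tantivy::Index;

fn main() {
    let mut builder = Schema::builder();
    let comment = builder.add_text_field("comment", TEXT);
    let index = Index::create_in_ram(builder.build());
    let parser = QueryParser::for_index(&index, vec![comment]);

    // Best-effort parse: returns a usable query plus whatever syntax
    // problems it had to recover from, instead of failing outright.
    let (query, errors) = parser.parse_query_lenient("comment:\nwords");
    println!("recovered query: {query:?}");
    println!("recovered from:  {errors:?}");
}
```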
AllButQueryForbidden is an error raised further down the line, when a query has no positive component. Queries with only a negative component are more expensive than they look, so they must be requested on purpose by adding a "match all" part to the query (or better, an actual positive filter), for instance turning `-key:value` into `* -key:value`.
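A hypothetical pre-processing helper along those lines (the function name is mine, and the whitespace split is a naive textual check, not a real parse of the query):

```rust
/// Hypothetical helper: if every whitespace-separated clause in the user
/// query is negated, prepend a match-all `*` so the query keeps a positive
/// component. Quoted phrases containing spaces would confuse this naive
/// split, so treat it as a sketch only.
fn ensure_positive_component(user_query: &str) -> String {
    let only_negative = !user_query.trim().is_empty()
        && user_query.split_whitespace().all(|clause| clause.starts_with('-'));
    if only_negative {
        format!("* {}", user_query.trim())
    } else {
        user_query.to_string()
    }
}
```

With that, `ensure_positive_component("-key:value")` yields `* -key:value`, matching the transformation described above, while queries that already have a positive clause pass through untouched.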
I don't know which tokenizer LNX uses (or whether it is even customizable), but some requests won't do what you hope, depending on the tokenizer. With the default tokenizer, documents are split on word boundaries, so `abc def-ghi.` will be made searchable as `["abc", "def", "ghi"]`. Query terms also get tokenized, which means `field:"def.ghi"` would match, even though it uses a dot instead of a dash. The four "Success but no results" queries can get tokenized into nothingness by something like the `SimpleTokenizer`, as they don't contain any alphanumeric character (with the Unicode definition of alphanumeric, so Farsi letters and Chinese ideograms would be considered letters).
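To make that concrete, a small sketch running tantivy's `SimpleTokenizer` directly (assuming a recent tantivy, 0.20+, where `Tokenizer::token_stream` takes `&mut self`):

```rust
use tantivy::tokenizer::{SimpleTokenizer, TokenStream, Tokenizer};

// Collect the terms SimpleTokenizer extracts from a piece of text.
fn tokens(text: &str) -> Vec<String> {
    let mut tokenizer = SimpleTokenizer::default();
    let mut stream = tokenizer.token_stream(text);
    let mut out = Vec::new();
    while stream.advance() {
        out.push(stream.token().text.clone());
    }
    out
}

fn main() {
    println!("{:?}", tokens("abc def-ghi.")); // ["abc", "def", "ghi"]
    println!("{:?}", tokens("..."));          // [] -- nothing alphanumeric survives
}
```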