What is valid Tantivy Syntax?

Open sky-cake opened this issue 6 months ago • 1 comments

I've consulted https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html, and it lists some valid search options, but it doesn't explain what the "syntax boundaries" are.

I don't know how to avoid SyntaxErrors when searching with,

{
    'occur': 'must',
    'normal': {
        'ctx': f'{fieldname}:{query}',
    },
}
  1. invalid query: SyntaxError("comment:https://website.org")
  2. invalid query: SyntaxError("comment:\nwords")
  3. Syntax errors when the query begins with -. Beginning with + is ok.
  4. + 2 does not search for records containing + 2. Instead it looks for records with 2.
  5. I cannot escape characters like + or - with \.
  6. All non-alphanumeric characters seem to be ignored in my searches.

So stripping non-alphanumerics from the front and end of strings isn't gonna work. I don't know what to do with user input.

sky-cake avatar May 16 '25 03:05 sky-cake

More cases which help demonstrate how a solution isn't as trivial as adding " around user search term(s).

Fail:

  1. devices usb-c adapter""
  2. https://wiki.installgentoo.com
  3. following:
  4. >>>
  5. ">>>"
  6. \>\>\> # this gives "invalid query: AllButQueryForbidden"
  7. ">>>" # ^
  8. "devices usb-c adapter" ""

Success:

  1. devices usb-c adapter" "
  2. "https://wiki.installgentoo.com"
  3. "following:"
  4. 2021-08-22
  5. "devices usb-c adapter" " "

Success but no results (when matching documents exist):

  1. ".\n"
  2. ...
  3. \.\.\.
  4. "..."
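
Going only by the cases listed above, one possible pre-processing step is to drop embedded double quotes and backslashes, collapse whitespace, reject input with no alphanumeric characters, and wrap the rest in a single pair of quotes. This is a minimal sketch, not a tantivy or LNX API: `quote_user_term` is a hypothetical helper, and treating backslashes as unsafe is an assumption based on the escaping problems described earlier.

```python
from typing import Optional


def quote_user_term(text: str) -> Optional[str]:
    """Hypothetical helper: turn raw user input into one quoted phrase.

    Returns None when nothing searchable is left (e.g. ">>>" or "...").
    """
    # Assumption: embedded quotes and backslashes are what break the phrase,
    # so replace them with spaces instead of trying to escape them.
    cleaned = text.replace('"', " ").replace("\\", " ")
    # Collapse newlines and repeated spaces into single spaces.
    cleaned = " ".join(cleaned.split())
    # Input without any alphanumeric character is likely to tokenize to nothing.
    if not any(ch.isalnum() for ch in cleaned):
        return None
    return f'"{cleaned}"'


if __name__ == "__main__":
    for raw in ['devices usb-c adapter""', "https://wiki.installgentoo.com", ">>>"]:
        print(repr(raw), "->", repr(quote_user_term(raw)))
```

The caller still has to decide what to do with `None` (for example, skip the search), and whether phrase matching is acceptable for every query.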

sky-cake avatar May 16 '25 04:05 sky-cake

@trinity-1686a Didn't we have some function that escapes user terms and could be used as documentation?

PSeitz avatar Jul 16 '25 11:07 PSeitz

To be clear, that's an LNX issue, and LNX relies on tantivy 0.18, right? The query parser was entirely rewritten over the last year and a half, with newly added features such as better behavior around escape sequences overall and an opt-in lenient parser that attempts to recover from invalid syntax.

AllButQueryForbidden is an error raised at a later stage, when a query has no positive component. Queries with only a negative component are more expensive than they look, so they must be done on purpose by adding a "match all" part to the query (or better, an actual positive filter), for instance turning -key:value into * -key:value.
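
The `-key:value` to `* -key:value` rewrite can be sketched as a string pre-processing step on the client side. This is only an illustration under stated assumptions: `ensure_positive_clause` is a hypothetical helper, not part of tantivy or LNX, and it only looks at top-level clauses separated by whitespace (double-quoted phrases are kept together; parentheses and boolean keywords are not handled).

```python
import re

# A clause is a run of non-space characters; a double-quoted phrase counts as
# part of the clause even when it contains spaces.
_CLAUSE = re.compile(r'(?:"[^"]*"|\S)+')


def ensure_positive_clause(query: str) -> str:
    """Prepend a match-all `*` when every clause in the query is negated."""
    clauses = _CLAUSE.findall(query)
    if clauses and all(clause.startswith("-") for clause in clauses):
        return "* " + query.strip()
    # At least one positive clause (or an empty query): leave it untouched.
    return query


if __name__ == "__main__":
    print(ensure_positive_clause("-key:value"))
    print(ensure_positive_clause("usb -adapter"))
```

A real positive filter is still preferable where one exists, since the match-all variant has to consider every document.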

I don't know which tokenizer LNX uses (or if it even is customizable), but some requests won't do what you hope depending on the tokenizer. With the default tokenizer, documents are split on word boundaries, so abc def-ghi. will be made searchable as ["abc","def","ghi"]. Query terms also get tokenized, which means field:"def.ghi" would match, even though it uses a dot instead of a dash. The four "Success but no results" cases can get tokenized into nothingness by something like the SimpleTokenizer, as they don't contain any alphanumeric character (using the Unicode definition of alphanumeric, so Farsi letters and Chinese ideograms would be considered letters).
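
To see why those inputs return nothing, the splitting behavior can be approximated outside tantivy. The sketch below is not tantivy's tokenizer: `approx_simple_tokenize` is a hypothetical stand-in that splits on every non-alphanumeric character using Python's `str.isalnum`, whose Unicode rules are close to, but not guaranteed identical to, the ones tantivy uses. Lowercasing the tokens is an extra assumption.

```python
from typing import List


def approx_simple_tokenize(text: str) -> List[str]:
    """Approximate word-boundary tokenization: keep runs of alphanumerics."""
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            # A non-alphanumeric character ends the current token.
            tokens.append("".join(current).lower())
            current = []
    if current:
        tokens.append("".join(current).lower())
    return tokens


if __name__ == "__main__":
    print(approx_simple_tokenize("abc def-ghi."))  # document text
    print(approx_simple_tokenize("def.ghi"))       # query phrase
    print(approx_simple_tokenize("..."))           # nothing left to search
```

Running user input through a check like this before sending it gives a cheap way to detect queries that would end up with no terms at all.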

trinity-1686a avatar Jul 16 '25 13:07 trinity-1686a