
Regex query doesn't return snippets.

Open noelzubin opened this issue 6 years ago • 8 comments

The regex query does not implement the query_terms fn. When a snippet generator is constructed, it stores all the possible terms in a set and checks the tokens from the matched doc against it.

Wouldn't it be possible to generate the snippet without searching the doc string again, and instead use the positions returned by the searcher?

Or, for now, can we make the snippet generator use a more generic matcher to check matches, so that instead of only comparing terms we could also match a regex?
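A minimal sketch of what such a generic matcher could look like, written against the standard library only. Everything here (`TokenMatcher`, `TermSetMatcher`, `PredicateMatcher`, `highlight_ranges`) is a hypothetical name for illustration and not part of tantivy's API. It also assumes plain whitespace tokenization rather than a real tokenizer.

```rust
use std::collections::HashSet;
use std::ops::Range;

/// Hypothetical abstraction: decides whether a token should be highlighted.
pub trait TokenMatcher {
    fn is_match(&self, token: &str) -> bool;
}

/// Mirrors the behaviour described above: exact lookup in a set of terms.
pub struct TermSetMatcher {
    terms: HashSet<String>,
}

impl TermSetMatcher {
    pub fn new<I, S>(terms: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Self {
            terms: terms.into_iter().map(Into::into).collect(),
        }
    }
}

impl TokenMatcher for TermSetMatcher {
    fn is_match(&self, token: &str) -> bool {
        self.terms.contains(token)
    }
}

/// Wraps any predicate; a compiled regex's match function could be plugged in here.
pub struct PredicateMatcher<F: Fn(&str) -> bool>(pub F);

impl<F: Fn(&str) -> bool> TokenMatcher for PredicateMatcher<F> {
    fn is_match(&self, token: &str) -> bool {
        (self.0)(token)
    }
}

/// Returns the byte ranges of whitespace-separated tokens accepted by `matcher`.
pub fn highlight_ranges(text: &str, matcher: &dyn TokenMatcher) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start: Option<usize> = None;
    for (idx, ch) in text.char_indices() {
        if ch.is_whitespace() {
            // A token just ended; test it and record its range on a match.
            if let Some(s) = start.take() {
                if matcher.is_match(&text[s..idx]) {
                    ranges.push(s..idx);
                }
            }
        } else if start.is_none() {
            start = Some(idx);
        }
    }
    // Handle a trailing token that is not followed by whitespace.
    if let Some(s) = start {
        if matcher.is_match(&text[s..]) {
            ranges.push(s..text.len());
        }
    }
    ranges
}
```

With this shape, the term-set path and a regex path would share the same highlighting loop; only the matcher passed to `highlight_ranges` changes.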

noelzubin avatar Oct 23 '19 13:10 noelzubin

Everything you said is correct. Not many users combine RegexQuery and Snippet, so if this is blocking you, I recommend picking up this ticket yourself.

fulmicoton avatar Oct 24 '19 01:10 fulmicoton

I am interested in helping add this. Do I simply need to implement query_terms? If so, I believe I'd need to extract the term from the regex, and I don't see a good way of doing that.

danielhstahl avatar Aug 14 '22 18:08 danielhstahl

@danielhstahl It is a tad trickier than this, because query_terms would have to be a function of the segment_reader and the doc id... And it would be more efficient to do it over several documents at a time...

@danielhstahl @noelzubin Considering the added complexity, I need to know if you have an actual business use case?

fulmicoton avatar Aug 18 '22 16:08 fulmicoton

I need to do partial word match while still doing compound searches. The "QueryParser" only supports TermQueries for single words, so I created a custom parser that uses Regex rather than TermQueries. With the original QueryParser I was also extracting snippets. Obviously with Regex I am not able to extract snippets :).

danielhstahl avatar Aug 18 '22 16:08 danielhstahl

Would it be more economical to have a specialized PartialTermQuery which only supports matching substrings of terms without the full generality of regular expressions (but possibly using AutomatonWeight internally)?

adamreichold avatar Aug 18 '22 16:08 adamreichold

Would it be more economical to have a specialized PartialTermQuery which only supports matching substrings of terms without the full generality of regular expressions (but possibly using AutomatonWeight internally)?

That would work for my use case, yes

danielhstahl avatar Aug 18 '22 16:08 danielhstahl

Would it be more economical to have a specialized PartialTermQuery which only supports matching substrings of terms without the full generality of regular expressions (but possibly using AutomatonWeight internally)?

Where is the actual matching done for the TermQuery currently? I'm looking through the code base but don't see where this is implemented.

danielhstahl avatar Aug 22 '22 14:08 danielhstahl

Where is the actual matching done for the TermQuery currently? I'm looking through the code base but don't see where this is implemented.

Not an expert on the code base, but I think you are looking for https://github.com/quickwit-oss/tantivy/blob/main/src/query/term_query/term_weight.rs#L116.

However, I think TermQuery is too specialized for implementing PartialTermQuery and would therefore suggest basing it on AutomatonWeight as used by RegexQuery, i.e. basically querying for .*<term>.* but making use of the knowledge that <term> was searched which is what cannot be derived from a general regular expression.
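To illustrate the idea, here is a rough standalone sketch that does not use tantivy at all. `TermDict`, `expand_partial_term` and `partial_match_offsets` are hypothetical names, and the sorted set is only a stand-in for a segment's term dictionary. A real implementation would presumably walk the dictionary with an automaton rather than a linear scan.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a term dictionary: a sorted set of indexed terms.
pub type TermDict = BTreeSet<String>;

/// Expands a partial term to every indexed term containing it,
/// i.e. the terms that the pattern `.*<partial>.*` would accept.
pub fn expand_partial_term(dict: &TermDict, partial: &str) -> Vec<String> {
    dict.iter()
        .filter(|term| term.contains(partial))
        .cloned()
        .collect()
}

/// Because the searched substring is known, its occurrences in the stored
/// text can be located directly for highlighting (byte offsets, end exclusive).
pub fn partial_match_offsets(text: &str, partial: &str) -> Vec<(usize, usize)> {
    if partial.is_empty() {
        // An empty needle would match everywhere; treat it as no match.
        return Vec::new();
    }
    text.match_indices(partial)
        .map(|(start, matched)| (start, start + matched.len()))
        .collect()
}
```

The second function is the part a general regex cannot offer: since the searched text is known up front, the highlight positions fall out of a plain substring search.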

adamreichold avatar Aug 22 '22 15:08 adamreichold