tantivy
Regex query doesn't return snippets.
RegexQuery does not implement the query_terms fn.
When a snippet generator is constructed, it stores all the possible terms in a set and checks each token from the matched doc against it.
Wouldn't it be possible to generate the snippet without searching the doc string again, and instead use the positions returned by the searcher?
Or, for now, could we make the snippet generator use a more generic matcher, so that instead of only comparing terms we could also match a regex?
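To make the "more generic matcher" idea concrete, here is a minimal sketch using only the standard library. It is not tantivy's SnippetGenerator API: the names TokenMatcher, TermSetMatcher, PredicateMatcher and highlight_ranges are hypothetical, and the tokenizer (split on non-alphanumeric characters, lowercase) is a simplification assumed for illustration.

```rust
use std::collections::HashSet;
use std::ops::Range;

/// Hypothetical abstraction: decides whether a single token counts as a match.
pub trait TokenMatcher {
    fn is_match(&self, token: &str) -> bool;
}

/// Exact term lookup, mirroring the "set of terms" approach described above.
pub struct TermSetMatcher(pub HashSet<String>);

impl TokenMatcher for TermSetMatcher {
    fn is_match(&self, token: &str) -> bool {
        self.0.contains(token)
    }
}

/// Any predicate can act as a matcher, e.g. a closure wrapping a compiled regex.
pub struct PredicateMatcher<F: Fn(&str) -> bool>(pub F);

impl<F: Fn(&str) -> bool> TokenMatcher for PredicateMatcher<F> {
    fn is_match(&self, token: &str) -> bool {
        (self.0)(token)
    }
}

/// Returns the byte ranges of all tokens in `text` accepted by `matcher`.
/// Assumed tokenization: split on non-alphanumeric chars, compare lowercased.
pub fn highlight_ranges(text: &str, matcher: &dyn TokenMatcher) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start: Option<usize> = None;
    // A trailing sentinel space flushes the last token of the text.
    let sentinel = std::iter::once((text.len(), ' '));
    for (idx, ch) in text.char_indices().chain(sentinel) {
        if ch.is_alphanumeric() {
            if start.is_none() {
                start = Some(idx);
            }
        } else if let Some(s) = start.take() {
            if matcher.is_match(&text[s..idx].to_lowercase()) {
                ranges.push(s..idx);
            }
        }
    }
    ranges
}
```

With this shape, the existing term-set behaviour and a regex or substring check are just two implementations of the same trait, so the highlighting loop itself would not need to change.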
Everything you said is correct. Not many users use both RegexQuery and snippets, so if this is blocking you, I recommend picking up this ticket yourself.
I am interested in helping add this. Do I simply need to overload query_terms? If so, I believe I'd need to extract the term from the regex, which I don't see a good way of doing?
@danielhstahl It is a tad trickier than this, because query_terms would have to be a function of the segment_reader and the doc id... And it would be more efficient to do it over several documents at a time...
@danielhstahl @noelzubin Considering the added complexity, I need to know if you have an actual business use case?
I need to do partial word matching while still doing compound searches. The QueryParser only supports TermQueries for single words, so I created a custom parser that uses RegexQuery rather than TermQuery. With the original QueryParser I was also extracting snippets. Obviously, with RegexQuery I am not able to extract snippets :).
Would it be more economical to have a specialized PartialTermQuery which only supports matching substrings of terms without the full generality of regular expressions (but possibly using AutomatonWeight internally)?
That would work for my use case, yes
Where is the actual matching done for the TermQuery currently? I'm looking through the code base but don't see where this is implemented.
Not an expert on the code base, but I think you are looking for https://github.com/quickwit-oss/tantivy/blob/main/src/query/term_query/term_weight.rs#L116.
However, I think TermQuery is too specialized for implementing PartialTermQuery, and I would therefore suggest basing it on AutomatonWeight as used by RegexQuery, i.e. basically querying for .*<term>.* while making use of the knowledge that <term> was searched for, which is exactly what cannot be derived from a general regular expression.
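A rough sketch of that idea, using only the standard library: PartialTerm, escape_regex and match_range are hypothetical names, not part of tantivy, and the escaped character set is an assumption about which characters a regex engine treats specially. The point is only that the query keeps the literal term around, so it can both produce a .*<term>.* pattern and locate the match inside a token for highlighting.

```rust
use std::ops::Range;

/// Escapes common regex metacharacters so that `term` is matched literally.
pub fn escape_regex(term: &str) -> String {
    let mut out = String::with_capacity(term.len());
    for ch in term.chars() {
        if matches!(
            ch,
            '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{' | '}' | '^' | '$'
        ) {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Hypothetical stand-in for the proposed `PartialTermQuery`.
pub struct PartialTerm {
    term: String,
}

impl PartialTerm {
    pub fn new(term: &str) -> Self {
        PartialTerm { term: term.to_string() }
    }

    /// Pattern that could be handed to a regex- or automaton-based query.
    pub fn pattern(&self) -> String {
        format!(".*{}.*", escape_regex(&self.term))
    }

    /// Because the literal is known, the matching byte range inside a token
    /// can be found directly, without re-running a regular expression.
    pub fn match_range(&self, token: &str) -> Option<Range<usize>> {
        token
            .find(self.term.as_str())
            .map(|start| start..start + self.term.len())
    }
}
```

For example, PartialTerm::new("ant") yields the pattern .*ant.* and reports bytes 1..4 as the matching part of the token "tantivy", which is the information a snippet generator would need.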