RegexPhraseQuery - multi-term regex match
What's the problem?
I'm trying to allow a wildcard in the middle of a phrase (it's a word, but I'm tokenizing that word). For example, I'd love to find "the great bad wolf" by searching for "the g*d wolf".
Describe the solution you'd like
Expand RegexPhraseQuery to consider multi-term matches (as an option).
[Optional] describe alternatives you've considered
- Using the existing interface. I could theoretically split this into 3 regex phrase queries:
  the g*~0 AND the g* wolf~4 AND *d wolf~0
  The 3rd expression above would slow down the entire query, having * as a prefix.
- Using slop - doesn't give the same control. It'll give too many unwanted results.
the g* *d wolf would match the great bad wolf (but not the great abc bad wolf)
That's why I didn't offer that as an option :)
It also wouldn't catch the good wolf, which you would expect to be found.
Anyway, if the idea is accepted, I'd love to add support for it myself (with some pointers in the right direction on what to look into).
@PSeitz / @cjrh can you point me to where I should be looking to support this? I'd love to take a stab at it :) Then we can continue the discussion over a PR.
The main difference is that you don't know how many tokens a regex may span when it includes a star. That would affect fetching potential terms, where you would need to split the pattern at the * when it is in the middle of a term: https://github.com/quickwit-oss/tantivy/blob/main/src/query/phrase_query/regex_phrase_weight.rs
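To make the "split at *" idea concrete, here is a minimal sketch in plain Rust. `split_at_inner_star` is a hypothetical helper, not something that exists in tantivy, and it assumes glob-style terms like the ones in this issue (it only looks at the first star):

```rust
/// Hypothetical helper (not a tantivy API): if `term` has a `*` strictly
/// inside it, split it into a prefix pattern and a suffix pattern,
/// e.g. "g*d" becomes ("g*", "*d"). Only the first `*` is considered.
fn split_at_inner_star(term: &str) -> Option<(String, String)> {
    let idx = term.find('*')?;
    // A leading or trailing star is not "in the middle", so leave it alone.
    if idx == 0 || idx == term.len() - 1 {
        return None;
    }
    // `*` is a single ASCII byte, so `idx` is always a valid char boundary.
    let (head, tail) = term.split_at(idx);
    // `head` is the text before the star, `tail` starts with the star.
    Some((format!("{head}*"), tail.to_string()))
}
```

With this, "g*d" would yield the two patterns "g*" and "*d", each of which could be used to fetch candidate terms separately, while "g*", "*d" and "wolf" are left untouched.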
It would also affect the algorithm that intersects positions, which would need some mechanism to allow large gaps between the positions in the phrase in order to report a hit: https://github.com/quickwit-oss/tantivy/blob/main/src/query/phrase_query/phrase_scorer.rs
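A rough sketch of what gap-tolerant position intersection could look like, in plain Rust over per-document position lists. This is not how tantivy's phrase scorer is written; `Slot`, `min_step`, `max_step` and `phrase_matches` are all names made up for illustration. The assumption is that a term split at an inner * produces two slots, and that the second one may land on the same token (step 0, e.g. "good") or a bounded number of tokens later:

```rust
/// One slot of the phrase, for a single document (hypothetical type).
/// `positions` holds the sorted token positions where any term matching this
/// slot occurs. `min_step..=max_step` is the allowed distance from the
/// position matched by the previous slot (ignored for the first slot).
struct Slot {
    positions: Vec<u32>,
    min_step: u32,
    max_step: u32,
}

/// Returns true if one position can be picked per slot so that every
/// consecutive pair of picks respects the later slot's step bounds.
fn phrase_matches(slots: &[Slot]) -> bool {
    let Some((first, rest)) = slots.split_first() else {
        return false;
    };
    // Positions at which a partial match of the phrase so far can end.
    let mut frontier: Vec<u32> = first.positions.clone();
    for slot in rest {
        // Keep only positions reachable from some position in the frontier.
        let next: Vec<u32> = slot
            .positions
            .iter()
            .copied()
            .filter(|&p| {
                frontier
                    .iter()
                    .any(|&q| p >= q && p - q >= slot.min_step && p - q <= slot.max_step)
            })
            .collect();
        if next.is_empty() {
            return false;
        }
        frontier = next;
    }
    !frontier.is_empty()
}
```

For "the g*d wolf", the slots would be the, g*, *d, wolf, with the *d slot using `min_step: 0` so that a single token such as "good" can satisfy both halves. Raising `max_step` on that slot is what lets the star span more tokens. The naive nested scan is quadratic in the position counts; a real implementation would more likely walk the sorted lists with two pointers.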
Making the regex generally span multiple tokens, not just *, would be quite hard I think.