tantivy

RegexPhraseQuery - multi-term regex match

Open flow3d opened this issue 4 months ago • 4 comments

What's the problem? I'm trying to allow a wildcard in the middle of a phrase (it's really a single word, but I'm tokenizing that word). For example, I'd love to search for "the great bad wolf" and find it by using "the g*d wolf".

Describe the solution you'd like Expand RegexPhraseQuery to consider multi-term matches (as an option)

[Optional] describe alternatives you've considered

  1. Using the existing interface. I could theoretically split this into 3 regex phrase queries: the g*~0 AND the g* wolf~4 AND *d wolf~0. The third expression would slow down the entire query, since it has * as a prefix.

  2. Using slop. It doesn't give the same control and would return too many unwanted results.

flow3d avatar Oct 04 '25 21:10 flow3d

the g* *d wolf would match the great bad wolf (but not the great abc bad wolf)

PSeitz avatar Oct 05 '25 21:10 PSeitz

the g* *d wolf would match the great bad wolf (but not the great abc bad wolf)

That's why I didn't offer that as an option :) It also wouldn't catch the good wolf, which you would expect to be found.
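To make the difference concrete, here is a minimal sketch of per-token wildcard phrase matching, where each pattern token must match exactly one document token. It uses only the Rust standard library and whitespace tokenization; `token_matches` and `phrase_matches` are hypothetical helpers for illustration, not tantivy APIs, and `*` is treated as a simple glob rather than a full regex.

```rust
/// Matches one token against a pattern in which `*` stands for any
/// (possibly empty) run of characters. Hypothetical helper, not tantivy API.
fn token_matches(pattern: &str, token: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = token.chars().collect();
    let (mut pi, mut ti) = (0usize, 0usize);
    // Remembers the last `*` seen and how much of the token it has absorbed.
    let mut star: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Let the last `*` absorb one more character and retry.
            pi = sp + 1;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    // Any trailing `*` in the pattern can match the empty string.
    p[pi..].iter().all(|&c| c == '*')
}

/// True if some window of consecutive tokens in `text` matches the pattern
/// tokens one-to-one (no slop, one pattern token per document token).
fn phrase_matches(pattern: &str, text: &str) -> bool {
    let pats: Vec<&str> = pattern.split_whitespace().collect();
    let toks: Vec<&str> = text.split_whitespace().collect();
    if pats.is_empty() || pats.len() > toks.len() {
        return false;
    }
    toks.windows(pats.len())
        .any(|w| w.iter().zip(&pats).all(|(t, p)| token_matches(p, t)))
}
```

Under these assumptions, `phrase_matches("the g* *d wolf", "the great bad wolf")` is true, while both "the great abc bad wolf" and "the good wolf" are rejected: the first has an extra token, and the second has only one token where the pattern demands two.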

Anyway, if the idea is accepted I'd love to add support for it myself (with some pointers in the right direction on what to look into).

flow3d avatar Oct 12 '25 07:10 flow3d

@PSeitz / @cjrh can you point me to where I should be looking to support this? I'd love to take a stab at this :) Then we can continue the discussion over a PR.

flow3d avatar Oct 21 '25 16:10 flow3d

The main difference is that you don't know how many tokens a regex may span when it includes a star. That would affect fetching potential terms: when the * is in the middle of a term, you would need to split the term at the *. See https://github.com/quickwit-oss/tantivy/blob/main/src/query/phrase_query/regex_phrase_weight.rs
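One possible reading of "split at *" is sketched below, under loud assumptions: the term contains exactly one inner `*`, the part before it becomes a prefix constraint on the first token, and the part after it becomes a suffix constraint on the last token. The term dictionary is stood in for by a plain slice of strings; `split_at_inner_star` and `candidate_terms` are hypothetical names, not functions from `regex_phrase_weight.rs`.

```rust
/// Splits a term such as "g*d" at its single inner `*` into ("g", "d").
/// Returns None if there is no `*`, if the `*` is leading or trailing,
/// or if there is more than one `*` (not handled in this sketch).
fn split_at_inner_star(term: &str) -> Option<(&str, &str)> {
    let (prefix, suffix) = term.split_once('*')?;
    if prefix.is_empty() || suffix.is_empty() || suffix.contains('*') {
        return None;
    }
    Some((prefix, suffix))
}

/// Collects candidate first tokens (matching `prefix*`) and candidate last
/// tokens (matching `*suffix`) from a stand-in term dictionary.
fn candidate_terms<'a>(
    dict: &[&'a str],
    term: &str,
) -> Option<(Vec<&'a str>, Vec<&'a str>)> {
    let (prefix, suffix) = split_at_inner_star(term)?;
    let firsts = dict.iter().copied().filter(|t| t.starts_with(prefix)).collect();
    // The suffix side cannot use a sorted prefix range, so it scans every term.
    let lasts = dict.iter().copied().filter(|t| t.ends_with(suffix)).collect();
    Some((firsts, lasts))
}
```

With the dictionary `["bad", "good", "great", "the", "wolf"]` and the term `g*d`, the first-token candidates are "good" and "great" and the last-token candidates are "bad" and "good". The suffix lookup is the expensive half, which is the same cost concern raised earlier about patterns that have `*` as a prefix.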

It would also affect the algorithm that intersects positions, which would need some mechanism to allow large gaps between positions in the phrase and still report a hit: https://github.com/quickwit-oss/tantivy/blob/main/src/query/phrase_query/phrase_scorer.rs
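As a rough illustration of gap-tolerant position intersection, the sketch below pairs positions of tokens matching the first half of a split term with positions of tokens matching the second half, allowing up to `max_gap` tokens in between. It is a standalone two-pointer pass over sorted position lists, not the algorithm in `phrase_scorer.rs`; `intersect_with_gap` and `max_gap` are hypothetical names.

```rust
/// Returns (start, end) position pairs where `end` lies after `start`
/// with at most `max_gap` tokens between them.
/// Both slices must be sorted in ascending order.
fn intersect_with_gap(starts: &[u32], ends: &[u32], max_gap: u32) -> Vec<(u32, u32)> {
    let mut hits = Vec::new();
    let mut lo = 0usize;
    for &s in starts {
        // Skip end positions that are not strictly after `s`. Because
        // `starts` is ascending, `lo` never needs to move backwards.
        while lo < ends.len() && ends[lo] <= s {
            lo += 1;
        }
        // Collect every end position inside the allowed window.
        let mut j = lo;
        while j < ends.len() && ends[j] - s - 1 <= max_gap {
            hits.push((s, ends[j]));
            j += 1;
        }
    }
    hits
}
```

For "the great bad wolf", a token matching `g*` sits at position 1 and a token matching `*d` at position 2, so `intersect_with_gap(&[1], &[2], 0)` reports the pair (1, 2). For "the great abc bad wolf" the positions are 1 and 3, which is a hit only when `max_gap` is at least 1. A full scorer would still need to check the surrounding terms ("the" before the start, "wolf" after the end) and handle the single-token case such as "good" separately.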

Making the regex generally span multiple tokens, not just *, would be quite hard I think.

PSeitz-dd avatar Oct 22 '25 08:10 PSeitz-dd