tantivy icon indicating copy to clipboard operation
tantivy copied to clipboard

add RegexPhraseQuery

Open PSeitz opened this issue 1 year ago • 0 comments

RegexPhraseQuery supports phrase queries with regex. It supports regex and wildcards. E.g. a query with wildcards: "b* b* wolf" matches "big bad wolf" Slop is supported as well: "b* wolf"~2 matches "big bad wolf"

Regex queries may match a lot of terms where we still need to keep track which term hit to load the positions. The phrase query algorithm groups terms by their frequency together in the union to prefilter groups early.

This PR comes with some new datastructures:

SimpleUnion - A union docset for a list of docsets. It doesn't do any caching and is therefore well suited for datasets with lots of skipping. (phrase search, but intersections in general)

LoadedPostings - Like SegmentPostings, but all docs and positions are loaded in memory. SegmentPostings uses 1840 bytes per instance with its caches, which is equivalent to 460 docids. LoadedPostings is used for terms which have less than 100 docs. LoadedPostings is only used to reduce memory consumption.

BitSetPostingUnion - Creates a Posting that uses the bitset for docid hits and the docsets for positions. The BitSet is the precalculated union of the docsets In the RegexPhraseQuery there is a size limit of 512 docsets per PreAggregatedUnion, before creating a new one.

Renamed Union to BufferedUnionScorer to better differentiate from other unions. Added proptests to test different union types.

Comparison With Lucene

Below is a benchmark from the search benchmark game on the wikipedia dataset. Lucene does not seem to support Regex Phrase Queries with Slop, so it's not part of the benchmark.

Interestingly the slower "grad* ma*ent admission test" is due to the regex evaluation on the FST.

regex_perf

PSeitz avatar Oct 10 '24 09:10 PSeitz