couchdb icon indicating copy to clipboard operation
couchdb copied to clipboard

Inefficient `$regex` Mango `view` queries on indexed fields

Open pgj opened this issue 2 years ago • 1 comments

On the Mango query interface, when the $regex operator is used in a selector with a field that could be backed up by a view (json) index, it is not leveraged. The use of $regex in its current form rather implies a brute-force scan through all the documents of the underlying view, which makes the whole approach inefficient and resource-intensive.

The source of the problem is that no index ranges could be computed for $regex (to narrow down the scan) because regular expressions do not naturally generate an ordering on the set of strings, like for example comparisons do on integers. However, there is a prior art in for example, Lucene where it is possible to achieve that for specialized cases, such as "starts with"-type of expressions, i.e. ^foo.*.

Some more links about how Lucene does it, just for inspiration:

  • https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/RegexpQuery.html
  • https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/AutomatonQuery.html
  • https://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html
  • https://www.slideshare.net/otisg/finite-state-queries-in-lucene

Another factor that can slow down the evaluation is that documents are included unconditionally, and $regex is not matched against the key itself first and only documents for the matched keys are fetched and returned. Something along the lines of covering indexes.

pgj avatar Sep 25 '23 16:09 pgj

it would be nice (and I think fairly easy) to at least narrow the query with startkey/endkey if the regex does not start with a wildcard. Parsing the regex to determine that sounds like the tricky part.

rnewson avatar Sep 25 '23 20:09 rnewson