Prefilter regex search with the trigram index

Open talolard opened this issue 6 years ago • 0 comments

Regular Expression Matching with a Trigram Index describes how you can use a trigram index to prefilter possible candidates and then run the regex only on them.

This is really important because we don't want to run a regex on every example in the database; that would take forever.

The thing is, I don't understand the article, particularly:

1. How do you generate all the strings that a regex implies must appear in any match?
2. How do you tractably generate all (or some) of the strings that should appear in a matching document?

Example regexes

For example, the regex

eat(ing|er)

implies that the trigram eat must appear in every match.
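For a regex with only a handful of possible matches, question 1 can be answered by brute force. The sketch below reads the example as eat(ing|er), i.e. it matches "eating" or "eater", and intersects the trigram sets of those two strings. The helper name `trigrams` and the hard-coded list of matches are assumptions for illustration, not something taken from the article.

```python
def trigrams(s):
    """Return the set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

# Every string the regex eat(ing|er) can match, written out by hand.
matches = ["eating", "eater"]

# Trigrams present in every possible match: a document lacking any of
# these cannot contain a match.
required = set.intersection(*(trigrams(m) for m in matches))

# Trigrams present in at least one possible match.
possible = set.union(*(trigrams(m) for m in matches))
```

Here `required` comes out as the single trigram eat, which matches the intuition above. This only works when the set of matching strings is small enough to list.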

The regex

\d{2}[/-]\d{2}[/-]\d{2,4}

describes common date formats and implies many possible trigrams. If we could enumerate them, we could search for any document that has at least one and then run the full regex on just those documents. But I'm not sure this is tractable, and I don't know how to generate the trigrams from the regex.
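For this particular regex the enumeration looks tractable if done by hand. Every match starts with digit, digit, separator, digit, digit, separator, digit, digit, so each 3-character window over those eight positions has only 200 possible values. The sketch below is an assumption-laden illustration: the position table is written manually rather than derived from the regex, and `may_match` scans the document's own trigrams where a real system would query the trigram index.

```python
import re
from itertools import product

DATE_RE = re.compile(r"\d{2}[/-]\d{2}[/-]\d{2,4}")
DIGITS, SEPS = "0123456789", "/-"

# Character classes for the first 8 positions of any match: dd s dd s dd
POSITIONS = [DIGITS, DIGITS, SEPS, DIGITS, DIGITS, SEPS, DIGITS, DIGITS]

def trigrams(s):
    """Return the set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def window_trigrams(start):
    """All concrete trigrams that can occupy positions start..start+2."""
    return {"".join(t) for t in product(*POSITIONS[start:start + 3])}

# One set per window; a match must contain a trigram from every set.
REQUIRED = [window_trigrams(i) for i in range(len(POSITIONS) - 2)]

def may_match(doc):
    """Necessary (not sufficient) condition for DATE_RE to match doc."""
    grams = trigrams(doc)
    return all(grams & window for window in REQUIRED)

def search(doc):
    """Run the full regex only on documents that pass the prefilter."""
    return DATE_RE.search(doc) if may_match(doc) else None
```

The six windows collapse to three distinct sets of 200 trigrams each, 600 in total. The prefilter has no false negatives but does have false positives: "11/ 1/0 /07" passes `may_match` even though the regex rejects it, which is why the full regex still runs on every candidate.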

Hacky approximate solution

One thought is to leverage the fact that a regex is a generative grammar, i.e. you can write a regex and generate strings from it. The library RandExp does this.

With that, we could generate n strings and then find the set of substrings common to all of them. This would give us a probabilistic approximation to 1, and we could probably use it as an approximation to 2, but it's hard to say how well this will work.
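A rough sketch of the sampling idea, restricted to trigrams, might look like the following. RandExp is a JavaScript library, so the two hand-written samplers here (`sample_eat` and `sample_date`) are hypothetical stand-ins for it, one per example regex, and `common_trigrams` is a name made up for this illustration.

```python
import random

def trigrams(s):
    """Return the set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def sample_eat(rng):
    """Random string matching eat(ing|er)."""
    return "eat" + rng.choice(["ing", "er"])

def sample_date(rng):
    """Random string matching \\d{2}[/-]\\d{2}[/-]\\d{2,4}."""
    def digits(k):
        return "".join(rng.choice("0123456789") for _ in range(k))
    return (digits(2) + rng.choice("/-") + digits(2)
            + rng.choice("/-") + digits(rng.randint(2, 4)))

def common_trigrams(sampler, n, seed=0):
    """Trigrams shared by all n sampled strings (n must be at least 1)."""
    rng = random.Random(seed)
    samples = [sampler(rng) for _ in range(n)]
    return set.intersection(*(trigrams(s) for s in samples))

eat_common = common_trigrams(sample_eat, 50)
date_common = common_trigrams(sample_date, 200)
```

For eat(ing|er) the intersection should settle on the trigram eat once both alternatives have been sampled. For the date regex it is almost certainly empty, because no single concrete trigram appears in every date. That suggests the intersection alone is too strict for regexes built mostly from character classes, and that some grouping of the sampled trigrams by position would be needed.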

Jul 11 '19 07:07 talolard