Prefilter regex search with the trigram index
The article "Regular Expression Matching with a Trigram Index" describes how to use a trigram index to prefilter candidate documents and then run the regex only on those candidates.
This matters because running a regex against every example in the database would take far too long.
The thing is, I don't understand the article, in particular:
- How do you generate all the strings that the regex implies must appear in any match?
- How do you tractably generate all (or some) of the strings that should be in a matching document?
Example regexes
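For example, the regex
eat(ing|er)
implies that the trigram eat must appear in every match. (Written as eat(ing)|(er), the alternation would bind last and the pattern would also match a bare "er", so eat would no longer be required.)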
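To make the prefilter idea concrete, here is a minimal sketch of a trigram index used as a candidate filter. It is not the article's implementation. The function names (`trigrams`, `build_index`, `search`) and the sample documents are made up for illustration, and the pattern is assumed to be `eat(ing|er)` with the required trigram `eat` supplied by hand.

```python
import re
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for tri in trigrams(text):
            index[tri].add(doc_id)
    return index

def search(docs, index, pattern, required):
    """Run the regex only on documents that contain every required trigram."""
    candidates = set(range(len(docs)))
    for tri in required:
        candidates &= index.get(tri, set())
    regex = re.compile(pattern)
    return [i for i in sorted(candidates) if regex.search(docs[i])]

# Hypothetical sample data
docs = ["she was eating lunch", "the eater left", "nothing here", "a great day"]
index = build_index(docs)
hits = search(docs, index, r"eat(ing|er)", {"eat"})
```

Here "a great day" passes the trigram filter because it contains eat, but the full regex then rejects it. The index only needs to avoid false negatives; the regex pass removes the false positives.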
The regex
\d{2}[/-]\d{2}[/-]\d{2,4}
describes common date formats and implies many possible trigrams. If we could enumerate them, we could search for any document that contains at least one of them and then run the full regex on it. But I'm not sure this is tractable, and I don't know how to generate the trigrams from the regex.
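For this particular pattern the enumeration can be done by hand, which at least suggests the size of the problem. The sketch below is not a general regex-to-trigram algorithm. It assumes `\d` means the ASCII digits 0-9, and the names `leading`, `middle`, `trailing`, and `may_match` are made up for illustration. Every match begins with digit, digit, separator, digit, digit, so it must contain at least one trigram from each of the three groups below.

```python
from itertools import product

DIGITS = "0123456789"
SEPARATORS = "/-"

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

# digit, digit, separator: e.g. "12/"
leading = {a + b + s for a, b, s in product(DIGITS, DIGITS, SEPARATORS)}
# digit, separator, digit: e.g. "2/0"
middle = {a + s + b for a, s, b in product(DIGITS, SEPARATORS, DIGITS)}
# separator, digit, digit: e.g. "/03"
trailing = {s + a + b for s, a, b in product(SEPARATORS, DIGITS, DIGITS)}

def may_match(doc_trigrams):
    """True if the document has at least one trigram from every group.

    This is an AND of three ORs. It can return false positives,
    but it never rejects a document that contains a real match.
    """
    return all(doc_trigrams & group for group in (leading, middle, trailing))

print(len(leading), len(middle), len(trailing))
print(may_match(trigrams("due 12/03/1999")))
print(may_match(trigrams("no dates here")))
```

Each group has 10 × 10 × 2 = 200 trigrams, so for this regex the query is 600 index lookups combined with AND and OR, which looks tractable. Whether that holds for arbitrary regexes is a separate question.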
Hacky approximate solution
One thought is to leverage the fact that a regex is a generative grammar, i.e. you can write a regex and generate strings from it. The library RandExp does this.
With that, we could generate n strings and then find the set of substrings that are common to all of them. This would give a probabilistic approximation to the first question above, and we could probably use it as an approximation to the second, but it's hard to say how well this will work. One risk is that, with a finite sample, a substring can be common to all generated strings without actually being required by the regex, in which case the prefilter would wrongly discard real matches.
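The sampling idea can be sketched as follows. This is only an illustration under assumptions: RandExp is a JavaScript library, so the two sampler functions here (`sample_eat`, `sample_date`) are hand-written stand-ins for a regex-driven generator, not its API, and "substrings" is narrowed to trigrams.

```python
import random

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def common_trigrams(sampler, n, rng):
    """Intersect the trigram sets of n strings drawn from sampler."""
    common = None
    for _ in range(n):
        tris = trigrams(sampler(rng))
        common = tris if common is None else common & tris
    return common if common is not None else set()

def sample_eat(rng):
    # Hypothetical stand-in for generating matches of eat(ing|er)
    return "eat" + rng.choice(["ing", "er"])

def sample_date(rng):
    # Hypothetical stand-in for generating matches of \d{2}[/-]\d{2}[/-]\d{2,4}
    def digits(k):
        return "".join(rng.choice("0123456789") for _ in range(k))
    return (digits(2) + rng.choice("/-") + digits(2)
            + rng.choice("/-") + digits(rng.randint(2, 4)))

rng = random.Random(0)  # fixed seed so runs are repeatable
eat_common = common_trigrams(sample_eat, 100, rng)
date_common = common_trigrams(sample_date, 100, rng)
print(eat_common, date_common)
```

For the eat pattern the intersection should settle on the trigram eat once both alternatives have been sampled. For the date pattern no single trigram is shared by all matches, so the intersection comes out empty. Intersecting samples can only discover "must contain X" requirements, not "must contain one of X, Y, Z" requirements, which is what the date regex would need.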