Idea: Allow using corpuses to help guide decisions about regex compilation

Open hyperpape opened this issue 1 year ago • 0 comments

When compiling a regex, we make several decisions that affect performance, and those decisions depend on assumptions about the texts we'll be matching against.

When a regex has a literal prefix, suffix, or infix, we have to decide which of them, if any, is profitable to search for. The same is true of offset checks.

At the moment, we make these decisions in a relatively crude way (we always check suffixes first), but we could do better by factoring in the length of the suffix vs. the prefix and how common their letters are. However, letter frequency is context-sensitive. If we're matching against English texts, 'G' is a relatively good prefix. If we're matching DNA sequences, it is probably useless, since 'G' is one of only four letters in the alphabet.

The library could expose an API that accepts a regex and a set of sample texts, and tracks statistics about those texts in order to guide these decisions.
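
To make the idea concrete, here is a rough sketch of the statistics side, written in Python for brevity rather than in the library's own language. Everything in it is hypothetical: `CorpusStats`, `observe`, `literal_cost`, and `pick_literal` are names made up for illustration, not an existing or proposed API. It assumes characters occur independently, which is a crude model, and it scores each candidate literal by how unlikely it is to appear at a random position in the sample texts.

```python
import math
from collections import Counter


class CorpusStats:
    """Hypothetical helper: tracks character frequencies over sample texts."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, text):
        # Record every character of one sample text.
        self.counts.update(text)
        self.total += len(text)

    def char_probability(self, ch):
        # Add-one smoothing so unseen characters do not get probability zero.
        distinct = len(self.counts) + 1
        return (self.counts[ch] + 1) / (self.total + distinct)

    def literal_cost(self, literal):
        # Estimated log-probability that the literal occurs at a random
        # position, assuming independent characters. Lower means rarer,
        # which means a more selective filter.
        return sum(math.log(self.char_probability(c)) for c in literal)

    def pick_literal(self, candidates):
        # Choose the candidate literal that is rarest in the observed corpus.
        return min(candidates, key=self.literal_cost)


# Candidate literals for a regex like G\w+e: prefix "G", suffix "e".
candidates = ["G", "e"]

english = CorpusStats()
english.observe("the quick brown fox jumps over the lazy dog")

dna = CorpusStats()
dna.observe("GATTACAGGGCGCGTAGGCTAGGGATCG")

english_choice = english.pick_literal(candidates)
dna_choice = dna.pick_literal(candidates)
```

With these tiny samples, the English corpus favors the prefix "G" and the DNA corpus favors the suffix "e", so the same regex would be compiled differently depending on the corpus. A real implementation would presumably also weigh the cost of each search strategy and use far larger samples.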

May 04 '24 20:05 hyperpape