Carp icon indicating copy to clipboard operation
Carp copied to clipboard

Proposal: Regular expression support

Open johnwcowan opened this issue 4 years ago • 2 comments

The official justification for providing Lua patterns (which do not support alternation) rather than regular expressions is that alternation can only be supported with backtracking, which is inherently slow. This, however, is not actually the case. Thompson's algorithm, which is used by grep, lex, and awk among other tools, converts a regular expression to a NFA machine. but rather than interpreting it using backtracking, or converting it ahead of time to a DFA machine (which requires exponential time and space), it is converted on the fly into partial DFA machines that run in O(n) time. These are represented as specialized bytecode which are then cached. In addition, when the pattern is a constant, compilation can be done at compile time.

The RE2 regular expression library provides an efficient regex engine for all PCRE features except backreferences, which cannot be done without backtracking. It's written in C++, but has a variety of wrappers, notably the C wrapper, which would allow it to be used with Carp. Alternatively, it could be rewritten in Carp following Russ Cox's four-part blog post.

johnwcowan avatar Oct 20 '21 22:10 johnwcowan

@johnwcowan Thanks for the information, that is very good to know!

eriksvedang avatar Oct 21 '21 06:10 eriksvedang

From my POV, any regex implementation, whether as wrapper or natively, would be very welcome. I implemented patterns back in the day, and I think we could even switch the underlying technology without changing the API too much.

The only real reason we use patterns was that the lua lib was simple, fast, and portable. As for the backtacking using alternation: I thought that I just copied this verbatim from the Lua documentation. I’m certainly no expert in implementing regex engines (russ cox’s treatise on regexes has been on my reading list for ages), so I’d have to trust you when you say there is a better way :) This was another point for the Lua engine: I was able to understand it, top to bottom, within little more than an afternoon. Of course this point is now moot as I forgot nearly all the details about it (only the big picture remains).

I also think that relying on an external library that gets independent updates might be better than using something that is part of a bigger distribution—as we do with Lua right now—, since that increases the maintenance burden on updates.

hellerve avatar Oct 21 '21 15:10 hellerve