byexample Use a Thompson NFA instead of the Python default re (regex) engine if it is possible

Use a Thompson NFA instead of the Python default re (regex) engine if it is possible

Open eldipa opened this issue 7 years ago • 4 comments

As most of the languages, interpreters, libs, and most of anything out there, Python uses a classical engine that can have an exponential performance in the worst case (known as 'pathological cases').

A non-deterministic finite automata (NFA) can be used to represent any regular expression (without backrefereces). The use of this to implement a regular expression engine is known as the Thompson NFA algorithm.

It is used in the well known grep and awk implementations and it has a linear performance.

The idea is to replace the implementation of re.compile and the subsequent calls to the re module and objects by a NFA implementation. However, as said before, the NFA will be possible only if the regex has no backreferences. Byexample uses the backreferences when a given captured tag with label foo appears twice or more times in the expected string. In those cases the byexample must fall back to the re module.

The Thompson NFA must not introduce any new mandatory dependency but it can require an optional one.

Feb 13 '18 23:02 eldipa

Interesting reading: https://swtch.com/~rsc/regexp/regexp1.html

Feb 13 '18 23:02 eldipa

I couldn't find a 3rd party lib for implementing that:

is faster or at least have the same performance than the re module
supports the same regex syntax that re supports

Feb 26 '18 12:02 eldipa

Can you offer https://xysun.github.io/posts/regex-parsing-thompsons-algorithm.html to join this discussion? 😄

Jul 19 '18 00:07 jilljenn

It's good to see that the spirit of Thompson is still alive!

Yes, indeed, if a NFA is needed I will ask him to give me a hand on this.

For now, it is not super urgent: byexample only uses some variants of the .* regex and with this, I implemented a quite simple linear algorithm which is fair enough.

But if in the future we want to add more complex regexs like \d{2,5} at least now I know where I can start :D

Nice!

Jul 20 '18 02:07 eldipa

byexample byexample copied to clipboard

Use a Thompson NFA instead of the Python default re (regex) engine if it is possible

byexample
byexample copied to clipboard