byexample
byexample copied to clipboard
Use a Thompson NFA instead of the Python default re (regex) engine if it is possible
As most of the languages, interpreters, libs, and most of anything out there, Python uses a classical engine that can have an exponential performance in the worst case (known as 'pathological cases').
A non-deterministic finite automata (NFA) can be used to represent any regular expression (without backrefereces). The use of this to implement a regular expression engine is known as the Thompson NFA algorithm.
It is used in the well known grep and awk implementations and it has a linear performance.
The idea is to replace the implementation of re.compile
and the subsequent calls to the re
module and objects by a NFA implementation.
However, as said before, the NFA will be possible only if the regex has no backreferences.
Byexample uses the backreferences when a given captured tag with label foo appears twice or more times in the expected string.
In those cases the byexample must fall back to the re
module.
The Thompson NFA must not introduce any new mandatory dependency but it can require an optional one.
Interesting reading: https://swtch.com/~rsc/regexp/regexp1.html
I couldn't find a 3rd party lib for implementing that:
- is faster or at least have the same performance than the
re
module - supports the same regex syntax that
re
supports
Can you offer https://xysun.github.io/posts/regex-parsing-thompsons-algorithm.html to join this discussion? 😄
It's good to see that the spirit of Thompson is still alive!
Yes, indeed, if a NFA is needed I will ask him to give me a hand on this.
For now, it is not super urgent: byexample
only uses some variants of the .*
regex and with this, I implemented a quite simple linear algorithm which is fair enough.
But if in the future we want to add more complex regexs like \d{2,5}
at least now I know where I can start :D
Nice!