compile-time-regular-expressions
compile-time-regular-expressions copied to clipboard
[WIP] Prefix needle optimization
Adds a bit of machinery to try to extract a leading string<> if we're about to search. Things to improve on
- better string extraction, currently very simple only looks for string<> and char<> (may be buried in sequence<>)
- generalization to string like things
- utf8 strings, boyer moore is bidirectional, utf8 is best handled forwards only
combines both #143 #146 From testing there's anywhere upwards of a 50% performance improvement (when dealing with non-unicode things).
Playing around with a dfa based implementation, taking some inspiration from hyperscan. https://branchfree.org/2018/05/25/say-hello-to-my-little-friend-sheng-a-small-but-fast-deterministic-finite-automaton/
I'm going to be doing a bit more benchmarking because the results are all over the place, but it also seems worth it.
So long as I can properly fit into the sheng table / 16 or fewer chars it looks like a 50x improvement over the current implementation. Also the dfa approach appears to play nicely in debug mode.
Scratch that I have a bug in the dfa, not matching properly, will likely be significantly slower.
Ok, worked it out, currently the dfa's running about 33% faster than the current approach.
PIcked up the O'Reilley book to see if there were more optimizations mentioned, there's another similar to this one, should've been intuitive I guess, if you know where a string appears, eg after some fixed length, then you can search for the "embedded string". Also looks like a good candidate for some pattern analysis, eg: if you have a string midway through a regex and the preceding part has a minimum and maximum character consumption count being ==, then you're guaranteed that string appears that many characters into the regex, it's safe to search for the string and then shift over to validate a match. Currently trying to automate building dfa's, but I'll cycle back to this because this already was a big performance boost.