compile-time-regular-expressions icon indicating copy to clipboard operation
compile-time-regular-expressions copied to clipboard

[WIP] Prefix needle optimization

Open Andersama opened this issue 5 years ago • 3 comments
trafficstars

Adds a bit of machinery to try to extract a leading string<> if we're about to search. Things to improve on

  • better string extraction, currently very simple only looks for string<> and char<> (may be buried in sequence<>)
  • generalization to string like things
  • utf8 strings, boyer moore is bidirectional, utf8 is best handled forwards only

combines both #143 #146 From testing there's anywhere upwards of a 50% performance improvement (when dealing with non-unicode things).

Andersama avatar Nov 15 '20 23:11 Andersama

Playing around with a dfa based implementation, taking some inspiration from hyperscan. https://branchfree.org/2018/05/25/say-hello-to-my-little-friend-sheng-a-small-but-fast-deterministic-finite-automaton/

I'm going to be doing a bit more benchmarking because the results are all over the place, but it also seems worth it.

Andersama avatar Nov 18 '20 09:11 Andersama

So long as I can properly fit into the sheng table / 16 or fewer chars it looks like a 50x improvement over the current implementation. Also the dfa approach appears to play nicely in debug mode.

Scratch that I have a bug in the dfa, not matching properly, will likely be significantly slower.

Ok, worked it out, currently the dfa's running about 33% faster than the current approach.

Andersama avatar Nov 18 '20 10:11 Andersama

PIcked up the O'Reilley book to see if there were more optimizations mentioned, there's another similar to this one, should've been intuitive I guess, if you know where a string appears, eg after some fixed length, then you can search for the "embedded string". Also looks like a good candidate for some pattern analysis, eg: if you have a string midway through a regex and the preceding part has a minimum and maximum character consumption count being ==, then you're guaranteed that string appears that many characters into the regex, it's safe to search for the string and then shift over to validate a match. Currently trying to automate building dfa's, but I'll cycle back to this because this already was a big performance boost.

Andersama avatar Nov 23 '20 09:11 Andersama