regex icon indicating copy to clipboard operation
regex copied to clipboard

Optimisation opportunity: literals in the middle of regex

Open RReverser opened this issue 7 years ago • 2 comments

This has been discussed in private before, but decided to make it into an actual issue so that it doesn't get lost in time.

Right now, regexes like .*literal and literal.* get optimised by extracting literal prefixes + forward DFA or suffixes + reverse DFA correspondingly, however regexes like .*literal.* are not and are executed as a full DFA, even though they could be a generic case of literal optimisation where literal is detected, found and then both forward and reverse DFA are executed from its boundaries into different directions.

Example benchmark from original discussion: https://gist.github.com/RReverser/a48c9a7332df9ee7bc7abe5f1f708bb7

And numbers showing the current difference:

test with_dots            ... bench:         109 ns/iter (+/- 8)
test with_dots_and_s_mode ... bench:         109 ns/iter (+/- 9)
test without_back_dots    ... bench:         249 ns/iter (+/- 5)
test without_dots         ... bench:          38 ns/iter (+/- 0)
test without_front_dots   ... bench:          56 ns/iter (+/- 1)

RReverser avatar Sep 06 '18 13:09 RReverser

Thanks for filing this! It does look like an optimization is possible in this case. We need to be extremely careful though in how we deal with this to make sure we avoid worst case quadratic time complexity. I don't think this specific example is susceptible to this though since every occurrence of literal is itself guaranteed to be a match.

Note that the current infrastructure in this crate is nowhere near capable of implementing something like this in a maintainable way. I'll keep cases like this in mind while doing a crate overhaul in the coming months (or possibly years).

BurntSushi avatar Sep 06 '18 13:09 BurntSushi

Sure, I looked into code to see how easy it would be to implement this and came to the same conclusion that it would currently require significant changes, but decided to file nevertheless.

Btw, few more thing I've noticed, but not sure if they're worth dedicated issues or should be just left as part of this one:

  1. Even using lazy quantifier like .*?literal.*? seems not to produce any prefixes or suffixes. I guess prefixes are understandable, since apparently .*? still has to return all possible non-newline chars before the literal, but I suppose suffixes could be still generated as an optimisation opportunity? (since .*? in the end can be ignored altogether)
  2. is_match currently just calls into shortest_match, which, in turn, for e.g. pattern literal.* (already optimisable) and string abcliteraldef, still tries to find entire match by executing DFA after the prefix. It looks like another optimisation opportunity to just ignore any nullable patterns at ends of the pattern when we don't care about the actual match location.

RReverser avatar Sep 06 '18 14:09 RReverser