compile-time-regular-expressions icon indicating copy to clipboard operation
compile-time-regular-expressions copied to clipboard

Unrolling selection optimization

Open Andersama opened this issue 5 years ago • 9 comments
trafficstars

Not sure how to go about this with parsing rules, or for all cases. See: http://www.rexegg.com/regex-optimizations.html#staralt

// (greedy) repeat of alternation (?:a|b...)* -> a*(?:b...+a*)*
template <typename R, typename Iterator, typename EndIterator, typename A, typename... Content, typename... Tail>
constexpr CTRE_FORCE_INLINE R evaluate(const Iterator begin, Iterator current, const EndIterator end, R captures, ctll::list<repeat<0, 0, sequence<select<A,Content...>>>, Tail...>) noexcept {
	return evaluate(begin, current, end, captures, ctll::list<star<A>, star<plus<select<Content...>>, star<A>>, Tail...>());
}

// (greedy) repeat of alternation (?:a|b...)* -> a*(?:b...+a*)*
template <typename R, typename Iterator, typename EndIterator, typename A, typename... Content, typename... Tail>
constexpr CTRE_FORCE_INLINE R evaluate(const Iterator begin, Iterator current, const EndIterator end, R captures, ctll::list<repeat<0, 0, select<A, Content...>>, Tail...>) noexcept {
	return evaluate(begin, current, end, captures, ctll::list<star<A>, star<plus<select<Content...>>, star<A>>, Tail...>());
}

Pretty sure with some other overloads capture groups would also work. (a|b...)* -> (a)(?:(b...)+(a))*

Andersama avatar Mar 25 '20 05:03 Andersama

Hi, motivation? looks like I'm missing something, is it an issue? 🤷🏻‍♀️

hanickadot avatar Mar 25 '20 05:03 hanickadot

My understanding is that it helps with backtracking. I'm not sure if it generates a different function yet, but presumably where the possessive optimization fails this transformation would also help in limiting backtracking.

Andersama avatar Mar 25 '20 07:03 Andersama

Is it a pull request?

hanickadot avatar Mar 25 '20 07:03 hanickadot

It could be, but it might not be necessary, so far as I can tell though the code being generated is the same for what I've currently tested. It might be that your collides function that transforms repetitions into possessive form might be helping the compiler get to the same end result. The site seems to indicate the issue with backtracking is due to issues of mutual exclusivity between the sub expressions, this might already be handled.

Andersama avatar Mar 25 '20 08:03 Andersama

I was wrong, it does generate different assembly, may be worthwhile to add: https://compiler-explorer.com/z/rrdqvT

For the simple character alternation which is equivalent to [ab]* it may not be worth it the original is a tight loop, but in other cases where A or B are more complex the tight loops for the individual parts may make for faster matches.

Andersama avatar Oct 31 '20 11:10 Andersama

found this: http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop Given the conditions required, this looks like a good use case for my pattern analysis:

This optimization technique is used to optimize repeated alternation of the form (expr1|expr2|...). These expressions are not uncommon, and the use of another repetition inside an alternation may also leads to super-linear match. Super-linear matches arise from the underterministic expression (a)*. An example of super-linear match could be found in our backtracking example.

The unrolling the loop technique is based on the hypothesis that in most case, you know in a repeated alternation, which case should be the most usual and which one is exceptional. We will call the first one, the normal case and the second one, the special case. The general syntax of the unrolling the loop technique could then be written as:

normal* ( special normal* )*

Which could means something like, match the normal case, if you find a special case, match it then match the normal case again. You notice that part of this syntax could potentially lead to a super-linear match. To avoid a never ending match the following rules should be carefully applied:

  • The start of the special case and the normal case must be mutually exclusive
  • Special must always match at least one character
  • The special expression must be atomic: be careful to the fact that ( special normal* )* could be reduced to (special), which if special is special, this becomes similar to (a*)* which is a undeterministic expression.

If I'm not mistaken to handle greedy optimizations I think you have to test for mutual exclusivity, and if I'm not mistaken that would be: collides(calculate_first(...),calculate_first(...))

Andersama avatar Oct 31 '20 14:10 Andersama

in light of the recent loop issue this is worth of shot

hanickadot avatar Nov 09 '20 09:11 hanickadot

I think I force pushed an update to the pattern analysis pr a week or so ago, think I updated it to handle atomic groups and I think I patched an error for how I was handling repeats. Feel free to take a look.

Also: thinking about it, given how the optimization warns of indeterminate cases, it may be an optimization to check for those patterns and treat them as (a|b...)* to avoid cases which would behave like a*a*

Andersama avatar Nov 10 '20 09:11 Andersama

Must've confused something, I guess I didn't update the pr with what I thought. I have a branch https://github.com/Andersama/compile-time-regular-expressions/tree/new_pattern_analysis That has the latest.

Andersama avatar Nov 10 '20 09:11 Andersama