compile-time-regular-expressions icon indicating copy to clipboard operation
compile-time-regular-expressions copied to clipboard

Atomic sequence support?

Open Andersama opened this issue 5 years ago • 9 comments
trafficstars

Atomic groups much like possessive modifiers assist in preventing excessive backtracking.

If I'm not too far off I think this roughly models an atomic group*. How to modify pcre.hpp to add the ll1 parser is a bit beyond me.

// matching atomic sequence in patterns
template <typename R, typename Iterator, typename EndIterator, typename HeadContent, typename... TailContent, typename... Tail>
constexpr CTRE_FORCE_INLINE R evaluate(const Iterator begin, Iterator current, const EndIterator end, R captures, ctll::list<atomic_sequence<HeadContent, TailContent...>, Tail...>) noexcept {
	if constexpr (sizeof...(TailContent) > 0) {
		if (auto r1 = evaluate(begin, current, end, captures, ctll::list<HeadContent, sequence<TailContent...>())) { //find the first match like a regular sequence*
			current = r1.get_end_position();
			return evaluate(begin, current, end, captures, ctll::list<Tail...>()); //continue w/o being able to backtrack past the atomic group
		} else {
			return not_matched;
		}
	} else { //if nothing follows an atomic group / sequence, (?>abc|def) it's like evaluating abc|def
		return evaluate(begin, current, end, captures, ctll::list<HeadContent, Tail...>());
	}
}

Andersama avatar Sep 12 '20 02:09 Andersama

Looking back, slightly off, the function would end up looking like:

template <typename R, typename Iterator, typename EndIterator, typename HeadContent, typename... TailContent, typename... Tail>
constexpr CTRE_FORCE_INLINE R evaluate(const Iterator begin, Iterator current, const EndIterator end, R captures, ctll::list<atomic_sequence<HeadContent, TailContent...>, Tail...>) noexcept {
	if constexpr (sizeof...(TailContent) > 0) {
		if (auto r1 = evaluate(begin, current, end, captures, ctll::list<HeadContent, atomic_sequence<TailContent...>>())) { //find the first match like a regular sequence*
			current = r1.get_end_position();
			return evaluate(begin, current, end, captures, ctll::list<Tail...>()); //continue w/o being able to backtrack past the atomic group
		} else {
			return not_matched;
		}
	} else {
		if (auto r1 = evaluate(begin, current, end, captures, ctll::list<HeadContent>())) {
			current = r1.get_end_position();
			return evaluate(begin, current, end, captures, ctll::list<Tail...>()); //continue w/o being able to backtrack past the atomic group
		} else {
			return not_matched;
		}
	}
}

Andersama avatar Oct 28 '20 22:10 Andersama

I looked into it and I think this is all needed for atomic support: https://github.com/hanickadot/compile-time-regular-expressions/commit/ce4e8c010b1bfea8d33a754eae236e5ee6643d4d#diff-b459c698ea25ff48eed5ed8109277cb61094f779b437a8942123840d4db469e0R387-R399

hanickadot avatar Nov 09 '20 10:11 hanickadot

Does it not change how certain things inside behave? Wouldn't it change sequences to atomic groups etc? Be glad to be wrong, especially if it was that easy.

Andersama avatar Nov 09 '20 12:11 Andersama

Based on this: https://www.regular-expressions.info/atomic.html the atomic_group behaves as starts_with function, and the rest of regex continues where the inner part ended. But I can be wrong.

hanickadot avatar Nov 09 '20 12:11 hanickadot

The examples it gives w/ a select would probably be a good static assert to test against: a(?>bc|b)c and \b(integer|insert|in)\b

Playing around with a similar example on regex101: a(?>a*(?:bc|b))c abc (no match) abcc (matches) aabc (no match) aabcc (matches)

I think that means it's changing how the parts inside (?>) function. The select inside the atomic group never reaches the second choice (because the first character overlaps). It appears anything inside an atomic group gets the atomic behavior.

Andersama avatar Nov 09 '20 12:11 Andersama

These tests added 263cb1f65fbfe772c5b3748ca6c180cafa61fb2b. Btw that's exactly the essence of bug #136 where inner part of loops behaves as atomic, which is nice optimization but exactly cases like (a|ab)+ fails on ab

hanickadot avatar Nov 09 '20 12:11 hanickadot

Should've probably seen that coming, an atomic group would behave differently if mutually exclusive or not.

Andersama avatar Nov 09 '20 13:11 Andersama

I ran into this issue when writing optimizations for select*

The intuition / fix is something along the lines of this: select<sequence<T...>,sequence<T...,B...>,...> -> select<sequence<T...,optional<B...>>,...>

I think I spotted some code of yours to do w/ testing against select<>'s and empty/epsilons, it's roughly equivalent to: select<sequence<T...>,select<epsilon, sequence<B...>>>

Also makes the other way around make sense as to why that works, (ab|a)+ is a one to one of (ab?)+

Andersama avatar Jan 06 '21 08:01 Andersama

@hanickadot I'm not exactly the best template meta programmer out there*, however if there were a way to order the select<> branches by their starting characters ranges and then by their expression lengths that would be one way to handle this. Might be possible by doing some character length analysis like I've done in #166

Andersama avatar Jan 06 '21 08:01 Andersama