compile-time-regular-expressions
compile-time-regular-expressions copied to clipboard
Atomic sequence support?
Atomic groups much like possessive modifiers assist in preventing excessive backtracking.
If I'm not too far off I think this roughly models an atomic group*. How to modify pcre.hpp to add the ll1 parser is a bit beyond me.
// matching atomic sequence in patterns
template <typename R, typename Iterator, typename EndIterator, typename HeadContent, typename... TailContent, typename... Tail>
constexpr CTRE_FORCE_INLINE R evaluate(const Iterator begin, Iterator current, const EndIterator end, R captures, ctll::list<atomic_sequence<HeadContent, TailContent...>, Tail...>) noexcept {
if constexpr (sizeof...(TailContent) > 0) {
if (auto r1 = evaluate(begin, current, end, captures, ctll::list<HeadContent, sequence<TailContent...>())) { //find the first match like a regular sequence*
current = r1.get_end_position();
return evaluate(begin, current, end, captures, ctll::list<Tail...>()); //continue w/o being able to backtrack past the atomic group
} else {
return not_matched;
}
} else { //if nothing follows an atomic group / sequence, (?>abc|def) it's like evaluating abc|def
return evaluate(begin, current, end, captures, ctll::list<HeadContent, Tail...>());
}
}
Looking back, slightly off, the function would end up looking like:
template <typename R, typename Iterator, typename EndIterator, typename HeadContent, typename... TailContent, typename... Tail>
constexpr CTRE_FORCE_INLINE R evaluate(const Iterator begin, Iterator current, const EndIterator end, R captures, ctll::list<atomic_sequence<HeadContent, TailContent...>, Tail...>) noexcept {
if constexpr (sizeof...(TailContent) > 0) {
if (auto r1 = evaluate(begin, current, end, captures, ctll::list<HeadContent, atomic_sequence<TailContent...>>())) { //find the first match like a regular sequence*
current = r1.get_end_position();
return evaluate(begin, current, end, captures, ctll::list<Tail...>()); //continue w/o being able to backtrack past the atomic group
} else {
return not_matched;
}
} else {
if (auto r1 = evaluate(begin, current, end, captures, ctll::list<HeadContent>())) {
current = r1.get_end_position();
return evaluate(begin, current, end, captures, ctll::list<Tail...>()); //continue w/o being able to backtrack past the atomic group
} else {
return not_matched;
}
}
}
I looked into it and I think this is all needed for atomic support: https://github.com/hanickadot/compile-time-regular-expressions/commit/ce4e8c010b1bfea8d33a754eae236e5ee6643d4d#diff-b459c698ea25ff48eed5ed8109277cb61094f779b437a8942123840d4db469e0R387-R399
Does it not change how certain things inside behave? Wouldn't it change sequences to atomic groups etc? Be glad to be wrong, especially if it was that easy.
Based on this: https://www.regular-expressions.info/atomic.html the atomic_group behaves as starts_with function, and the rest of regex continues where the inner part ended. But I can be wrong.
The examples it gives w/ a select would probably be a good static assert to test against:
a(?>bc|b)c and \b(integer|insert|in)\b
Playing around with a similar example on regex101: a(?>a*(?:bc|b))c
abc (no match)
abcc (matches)
aabc (no match)
aabcc (matches)
I think that means it's changing how the parts inside (?>) function. The select inside the atomic group never reaches the second choice (because the first character overlaps). It appears anything inside an atomic group gets the atomic behavior.
These tests added 263cb1f65fbfe772c5b3748ca6c180cafa61fb2b. Btw that's exactly the essence of bug #136 where inner part of loops behaves as atomic, which is nice optimization but exactly cases like (a|ab)+ fails on ab
Should've probably seen that coming, an atomic group would behave differently if mutually exclusive or not.
I ran into this issue when writing optimizations for select*
The intuition / fix is something along the lines of this:
select<sequence<T...>,sequence<T...,B...>,...>
->
select<sequence<T...,optional<B...>>,...>
I think I spotted some code of yours to do w/ testing against select<>'s and empty/epsilons, it's roughly equivalent to:
select<sequence<T...>,select<epsilon, sequence<B...>>>
Also makes the other way around make sense as to why that works, (ab|a)+ is a one to one of (ab?)+
@hanickadot I'm not exactly the best template meta programmer out there*, however if there were a way to order the select<> branches by their starting characters ranges and then by their expression lengths that would be one way to handle this. Might be possible by doing some character length analysis like I've done in #166