compile-time-regular-expressions icon indicating copy to clipboard operation
compile-time-regular-expressions copied to clipboard

Valid regex does not match

Open ghlecl opened this issue 3 years ago • 4 comments
trafficstars

Hello. I have the following regex:

#include "ctll/fixed_string.hpp"
#include "ctre.hpp"

static constexpr auto ptv_regex = ctll::fixed_string(
    "^(\\d+)(?:[vV](\\d+))?[Pp][Tt][Vv](?:(\\d+)|(\\d+)([AaBbCc]?))?(!)?(?:$|\\^(.+))"
);

which I try to match using:

auto found = ctre::search<ptv_regex>( "1v2PTV1a" );

On regex101.com, this regex matches for the python, PCRE and PCRE2 regex "engines", so I believe the regex is valid and OK. With the CTRE library (version 3.6 from vcpkg and that which is available as trunk on compiler explorer today), it does not match.

Sorry that I don't have the knowledge to try and pinpoint the problem. Just thought I would report it.

ghlecl avatar May 27 '22 17:05 ghlecl

Yep, here's a bug. I can reduce it to

#include "ctll/fixed_string.hpp"
#include "ctre.hpp"

static constexpr auto ptv_regex = ctll::fixed_string(
    "(?:a|ab)?"
);

bool x()
{
    auto found = ctre::match<ptv_regex>( "ab" );
    return found;
}

https://godbolt.org/z/jqn9vrrWc

Alcaro avatar May 27 '22 17:05 Alcaro

Thanks for the minimisation, it seems to me there is a deeper problem, and I'm currently don't know how to fix it. Will try to look at it soon.

hanickadot avatar May 27 '22 18:05 hanickadot

It seems to related when you have a common prefix in selection inside a cycle.

hanickadot avatar May 27 '22 18:05 hanickadot

I am not an expert on how this library implements matching, but it seems to me that there are two factors:

  1. Alternations of the form a|ab will always try to match a before ab.
  2. Cycles are implemented by repeatedly calling evaluate on the content of the cycle:
while (less_than_or_infinite<B>(i)) {
    auto inner_result = evaluate(..., ctll::list<Content..., end_cycle_mark>());
    // ... match Tail if inner_result matches
}

The problem is that, at some point, the expression a|ab has to figure that it needs to match ab immediately instead of a, even if it can only match the prefix a. This cannot happen if the content of the cycle is matched as ctll::list<Content..., end_cycle_mark>, because Content has no idea that Tail comes after it and that it might need to backtrack based on Tail's ability to match. The evaluation of Content thinks the regex ends at end_cycle_mark.

I am writing this because I have also implemented compile-time regular expression matching as a hobby-project (inspired by this library, of course) and I have come across the same bug in my code. The solution that i have found involves using a callback function that calls evaluate for the rest of the expression and passing this function to the evaluation method for Content, so that, after matching Content, the same evaluate method can use the callback to check if Tail matches. It's basically a continuation-passing style approach.

iulian-rusu avatar Jun 19 '22 22:06 iulian-rusu