Is it expected behavior that the regular expression "[0-9]+456" recognizes "123"?
It should not. Can you elaborate? Example?
#include <iostream>
#include <charconv>
#include "ctpg.h"
using namespace ctpg;
using namespace ctpg::buffers;
constexpr nterm<int> list("list");
constexpr char number_pattern[] = "[0-9]+12";
constexpr regex_term<number_pattern> number("number");
int to_int(std::string_view sv)
{
    int i = 0;
    std::from_chars(sv.data(), sv.data() + sv.size(), i);
    return i;
}
constexpr parser p(
    list,
    terms(',', number),
    nterms(list),
    rules(
        list(number) >= to_int,
        list(list, ',', number)
            >= [](int sum, char, const auto& n){ return sum + to_int(n); }
    )
);
int main(int argc, char* argv[])
{
    if (argc < 2)
        return -1;
    auto res = p.parse(string_buffer(argv[1]), std::cerr);
    bool success = res.has_value();
    if (success)
        std::cout << res.value() << std::endl;
    return success ? 0 : -1;
}
This is your example; I only changed the regular expression for number so that it should match only numbers ending in 12, but in fact it still matches all numbers.
In addition, I'm not sure whether it is expected behavior that the regular expression "." also matches newline characters.
It would be ideal to support ^ and $ for recognizing the beginning and end of the string.
Can you set verbose to true in parse_options and look at the output to see what actually happens?
/mnt/e/Desktop/test$ ./test 10,20,30
[1:1] REGEX MATCH: Current char 1
[1:1] REGEX MATCH: New state 3
[1:2] REGEX MATCH: Recognized 1
[1:2] REGEX MATCH: Current char 0
[1:2] REGEX MATCH: New state 3
[1:3] REGEX MATCH: Recognized 1
[1:1] PARSE: Recognized number
[1:1] PARSE: Shift to 2, term: 10
[1:3] REGEX MATCH: Current char ,
[1:3] REGEX MATCH: New state 1
[1:4] REGEX MATCH: Recognized 0
[1:3] PARSE: Recognized ,
[1:3] PARSE: Reduced using rule 0 list <- number
[1:3] PARSE: Go to 1
[1:3] PARSE: Shift to 3, term: ,
[1:4] REGEX MATCH: Current char 2
[1:4] REGEX MATCH: New state 3
[1:5] REGEX MATCH: Recognized 1
[1:5] REGEX MATCH: Current char 0
[1:5] REGEX MATCH: New state 3
[1:6] REGEX MATCH: Recognized 1
[1:4] PARSE: Recognized number
[1:4] PARSE: Shift to 4, term: 20
[1:6] REGEX MATCH: Current char ,
[1:6] REGEX MATCH: New state 1
[1:7] REGEX MATCH: Recognized 0
[1:6] PARSE: Recognized ,
[1:6] PARSE: Reduced using rule 1 list <- list , number
[1:6] PARSE: Go to 1
[1:6] PARSE: Shift to 3, term: ,
[1:7] REGEX MATCH: Current char 3
[1:7] REGEX MATCH: New state 3
[1:8] REGEX MATCH: Recognized 1
[1:8] REGEX MATCH: Current char 0
[1:8] REGEX MATCH: New state 3
[1:9] REGEX MATCH: Recognized 1
[1:7] PARSE: Recognized number
[1:7] PARSE: Shift to 4, term: 30
[1:9] PARSE: Recognized
[1:9] PARSE: Reduced using rule 1 list <- list , number
[1:9] PARSE: Go to 1
[1:9] PARSE: Recognized
[1:9] PARSE: Success
60
Another example might better illustrate the problem: the regular expression ".*any_ending" will consume all of the input and never stop, because "." matches any character.
Fundamentally, I suspect the matcher consumes one token from the regular expression and keeps matching it against the input until it fails, only then advancing to the next regex token. A more accurate strategy would have to consider one or more alternative branches of the regular expression to decide which path to take, but I don't know the internals.
Regardless, a custom lexer might be more practical, so whether this feature gets fixed or not doesn't really matter.
You found a bug. The regex-to-finite-state-machine conversion code looks like spaghetti, and concatenation, for some reason, is not unit tested at all.