Is it expected behavior that the regular expression "[0-9]+456" recognizes "123"?
It should not. Can you elaborate? Example?
#include <iostream>
#include <charconv>
#include "ctpg.h"
using namespace ctpg;
using namespace ctpg::buffers;
constexpr nterm<int> list("list");
constexpr char number_pattern[] = "[0-9]+12";
constexpr regex_term<number_pattern> number("number");
int to_int(std::string_view sv)
{
    int i = 0;
    std::from_chars(sv.data(), sv.data() + sv.size(), i);
    return i;
}
constexpr parser p(
    list,
    terms(',', number),
    nterms(list),
    rules(
        list(number) >= to_int,
        list(list, ',', number)
            >= [](int sum, char, const auto& n){ return sum + to_int(n); }
    )
);
int main(int argc, char* argv[])
{
    if (argc < 2)
        return -1;
    auto res = p.parse(string_buffer(argv[1]), std::cerr);
    bool success = res.has_value();
    if (success)
        std::cout << res.value() << std::endl;
    return success ? 0 : -1;
}
This is your example; I only changed the regular expression for number so that it should match only numbers ending in 12, but in fact it still matches all numbers.
In addition, I'm not sure whether it is expected behavior that the regular expression "." also matches newline characters.
It would be ideal to support ^ and $ for recognizing the beginning and end of the string.
Can you set verbose to true in parse_options and look at the output to see what actually happens?
/mnt/e/Desktop/test$ ./test 10,20,30
[1:1] REGEX MATCH: Current char 1
[1:1] REGEX MATCH: New state 3
[1:2] REGEX MATCH: Recognized 1
[1:2] REGEX MATCH: Current char 0
[1:2] REGEX MATCH: New state 3
[1:3] REGEX MATCH: Recognized 1
[1:1] PARSE: Recognized number
[1:1] PARSE: Shift to 2, term: 10
[1:3] REGEX MATCH: Current char ,
[1:3] REGEX MATCH: New state 1
[1:4] REGEX MATCH: Recognized 0
[1:3] PARSE: Recognized ,
[1:3] PARSE: Reduced using rule 0 list <- number
[1:3] PARSE: Go to 1
[1:3] PARSE: Shift to 3, term: ,
[1:4] REGEX MATCH: Current char 2
[1:4] REGEX MATCH: New state 3
[1:5] REGEX MATCH: Recognized 1
[1:5] REGEX MATCH: Current char 0
[1:5] REGEX MATCH: New state 3
[1:6] REGEX MATCH: Recognized 1
[1:4] PARSE: Recognized number
[1:4] PARSE: Shift to 4, term: 20
[1:6] REGEX MATCH: Current char ,
[1:6] REGEX MATCH: New state 1
[1:7] REGEX MATCH: Recognized 0
[1:6] PARSE: Recognized ,
[1:6] PARSE: Reduced using rule 1 list <- list , number
[1:6] PARSE: Go to 1
[1:6] PARSE: Shift to 3, term: ,
[1:7] REGEX MATCH: Current char 3
[1:7] REGEX MATCH: New state 3
[1:8] REGEX MATCH: Recognized 1
[1:8] REGEX MATCH: Current char 0
[1:8] REGEX MATCH: New state 3
[1:9] REGEX MATCH: Recognized 1
[1:7] PARSE: Recognized number
[1:7] PARSE: Shift to 4, term: 30
[1:9] PARSE: Recognized
[1:9] PARSE: Reduced using rule 1 list <- list , number
[1:9] PARSE: Go to 1
[1:9] PARSE: Recognized
[1:9] PARSE: Success
60
Another example might better illustrate the problem: the regular expression ".*any_ending" will consume all of the input and never stop, because "." matches any character.
Fundamentally, I suspect the matcher consumes one token from the regular expression and keeps matching it against the input until it fails, only then advancing to the next regex token. A more accurate strategy would have to consider one or more alternative branches of the regular expression to decide which path to take, but I don't know the internals.
Regardless, a custom lexer might be more practical, so whether this feature gets fixed or not doesn't really matter.
You found a bug. The regex-to-finite-state-machine conversion code looks like spaghetti, and concatenation, for some reason, is not unit tested at all.