
Implement a way to bind match results in patterns

Open osa1 opened this issue 4 years ago • 3 comments

Suppose I want to parse \xHH syntax in a rule (H is a hex digit). Currently it's quite painful to get the chars for those digits in a larger rule:

"\\x" $hex_digit $hex_digit => |mut lexer| {
    let match_ = lexer.match_();
    let bytes = match_.as_bytes();

    let digit1 = bytes[bytes.len() - 2];
    let digit2 = bytes[bytes.len() - 1];

    ...
},

The problem is that the match contains all input consumed since the beginning of the current rule's match. For example, if this is part of a string lexing rule and the string is "blah blah \xAA", the match will be the entire string except the closing ".

One workaround is to add a separate rule for the hex digits and push them to a buffer on each match, but that's also verbose.

If we had a way to bind regexes inside patterns we could do:

"\\x" <d1:$hex_digit> <d2:$hex_digit> => |mut lexer| {
    ...
},

where d1 and d2 would be chars.
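With such bindings, the semantic action could assemble the escaped byte directly from the two bound digits. A minimal sketch of what the action body might do, assuming d1 and d2 are the bound hex digit chars (hex_escape_byte is a hypothetical helper, not part of lexgen):

```rust
// Hypothetical helper: combine the two bound hex digit chars of a \xHH
// escape into the byte they encode.
fn hex_escape_byte(d1: char, d2: char) -> u8 {
    // to_digit(16) cannot fail here if the rule only matches hex digits.
    (d1.to_digit(16).unwrap() * 16 + d2.to_digit(16).unwrap()) as u8
}

fn main() {
    assert_eq!(hex_escape_byte('A', 'A'), 0xAA);
    assert_eq!(hex_escape_byte('0', 'f'), 0x0f);
}
```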

osa1 avatar Jun 19 '21 12:06 osa1

Here's another use case. Rust integers can be lexed like this:

rule Init {
    ...

    ("0b" | "0o" | "0x")? ($digit | '_')* $id? =? |lexer| {
        let match_ = lexer.match_();
        lexer.return_(check_int(match_))
    },
}

// - Check that digits match the prefix (e.g. if the match starts with "0b", digits must be 1 or 0)
// - Check that there is at least one digit
// - Check that the suffix is valid ("i32", "u64", etc.)
fn check_int<'input>(match_: &'input str) -> Result<Token, CustomError> { ... }

With this implementation, check_int needs to split the string into prefix, digits, and suffix, even though we already did that in the lexer.

With the proposed feature, it could be implemented as:

rule Init {
    ...

    <prefix:("0b" | "0o" | "0x")?> <digits:($digit | '_')*> <suffix:$id?> =? |lexer| {
        let match_ = lexer.match_();
        lexer.return_(check_int(prefix, digits, suffix))
    },
}

// - Check that digits match the prefix (e.g. if the match starts with "0b", digits must be 1 or 0)
// - Check that there is at least one digit
// - Check that the suffix is valid ("i32", "u64", etc.)
fn check_int<'input>(prefix: &'input str, digits: &'input str, suffix: &'input str) -> Result<Token, CustomError> { ... }
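To make the benefit concrete, here's a sketch of what check_int could look like once the pieces arrive pre-split. Token, CustomError, and the accepted suffix set are placeholder assumptions for illustration, not lexgen's or the real lexer's definitions:

```rust
#[derive(Debug, PartialEq)]
enum Token { Int(u64) }

#[derive(Debug, PartialEq)]
struct CustomError;

// Sketch: validate the already-split pieces instead of re-splitting the match.
fn check_int(prefix: &str, digits: &str, suffix: &str) -> Result<Token, CustomError> {
    let radix = match prefix {
        "0b" => 2,
        "0o" => 8,
        "0x" => 16,
        _ => 10,
    };
    // Drop '_' separators; require at least one digit and that every digit
    // is valid for the prefix's radix.
    let clean: String = digits.chars().filter(|&c| c != '_').collect();
    if clean.is_empty() || !clean.chars().all(|c| c.is_digit(radix)) {
        return Err(CustomError);
    }
    // Placeholder suffix check; the real list would cover all integer types.
    if !matches!(suffix, "" | "i32" | "u32" | "i64" | "u64") {
        return Err(CustomError);
    }
    u64::from_str_radix(&clean, radix)
        .map(Token::Int)
        .map_err(|_| CustomError)
}

fn main() {
    assert_eq!(check_int("0x", "FF", ""), Ok(Token::Int(255)));
    assert_eq!(check_int("0b", "2", ""), Err(CustomError)); // '2' invalid in base 2
    assert_eq!(check_int("", "1_000", "u64"), Ok(Token::Int(1000)));
}
```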

osa1 avatar Oct 30 '21 07:10 osa1

I'm not sure how to implement this feature yet, but here's an idea. Suppose I have a rule with these regexes:

<a:"aaa"> <b:"a"> <c:"x"> => ...,
<a:"a"> <b:"aaa"> <c:"y"> => ...,

In the generated code, after seeing 4 'a's, we will need a list of match stacks with these contents:

[
    [(0, 3), (3, 4)], // bound matches for the first rule
    [(0, 1), (1, 4)], // bound matches for the second rule
]

After 4 'a's, if the next character is x, we extend the first match stack with (4, 5) and run the first rule's semantic action with the ranges in that stack bound. If the next character is y, we do the same using the second stack.

Conceptually, it seems like we could annotate NFA nodes with "start match" and "end match". In the example above, initially we will have 2 NFAs:

regex 1: 0 (init) -(a)-> 1 -(a)-> 2 -(a)-> 3 -(a)-> 4 -(x)-> 5 (accept)
regex 2: 0 (init) -(a)-> 1 -(a)-> 2 -(a)-> 3 -(a)-> 4 -(y)-> 5 (accept)

We add the captures to these NFAs like this:

regex 1: 0 (init, start match) -(a)-> 1 -(a)-> 2 -(a)-> 3 (end match, start match) -(a)-> 4 (end match, start match) -(x)-> 5 (end match, accept)
regex 2: 0 (init, start match) -(a)-> 1 (end match, start match) -(a)-> 2 -(a)-> 3 -(a)-> 4 (end match, start match) -(y)-> 5 (end match, accept)

We will also need the regex number for these "start match" and "end match" annotations, to be able to update the right match stack in the match stack list when starting/ending a match.

Once we have these, we can generate the DFA like this:

    0 (init, start match 1, start match 2)      match stacks = [(0, ?)], [(0, ?)]
-(a)->
    1 (end match 2, start match 2)              match stacks = [(0, ?)], [(0, 1), (1, ?)]
-(a)->
    2
-(a)->
    3 (end match 1, start match 1)              match stacks = [(0, 3), (3, ?)], [(0, 1), (1, ?)]
-(a)->
    4 (end match 1 2, start match 1 2)          match stacks = [(0, 3), (3, 4), (4, ?)], [(0, 1), (1, 4), (4, ?)]

(Notation is a bit strange, but hopefully conveys the idea)

Now in state 4, if we see an x:

-(x)->
    5 (end match 1, accept)                     match stacks = [(0, 3), (3, 4), (4, 5)], [(0, 1), (1, 4), (4, ?)]

Now we bind the variables in the first regex to the ranges in the first match stack and run its semantic action. The same happens if we see a y, except we end match 2 instead and bind variables using the second match stack.

Regexes that don't bind variables do not need a match stack, so the maximum number of stacks will be quite small in real-world examples. I doubt we will see more than 5-6 stacks at a time.

We know the size of each match stack at compile time, so we can allocate each stack with the right capacity.
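The bookkeeping described above can be hand-simulated. This is a sketch (not generated code) of the match-stack updates for the "aaaa" + 'x' path in the example, with usize::MAX standing in for the open "?" end of an in-progress match:

```rust
// Sentinel for the "?" (still open) end of a range.
const OPEN: usize = usize::MAX;

// Hand-simulation of the match stacks for input "aaaax" against
//   regex 1: <a:"aaa"> <b:"a"> <c:"x">
//   regex 2: <a:"a">   <b:"aaa"> <c:"y">
// Returns (stack for regex 1, stack for regex 2).
fn simulate_aaaax() -> (Vec<(usize, usize)>, Vec<(usize, usize)>) {
    // Capacity 3: each regex has exactly 3 captures, known at compile time.
    let mut stack1: Vec<(usize, usize)> = Vec::with_capacity(3);
    let mut stack2: Vec<(usize, usize)> = Vec::with_capacity(3);

    // state 0: start match 1, start match 2
    stack1.push((0, OPEN));
    stack2.push((0, OPEN));
    // state 1: end match 2, start match 2
    stack2.last_mut().unwrap().1 = 1;
    stack2.push((1, OPEN));
    // state 3: end match 1, start match 1
    stack1.last_mut().unwrap().1 = 3;
    stack1.push((3, OPEN));
    // state 4: end match 1 & 2, start match 1 & 2
    stack1.last_mut().unwrap().1 = 4;
    stack1.push((4, OPEN));
    stack2.last_mut().unwrap().1 = 4;
    stack2.push((4, OPEN));
    // seeing 'x' in state 4: end match 1, accept regex 1
    stack1.last_mut().unwrap().1 = 5;

    (stack1, stack2)
}

fn main() {
    let (stack1, stack2) = simulate_aaaax();
    // a = "aaa", b = "a", c = "x" for regex 1; regex 2's last range stays open.
    assert_eq!(stack1, vec![(0, 3), (3, 4), (4, 5)]);
    assert_eq!(stack2, vec![(0, 1), (1, 4), (4, OPEN)]);
}
```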

Lexers that don't use this feature will still be allocation-free. Others will allocate a vector every time we start matching a regex that uses this feature. In that sense this feature is "zero cost" (i.e. you don't pay a price if you don't use it).

Match stacks are reset when we run a semantic action: we remove the stack used by the action and clear the remaining stacks in the list.

TODO: Need to figure out how to implement backtracking.

osa1 avatar Nov 01 '21 07:11 osa1

It seems difficult to support this syntax with full generality. For example

('a' <x:'b'>? 'c')+ => ...

Here the type of x is Vec<Option<&'input str>>. To maintain this match stack, we need to push a new stack when we enter the state for this regex (rather than when entering the captured sub-regex, as in the example in my previous comment), and then push a Some(...) every time we see a 'b', and a None every time we see a 'c' without having seen a 'b' in that iteration.
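For concreteness, here is a hand-simulated sketch (not a proposed implementation) of the bindings the generated code would have to build for x on the input "abcac":

```rust
// Hand-simulated bindings for <x:'b'>? inside ('a' <x:'b'>? 'c')+,
// producing one Vec entry per repetition of the group.
fn simulate_bindings(input: &str) -> Vec<Option<&str>> {
    let mut x: Vec<Option<&str>> = Vec::new();
    let mut saw_b_at: Option<usize> = None;
    for (i, c) in input.char_indices() {
        match c {
            'b' => saw_b_at = Some(i),
            // 'c' closes the iteration: push Some(range) if the optional
            // 'b' matched in this iteration, None otherwise.
            'c' => x.push(saw_b_at.take().map(|i| &input[i..i + 1])),
            _ => {}
        }
    }
    x
}

fn main() {
    // "abc" contributes Some("b"); "ac" contributes None.
    assert_eq!(simulate_bindings("abcac"), vec![Some("b"), None]);
}
```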

No idea how to generalize the idea in my previous comment to handle this case.

osa1 avatar Nov 01 '21 10:11 osa1