regex-syntax: some way to retain the AST Span of some punctuation marks?

Open kennytm opened this issue 6 months ago • 2 comments

Consider:

use regex_syntax::ast::parse::ParserBuilder;

fn main() {
    let parse = |pattern| {
        ParserBuilder::new()
            .ignore_whitespace(true)
            .build()
            .parse_with_comments(pattern)
            .unwrap()
    };

    let wc_1 = parse("a #c\n|b");
    let wc_2 = parse("a|#c\n b");
    // This assertion fails: both patterns produce identical `WithComments` values.
    assert_ne!(wc_1, wc_2);
}

The comment #c is attached to different alternatives in the two regexes, but the parse output is identical for both:

WithComments { 
    ast: Alternation(Alternation { 
        span: Span(Position(o: 0, l: 1, c: 1), Position(o: 7, l: 2, c: 3)), 
        asts: [
            Literal(Literal { 
                span: Span(Position(o: 0, l: 1, c: 1), Position(o: 1, l: 1, c: 2)), 
                kind: Verbatim, 
                c: 'a' 
            }), 
            Literal(Literal { 
                span: Span(Position(o: 6, l: 2, c: 2), Position(o: 7, l: 2, c: 3)), 
                kind: Verbatim, 
                c: 'b' 
            })
        ] 
    }), 
    comments: [
        Comment { 
            span: Span(Position(o: 2, l: 1, c: 3), Position(o: 5, l: 2, c: 1)), 
            comment: "c" 
        }
    ] 
}

$$\overbrace{\overbrace{\Huge\color{red} \texttt{a}\mathstrut}^{\textrm{Literal(0..1)}}{\Huge\color{blue}\texttt{␣ }}\underbrace{\Huge\color{green}\texttt{\# c ↵}\mathstrut}_{\textrm{Comment(2..5)}}{\Huge\color{blue}\texttt{ |}}\overbrace{\Huge\color{red}\texttt{b}\mathstrut}^{\textrm{Literal(6..7)}}}^{\textrm{Alternation(0..7)}}$$

Without the span of the | punctuation, parse_with_comments() alone cannot tell us whether the comment belongs to a or b; we have to refer back to the original pattern. At that point it is perhaps easier to just write the parser ourselves :shrug:

I think the Ast type itself should include the Span of these marks when their position cannot be inferred, like the | in a|b|c or the , in a{3,100}.
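
A minimal sketch of the "refer back to the original pattern" workaround, using only the standard library. It assumes the spans are byte offsets (the `o:` fields in the output above) and that the gap between two adjacent alternatives holds only whitespace, comments and a single `|`. `Side`, `find_bar` and `comment_side` are hypothetical helper names, not part of regex-syntax.

```rust
use std::ops::Range;

/// Which side of the `|` a comment falls on.
#[derive(Debug, PartialEq, Eq)]
enum Side {
    /// Left of the `|`: attach to the preceding alternative.
    Before,
    /// Right of the `|`: attach to the following alternative.
    After,
}

/// Scans the gap between two adjacent alternatives for a `|` that is not
/// inside any comment span. All offsets are assumed to be byte offsets.
fn find_bar(pattern: &str, gap: Range<usize>, comments: &[Range<usize>]) -> Option<usize> {
    let bytes = pattern.as_bytes();
    gap.filter(|&i| i < bytes.len())
        .find(|&i| bytes[i] == b'|' && !comments.iter().any(|c| c.contains(&i)))
}

/// Classifies a comment relative to the `|` found at byte offset `bar`.
fn comment_side(bar: usize, comment: &Range<usize>) -> Side {
    if comment.end <= bar {
        Side::Before
    } else {
        Side::After
    }
}
```

For the two patterns above, the gap is 1..6 and the comment is 2..5. The `|` sits at offset 5 in the first pattern and at offset 1 in the second, so the comment falls before the `|` in one case and after it in the other.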

kennytm • Jun 13 '25 09:06

I agree that there isn't an easy way to get what you want.

If there is a simple solution to this, even if it's a breaking change, then I'm open to patches.

I'm unlikely to work on this myself. And major changes to the parser are probably not worth doing.

One thing left out here is why you want this. What is your use case?

BurntSushi • Jun 13 '25 11:06

> If there is a simple solution to this, even if it's a breaking change, then I'm open to patches.

I think a parser option that makes each child's span cover the maximum extent rather than the minimum might be enough to disambiguate the two cases.

$$\overbrace{ \overbrace{ {\Huge\color{red} \texttt{a}} {\Huge\color{blue}\texttt{ ␣ }} \underbrace{\Huge\color{green}\texttt{\# c ↵}\mathstrut}_{\textrm{Comment(2..5)} }}^{\color{maroon}\textrm{Literal(0..5)}}{\Huge\color{blue}\texttt{ |}}\overbrace{\Huge\color{red}\texttt{b}\mathstrut}^{\textrm{Literal(6..7)}}}^{\textrm{Alternation(0..7)}}$$
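
If such an option existed, attributing a comment could reduce to a containment check. The sketch below is standard-library only and assumes byte-offset spans; `owning_alternative` is a hypothetical helper, and the option itself is only a proposal, not something regex-syntax provides.

```rust
use std::ops::Range;

/// Returns the index of the first alternative whose span fully contains
/// the comment span, or `None` if no alternative covers it.
fn owning_alternative(alternatives: &[Range<usize>], comment: &Range<usize>) -> Option<usize> {
    alternatives
        .iter()
        .position(|alt| alt.start <= comment.start && comment.end <= alt.end)
}
```

With the maximum-extent spans from the diagram (0..5 and 6..7), the comment at 2..5 resolves to the first alternative. With the minimal spans the parser emits today (0..1 and 6..7), the same lookup finds no owner.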

> One thing left out here is why you want this. What is your use case?

I'm trying to extend rand_regex to support assigning different weights when sampling from alternations. Using x-mode comments seems to be the least invasive option IMO:

  # weight = 5
  aaaaa
| # weight = 1
  bbbbb

though I could work around it with a hack like

RegexBuilder::new()
    .pattern("(?P<branch_a>aaaaa)|(?P<branch_b>bbbbb)")
    .weights([("branch_a", 5), ("branch_b", 1)])
    .build()
    .unwrap();
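
For the comment-based variant, extracting the weight from a comment body might look like the sketch below. It assumes the comment text arrives without the leading `#` (as in the `comment: "c"` field shown earlier) and uses the `weight = N` form from the example; `parse_weight` is a hypothetical helper, not an existing rand_regex function.

```rust
/// Parses a comment body such as " weight = 5" into a weight.
/// Returns `None` when the comment is not a weight annotation.
fn parse_weight(comment: &str) -> Option<u32> {
    // Split at the first `=`; comments without one are not annotations.
    let (key, value) = comment.split_once('=')?;
    if key.trim() != "weight" {
        return None;
    }
    // Reject values that are not non-negative integers.
    value.trim().parse().ok()
}
```

Comments that do not match the form, such as the `#c` from the first example, are ignored rather than treated as errors.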

kennytm • Jun 14 '25 08:06