regex icon indicating copy to clipboard operation
regex copied to clipboard

Documentation for hir::Repetition is misleading

Open nwalfield opened this issue 5 years ago • 5 comments

Consider the following code:

use regex_syntax::hir;

fn main() {
    // Is this: xab+y or x(ab)+y?
    let xabplusy = hir::Hir::concat(vec![
        hir::Hir::literal(hir::Literal::Unicode('x')),
        hir::Hir::repetition(hir::Repetition {
            kind: hir::RepetitionKind::OneOrMore,
            greedy: true,
            hir: Box::new(
                hir::Hir::concat(vec![
                    hir::Hir::literal(hir::Literal::Unicode('a')),
                    hir::Hir::literal(hir::Literal::Unicode('b')),
                ])
            )
        }),
        hir::Hir::literal(hir::Literal::Unicode('x'))
    ]);

    let regex = xabplusy.to_string();
    eprintln!("{}", regex);
}

Running the above code yields: xab+y

The documentation for Repetition says:

hir: Box<Hir>

The expression being repeated.

This leads me to believe that the whole Hir will be repeated (in this case ab). But, at least in the case of a Hir::concat, it appears that only the last Hir in the vector (b) is repeated.

nwalfield avatar Dec 25 '20 22:12 nwalfield

Is this an issue with how Hir is rendered? Looking at Writer::visit_pre https://github.com/rust-lang/regex/blob/04e025b86144bbdf41425fef4a1d06161dc645d7/regex-syntax/src/hir/print.rs#L86-L91 and Writer::visit_post https://github.com/rust-lang/regex/blob/04e025b86144bbdf41425fef4a1d06161dc645d7/regex-syntax/src/hir/print.rs#L173-L180 repetitions should probably be wrapped in parentheses.

mingyli avatar Dec 28 '20 19:12 mingyli

It looks like alternations have the same issue:

use regex_syntax::hir;

fn main() {
    // Is this: xab+y or x(ab)+y?
    let xabplusy = hir::Hir::concat(vec![
        hir::Hir::literal(hir::Literal::Unicode('x')),
        hir::Hir::repetition(hir::Repetition {
            kind: hir::RepetitionKind::OneOrMore,
            greedy: true,
            hir: Box::new(hir::Hir::concat(vec![
                hir::Hir::literal(hir::Literal::Unicode('a')),
                hir::Hir::literal(hir::Literal::Unicode('b')),
            ])),
        }),
        hir::Hir::alternation(vec![
            hir::Hir::concat(vec![
                hir::Hir::literal(hir::Literal::Unicode('f')),
                hir::Hir::literal(hir::Literal::Unicode('g')),
            ]),
            hir::Hir::literal(hir::Literal::Unicode('h')),
        ]),
        hir::Hir::literal(hir::Literal::Unicode('x')),
    ]);

    let regex = xabplusy.to_string();
    eprintln!("{}", regex);
}

yields xab+fg|hx

mingyli avatar Dec 28 '20 19:12 mingyli

Yeah I think this is probably a bug in the printer. I don't think this kind of ambiguity is produced when roundtripping from the concrete syntax, because groups (even non-capturing groups) aren't flattened. Arguably they should be via the Hir's smart constructors, which would surface this bug more readily.

BurntSushi avatar Dec 28 '20 19:12 BurntSushi

This ended up being a duplicate of #516. And a fix is incoming. See https://github.com/rust-lang/regex/issues/516#issuecomment-1230606293 for more details.

BurntSushi avatar Aug 29 '22 17:08 BurntSushi

Thanks for taking the time to work on this.

nwalfield avatar Aug 29 '22 17:08 nwalfield