regex Clarify (or change) extended/whitespace mode treatment of spaces in character classes

I was bitten pretty hard (my fault!) by a subtle difference in eXtended mode's handling of spaces in character classes. I was expecting (albeit in a much more complicated context) (?x)[ ] to match a single space, as it does with PCRE2, but that does not seem to be the case (and doesn't seem to be documented?).

In PCRE2, (?x) allows insignificant whitespace everywhere except in character classes, where a space is treated as a literal (the same way a . is a literal in a character class, I suppose). To have whitespace ignored inside character classes as well, eXtra eXtended mode can be used: (?xx)[one two] does not match against a space:

> printf 'hello world' | pcre2grep '(?x)hello[ .]world'
hello world¶
> printf 'hello world' | pcre2grep '(?xx)hello[ .]world'
> # did not match

(You can also refer to https://www.regular-expressions.info/freespacing.html)

Mar 29 '20 02:03 mqudsi

This is expected behavior. I don't see any particular reason to change it personally. I'd rather keep the semantics of x mode simple. Certainly, it's not something that can be changed without a breaking change release.

See also #523. It might be good to have a more holistic section on x mode.

Mar 29 '20 02:03 BurntSushi

Tip: if you cargo install --path regex-debug/Cargo.toml, then you can use it to look at the HIR of a regex which will help you debug these sorts of issues in the future, should you need it. For example:

$ regex-debug hir '[a z]'
Hir {
    kind: Class(
        Unicode(
            ClassUnicode {
                set: IntervalSet {
                    ranges: [
                        ClassUnicodeRange {
                            start: "0x20",
                            end: "0x20",
                        },
                        ClassUnicodeRange {
                            start: "a",
                            end: "a",
                        },
                        ClassUnicodeRange {
                            start: "z",
                            end: "z",
                        },
                    ],
                },
            },
        ),
    ),
    info: HirInfo {
        bools: 1,
    },
}

And:

$ regex-debug hir '(?x)[a z]'
Hir {
    kind: Class(
        Unicode(
            ClassUnicode {
                set: IntervalSet {
                    ranges: [
                        ClassUnicodeRange {
                            start: "a",
                            end: "a",
                        },
                        ClassUnicodeRange {
                            start: "z",
                            end: "z",
                        },
                    ],
                },
            },
        ),
    ),
    info: HirInfo {
        bools: 1,
    },
}

Mar 29 '20 02:03 BurntSushi

I assumed it was indeed expected, but was just hedging all my bets.

Specifically with regard to #523: that's actually something else I ran into when mitigating this. I was overzealous in replacing my [ ] with [\ ] and ended up with regex errors that didn't make sense; they turned out to be caused by patterns that were not prefixed with (?x). To that end, would you be open to having "\ " (a backslash followed by a space) be an accepted escape in both (?x) and (?-x) modes? (It wouldn't need to be a breaking change since it's currently unaccepted or "reserved" depending on how you squint).

Mar 29 '20 02:03 mqudsi

Thanks for the tip! I've been using regex101 to visualize them in PCRE2 mode and then making changes as needed. I wonder if they're open to a Rust patch (they have golang).

Mar 29 '20 02:03 mqudsi
