regex parser permits '(?-u)\W' when UTF-8 mode is enabled
When you negate a character class while Unicode mode is disabled, the negation includes every byte that isn't in the class. That is, character classes are defined over bytes in that mode; the only way to write a character class over codepoints is to have Unicode mode enabled.
Usually, disabling Unicode means reducing the number of features you can use. For example, (?-u)\pL will fail with a parse error because \pL is fundamentally a Unicode construct with no "ASCII-only" interpretation. However, the "Perl" character classes (\w, \d and \s) all revert to their corresponding ASCII definitions when Unicode mode is disabled.
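For example, a small check along these lines shows the reversion (just a sketch, assuming the regex crate's default settings; é is a non-ASCII letter):
fn main() {
    // Unicode mode is enabled by default, so \w matches any Unicode word character.
    assert!(regex::Regex::new(r"\w").unwrap().is_match("é"));
    // With Unicode mode disabled, \w falls back to its ASCII definition, so 'é' does not match.
    assert!(!regex::Regex::new(r"(?-u)\w").unwrap().is_match("é"));
    // \pL has no ASCII-only meaning, so disabling Unicode turns it into a parse error.
    assert!(regex::Regex::new(r"(?-u)\pL").is_err());
}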
That's all fine. The negated "Perl" character classes (\W, \D and \S) likewise revert to their ASCII definitions, which is also correct.
But when you use something like \W while Unicode mode is disabled, it includes bytes that can never appear in valid UTF-8 (like \xFF, since it isn't a word "character"). This should cause the regex parser to return an error, because the parser is supposed to guarantee that you can't build a regex that matches invalid UTF-8 while UTF-8 mode is enabled, regardless of whether Unicode mode is enabled.
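For comparison, that guarantee already kicks in for an explicit byte escape; a sketch of the expected behavior (assuming the default, UTF-8-enabled Regex API):
fn main() {
    // An explicit non-ASCII byte can only match invalid UTF-8, so the default
    // Regex API rejects it at parse time.
    assert!(regex::Regex::new(r"(?-u)\xFF").is_err());
    // By the same reasoning (?-u)\W ought to be rejected too, since its byte
    // class includes bytes like \xFF -- but as shown below, it is accepted.
    assert!(regex::Regex::new(r"(?-u)\W").is_ok());
}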
Case in point, this code:
fn main() {
    let re = regex::Regex::new(r"(?-u)\W").unwrap();
    println!("{:?}", re.find("☃"));
}
outputs:
Some(Match { text: "☃", start: 0, end: 1 })
This is clearly wrong. Attempting to slice "☃" at the range 0..1 will result in a panic, since those offsets fall inside the snowman's multi-byte encoding. The top-level Regex API is never supposed to return match offsets that would make a subslice operation panic; that is, match offsets must always fall on valid UTF-8 code unit boundaries.
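To make the panic concrete (a minimal snippet; ☃ is U+2603, which is three bytes in UTF-8):
fn main() {
    let snowman = "☃";
    // Byte index 1 lands in the middle of the snowman's three-byte encoding,
    // so this slice panics: byte index 1 is not a char boundary.
    let _ = &snowman[0..1];
}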
I decided to investigate this issue. It appears to be a problem with the DFA matching engine, since that's the engine selected as the MatchType. I'll keep looking into it; maybe the DFA implementation has a rogue typo somewhere.
Correction: it's probably neither the DFA nor this specific implementation of it.
I'm new to contributing, so please forgive me for any mistakes.
Aye yeah this is a bug in regex-syntax, not the matching engines. That is, the parser should reject the regex as-is (unless UTF-8 mode is disabled, but it is enabled by default).
Thank you, I'll take a look there instead.
More specifically, the bug is likely somewhere in the translator (that is, the AST->HIR code), and somehow these cases are slipping through. It also suggests that whatever detection logic exists for rejecting instructions that can match invalid UTF-8 is not robust enough. There really should be one place in the code that completely prevents any byte class that could match invalid UTF-8. That is, any byte class containing bytes outside the ASCII range should be rejected when UTF-8 mode is enabled (see the sketch below).
Note that this is all just conjecture based on memory. I haven't actually looked at the code here in a while.
Thanks for taking a look! :)
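As a rough sketch of the check described above (made-up types for illustration only, not regex-syntax's actual internals):
// Hypothetical check: when UTF-8 mode is enabled, a byte class is only
// acceptable if every byte it contains is ASCII. Anything else could take
// part in a match over invalid UTF-8 and should be rejected by the translator.
fn byte_class_is_utf8_safe(ranges: &[(u8, u8)]) -> bool {
    ranges.iter().all(|&(_, end)| end <= 0x7F)
}

fn main() {
    // The ASCII definition of \d stays within ASCII, so it's fine.
    assert!(byte_class_is_utf8_safe(&[(b'0', b'9')]));
    // The byte class for (?-u)\W runs all the way up to 0xFF, so it must be rejected.
    let not_word = [(0x00, 0x2F), (0x3A, 0x40), (0x5B, 0x5E), (0x60, 0x60), (0x7B, 0xFF)];
    assert!(!byte_class_is_utf8_safe(&not_word));
}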
This is likely a duplicate of #738. Or at least, part of #738.