regex Regex syntax parsing of unicode code points is incorrect

Open dtzxporter opened this issue 2 years ago • 3 comments

What version of regex are you using?

Latest

If it isn't the latest version, then please upgrade and check whether the bug is still present.

Describe the bug at a high level.

Because regex_syntax relies on char::from_u32, not all valid Unicode code points are parsed, and this prevents valid regexes from compiling.

Give a brief description of the actual problem you're observing.

Rust defines char as a "Unicode scalar value" and explicitly states that this is similar to, but not the same as, a Unicode code point.
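To make the distinction concrete, here is a minimal sketch using only the standard library. The helper names (`is_code_point`, `is_surrogate`, `is_scalar_value`) are made up for illustration and are not part of regex or regex-syntax. It shows that char::from_u32 accepts every code point except the surrogate range U+D800 to U+DFFF:

```rust
/// Hypothetical helper: any value in 0..=0x10FFFF is a Unicode code point.
fn is_code_point(cp: u32) -> bool {
    cp <= 0x10FFFF
}

/// Hypothetical helper: surrogates are code points reserved for UTF-16.
fn is_surrogate(cp: u32) -> bool {
    (0xD800..=0xDFFF).contains(&cp)
}

/// A scalar value is exactly what `char` can hold.
fn is_scalar_value(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

fn main() {
    for cp in [0x41u32, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000] {
        println!(
            "U+{:04X}: code point = {}, surrogate = {}, scalar value = {}",
            cp,
            is_code_point(cp),
            is_surrogate(cp),
            is_scalar_value(cp)
        );
    }
}
```

In other words, the scalar values are the code points minus the surrogates, which is the gap this report is about.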

The parser is supposed to extract all code points as documented above the function: https://github.com/rust-lang/regex/blob/master/regex-syntax/src/ast/parse.rs#L1611

What is the expected behavior?

I expect this crate to include custom logic for validating code points, instead of relying on char::from_u32, which omits valid code points (the surrogate values) because they aren't considered scalar values.

JavaScript and several other regex engines can handle these fine.

Apr 14 '22 19:04 dtzxporter

Please provide an example. Make sure the example includes the desired match semantics. That is, your example should include a call to Regex::find along with the string to search and the expected result.

Apr 14 '22 19:04 BurntSushi

Also, in addition to an example, please describe a use case in which you would use this new feature.

Apr 14 '22 19:04 BurntSushi

Due to inactivity from the OP, it's hard to tell what the actual issue here is.

Now, looking at the docs, the regex crate does use the words "Unicode code point" when the more precisely correct terminology is "Unicode scalar value." It overall doesn't make sense for the regex crate to support literally all code points since it operates in UTF-8 land.
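As a rough illustration of the "UTF-8 land" point, the standard-library sketch below applies the generic three-byte UTF-8 bit pattern to a surrogate and shows that the result is rejected as invalid UTF-8, so it could never appear in a &str haystack. The helper `three_byte_pattern` is a hypothetical name introduced here and is not part of the regex crate:

```rust
/// Hypothetical helper: apply the generic three-byte UTF-8 bit layout
/// (1110xxxx 10xxxxxx 10xxxxxx) to a value in 0x0800..=0xFFFF,
/// without checking whether the value is a surrogate.
fn three_byte_pattern(cp: u32) -> [u8; 3] {
    [
        0xE0 | (cp >> 12) as u8,
        0x80 | ((cp >> 6) & 0x3F) as u8,
        0x80 | (cp & 0x3F) as u8,
    ]
}

fn main() {
    // U+D7FF is the last scalar value before the surrogate range.
    let valid = three_byte_pattern(0xD7FF);
    println!("{:X?} -> {:?}", valid, std::str::from_utf8(&valid));

    // U+D800 is a surrogate: the same bit layout yields bytes that
    // the UTF-8 validator refuses.
    let surrogate = three_byte_pattern(0xD800);
    println!("{:X?} -> {:?}", surrogate, std::str::from_utf8(&surrogate));
}
```

Since a valid &str can never contain those bytes, a pattern naming a surrogate would have nothing to match against when searching UTF-8 text.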

So I'll mark this as a doc bug, although I don't think there is a big problem with saying Unicode code point, as it is less obscure than Unicode scalar value.

May 02 '22 12:05 BurntSushi
