Regex syntax parsing of unicode code points is incorrect

Open dtzxporter opened this issue 2 years ago • 3 comments

What version of regex are you using?

Latest

If it isn't the latest version, then please upgrade and check whether the bug is still present.

Describe the bug at a high level.

Because regex_syntax relies on char::from_u32, not all valid Unicode code points are parsed, and this prevents valid regexes from compiling.

Give a brief description of the actual problem you're observing.

image

Rust defines char as a "Unicode scalar value" and explicitly states that it's similar to, but not the same as, a Unicode code point.
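To make the distinction concrete, here is a minimal standard-library sketch (it does not use regex_syntax). The helpers `is_code_point` and `is_surrogate` are hypothetical names introduced only for illustration. It shows that char::from_u32 returns None for the surrogate range U+D800 to U+DFFF, even though those values are Unicode code points:

```rust
// Hypothetical helpers for illustration; they are not part of regex-syntax.

/// Any value in 0..=0x10FFFF is a Unicode code point.
fn is_code_point(n: u32) -> bool {
    n <= 0x10FFFF
}

/// Surrogates are code points, but they are not scalar values.
fn is_surrogate(n: u32) -> bool {
    (0xD800..=0xDFFF).contains(&n)
}

fn main() {
    let samples: [u32; 7] = [0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000];
    for &n in &samples {
        // char::from_u32 succeeds only for Unicode scalar values.
        println!(
            "U+{:04X}: code point = {}, surrogate = {}, char = {:?}",
            n,
            is_code_point(n),
            is_surrogate(n),
            char::from_u32(n)
        );
    }
}
```

So the set of values accepted by char::from_u32 is the set of code points minus the surrogates, which is the gap this report is about.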

The parser is supposed to extract all code points as documented above the function: https://github.com/rust-lang/regex/blob/master/regex-syntax/src/ast/parse.rs#L1611

What is the expected behavior?

I expect this crate to include custom logic for validating code points, instead of relying on char::from_u32, which omits valid code points (surrogate values) because they aren't considered scalar values.

JavaScript and several other regex engines can handle these fine.

Apr 14 '22 19:04 dtzxporter

Please provide an example. Make sure the example includes the desired match semantics. That is, your example should include a call to Regex::find along with the string to search and the expected result.

Apr 14 '22 19:04 BurntSushi

Also, in addition to an example, please describe a use case in which you would use this new feature.

Apr 14 '22 19:04 BurntSushi

Due to inactivity from the OP, it's hard to tell what the actual issue here is.

Now, looking at the docs, the regex crate does use the words "Unicode code point" when the more precisely correct terminology is "Unicode scalar value." It overall doesn't make sense for the regex crate to support literally all code points since it operates in UTF-8 land.
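As a rough illustration of the UTF-8 point, the following standard-library sketch checks two byte sequences. The helper name `is_valid_utf8` is an assumption made for this example. The bytes ED A0 80 are what a generalized encoder would produce for the surrogate U+D800, and Rust's UTF-8 validation rejects them, so a `&str` haystack can never contain such a value:

```rust
/// Hypothetical helper for illustration: does this byte sequence decode as UTF-8?
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // U+D7FF, the last scalar value before the surrogate range.
    let last_before_surrogates = [0xED, 0x9F, 0xBF];
    // The three bytes that would encode the surrogate U+D800.
    let encoded_surrogate = [0xED, 0xA0, 0x80];

    println!("U+D7FF bytes valid: {}", is_valid_utf8(&last_before_surrogates));
    println!("U+D800 bytes valid: {}", is_valid_utf8(&encoded_surrogate));
}
```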

So I'll mark this as a doc bug, although I don't think there is a big problem with saying Unicode code point, as it is less obscure than Unicode scalar value.

May 02 '22 12:05 BurntSushi