Regex syntax parsing of Unicode code points is incorrect
What version of regex are you using?
Latest
If it isn't the latest version, then please upgrade and check whether the bug is still present.
Describe the bug at a high level.
Because regex_syntax relies on char::from_u32, not all valid Unicode code points are parsed, and this prevents valid regexes from compiling.
Give a brief description of the actual problem you're observing.
Rust defines char as a "Unicode scalar value" and explicitly states that this is similar to, but not the same as, a Unicode code point.
The parser is supposed to extract all code points as documented above the function: https://github.com/rust-lang/regex/blob/master/regex-syntax/src/ast/parse.rs#L1611
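To make the distinction concrete, here is a minimal sketch using only the standard library. The helper names `to_scalar` and `is_surrogate` are made up for illustration and are not part of regex-syntax; the sketch only shows which code points `char::from_u32` accepts.

```rust
/// Hypothetical helper: convert a raw code point the way `char::from_u32` does.
/// Returns `None` for surrogates and for values above U+10FFFF.
fn to_scalar(cp: u32) -> Option<char> {
    char::from_u32(cp)
}

/// Hypothetical helper: true for code points in the surrogate range
/// U+D800..=U+DFFF, which are code points but not scalar values.
fn is_surrogate(cp: u32) -> bool {
    (0xD800..=0xDFFF).contains(&cp)
}

fn main() {
    // Sample values on both sides of the surrogate range and of U+10FFFF.
    for cp in [0x61, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000] {
        println!(
            "U+{:04X}: scalar={} surrogate={}",
            cp,
            to_scalar(cp).is_some(),
            is_surrogate(cp)
        );
    }
}
```

Every value from U+D800 through U+DFFF is rejected, which is the behavior described in this report.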
What is the expected behavior?
I expect this crate to include custom logic for validating code points, instead of relying on char::from_u32, which omits valid code points (surrogate values) because they aren't considered scalar values.
JavaScript and several other regex engines can handle these fine.
Please provide an example. Make sure the example includes the desired match semantics. That is, your example should include a call to Regex::find along with the string to search and the expected result.
Also, in addition to an example, please describe a use case in which you would use this new feature.
Due to inactivity from the OP, it's hard to tell what the actual issue here is.
Now, looking at the docs, the regex crate does use the words "Unicode code point" when the more precise terminology is "Unicode scalar value." Overall, it doesn't make sense for the regex crate to support literally all code points, since it operates in UTF-8 land.
So I'll mark this as a doc bug, although I don't think there is a big problem with saying Unicode code point, as it is less obscure than Unicode scalar value.
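For anyone wondering why "UTF-8 land" rules out surrogates, here is a rough sketch using only the standard library. The helpers `encode_3byte_unchecked` and `is_valid_utf8` are hypothetical names for illustration, not anything from the regex crate. It applies the generic three-byte UTF-8 bit layout to a code point without any surrogate check and then asks `std::str::from_utf8` whether the result is valid.

```rust
/// Hypothetical helper: apply the generic 3-byte UTF-8 bit layout
/// (1110xxxx 10xxxxxx 10xxxxxx) to a code point in U+0800..=U+FFFF,
/// deliberately skipping the surrogate check a real encoder performs.
fn encode_3byte_unchecked(cp: u32) -> [u8; 3] {
    [
        0xE0 | ((cp >> 12) & 0x0F) as u8,
        0x80 | ((cp >> 6) & 0x3F) as u8,
        0x80 | (cp & 0x3F) as u8,
    ]
}

/// Hypothetical helper: does the standard library accept these bytes as UTF-8?
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // U+D7FF and U+E000 sit just outside the surrogate range; U+D800 is inside it.
    for cp in [0xD7FF, 0xD800, 0xDFFF, 0xE000] {
        let bytes = encode_3byte_unchecked(cp);
        println!("U+{:04X} -> {:02X?} valid={}", cp, bytes, is_valid_utf8(&bytes));
    }
}
```

The byte sequences produced for surrogates are rejected as invalid UTF-8, so a `&str` haystack can never contain them. That is consistent with treating this as a documentation wording issue.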