regex
Currency symbol abbreviation "Sc" is not a valid property
What version of regex are you using?
1.5.4
Describe the bug at a high level.
The Unicode abbreviation for the currency symbol general category, as defined here https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt, is not accepted and gives a regex parse error.
What are the steps to reproduce the behavior?
Add the regex crate to the Cargo.toml file, then compile and run the program below.
File: Cargo.toml
[dependencies]
regex = "1.5.4"
File: main.rs
use regex::Regex;

fn main() {
    let regex = Regex::new(r"\p{Sc}").unwrap();
    if regex.is_match("$") {
        println!("Hello, world!");
    }
}
What is the actual behavior?
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
\p{Sc}
^^^^^^
error: Unicode property not found
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)', src/main.rs:4:39
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
What is the expected behavior?
Expected the regex to compile and to match "$".
TL;DR - It's a bug and you can work around it for now by using \p{gc=Sc} (or \p{gc:Sc}) to force the class parser to treat Sc as a general category. Although, personally, I would use \p{CurrencySymbol}, since I think that's clearer.
OK, so Sc is an abbreviation for the general category called Currency Symbol, and that's defined by Unicode in UTS#18. So the first question is, do we have that general category and abbreviation in our Unicode tables? Yes and yes.
The next thing I noticed is that sc is actually a way to explicitly specify a script (it is the abbreviation for the Script property; Script_Extensions is scx). So, e.g., \p{Greek} and \p{sc=Greek} are equivalent. So my thinking now is that the special treatment of sc is messing with the general category lookup. I believe this is where cases like \p{xxx} are handled, which to me implies that something is going wrong in the canonical_binary method.
... but that all looks right to me. And I don't see where any special handling of sc might be getting in the way. So perhaps I've assumed wrong somewhere. But this is all just by inspection. I'm writing this down in case someone else wants to take a look at fixing this.
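
To make the suspicion above concrete, here is a toy model of how a bare name like Sc could be shadowed if property names were consulted before general categories. This is not regex-syntax's code: the tables, the Resolved enum and both resolve_* functions are hypothetical, and the lookup order is an assumption, not something confirmed by inspection. The only facts relied on are that sc abbreviates the Script property and Sc abbreviates the Currency_Symbol general category.

```rust
/// What a bare name inside \p{...} resolved to in this toy model.
#[derive(Debug, PartialEq)]
enum Resolved {
    /// A property name such as Script; on its own it still needs a value,
    /// so in this model a bare \p{sc} could not form a class.
    PropertyName(&'static str),
    /// A general category value such as Currency_Symbol.
    GeneralCategory(&'static str),
}

/// Loose normalization: ASCII-lowercase, dropping spaces, '_' and '-'.
fn normalize(name: &str) -> String {
    name.chars()
        .filter(|c| !matches!(*c, ' ' | '_' | '-'))
        .map(|c| c.to_ascii_lowercase())
        .collect()
}

/// Tiny stand-in for a table of property names (hypothetical subset).
fn property_name(key: &str) -> Option<&'static str> {
    match key {
        "sc" | "script" => Some("Script"),
        "scx" | "scriptextensions" => Some("Script_Extensions"),
        "gc" | "generalcategory" => Some("General_Category"),
        _ => None,
    }
}

/// Tiny stand-in for a table of general category values (hypothetical subset).
fn general_category(key: &str) -> Option<&'static str> {
    match key {
        "sc" | "currencysymbol" => Some("Currency_Symbol"),
        "lu" | "uppercaseletter" => Some("Uppercase_Letter"),
        _ => None,
    }
}

/// Assumed problematic order: property names win, so "Sc" never reaches
/// the general category table.
fn resolve_names_first(name: &str) -> Option<Resolved> {
    let key = normalize(name);
    property_name(&key)
        .map(Resolved::PropertyName)
        .or_else(|| general_category(&key).map(Resolved::GeneralCategory))
}

/// Alternative order: general categories win, so "Sc" means Currency_Symbol.
fn resolve_categories_first(name: &str) -> Option<Resolved> {
    let key = normalize(name);
    general_category(&key)
        .map(Resolved::GeneralCategory)
        .or_else(|| property_name(&key).map(Resolved::PropertyName))
}
```

In this model, resolve_names_first("Sc") yields the Script property name, while resolve_categories_first("Sc") yields Currency_Symbol. The long form Currency_Symbol resolves to the general category under either order, which would be consistent with \p{CurrencySymbol} working as a workaround.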