Currency symbol abbreviation "Sc" is not a valid property

Open aikow opened this issue 3 years ago • 1 comment

What version of regex are you using?

1.5.4

Describe the bug at a high level.

The Unicode abbreviation for the currency symbol general category (Sc), as defined in https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt, is not accepted and produces a regex parse error.

What are the steps to reproduce the behavior?

Add the regex crate to the Cargo.toml file:

File: Cargo.toml

[dependencies]
regex = "1.5.4"

File: main.rs

use regex::Regex;

fn main() {
    let regex = Regex::new(r"\p{Sc}").unwrap();
    if regex.is_match("$") {
        println!("Hello, world!");
    }
}

What is the actual behavior?

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
    \p{Sc}
    ^^^^^^
error: Unicode property not found
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)', src/main.rs:4:39
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

What is the expected behavior?

Expected the regex to compile and return the "$" as a match.

aikow, Feb 03 '22 09:02

TL;DR - It's a bug and you can work around it for now by using \p{gc=Sc} (or \p{gc:Sc}) to force the class parser to treat Sc as a general category. Although, personally, I would use \p{CurrencySymbol}, since I think that's clearer.

OK, so Sc is an abbreviation for the general category called Currency Symbol, and that's defined by Unicode in UTS#18. So first question is, do we have that general category and abbreviation in our Unicode tables? Yes and yes.

The next thing I noticed is that sc is actually a way to explicitly specify "script." So, e.g., \p{Greek} and \p{sc=Greek} are equivalent. So my thinking now is that the special treatment of sc is messing with the general category lookup. I believe this is where cases like \p{xxx} are handled, which to me implies that something is going wrong in the canonical_binary method.

... but that all looks right to me. And I don't see where any special handling of sc might be getting in the way. So perhaps I've assumed wrong somewhere. But this is just all by inspection. I write this down in case someone else wants to take a look at fixing this.
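To make the suspicion above concrete, here is a toy model of how an alias that exists in two namespaces could shadow a general category lookup. It is only an illustration of the hypothesis: the tables, the Resolved enum and the functions resolve_bare and resolve_gc_value are all invented for this sketch, and nothing here is taken from the regex crate's source or confirmed as the actual cause.

```rust
// Toy alias tables. Hypothetical data for illustration only; these are
// NOT the regex crate's real tables.
const PROPERTY_ALIASES: &[(&str, &str)] = &[
    ("gc", "General_Category"),
    ("sc", "Script"),
];
const GENCAT_ALIASES: &[(&str, &str)] = &[
    ("sc", "Currency_Symbol"),
    ("currencysymbol", "Currency_Symbol"),
    ("lu", "Uppercase_Letter"),
];

#[derive(Debug, PartialEq)]
enum Resolved {
    GeneralCategory(&'static str),
    /// The name matched a property that cannot be used without a value.
    NeedsValue(&'static str),
    NotFound,
}

/// Loose matching: lowercase and drop spaces, underscores and hyphens.
fn normalize(name: &str) -> String {
    name.chars()
        .filter(|c| !matches!(*c, ' ' | '_' | '-'))
        .map(|c| c.to_ascii_lowercase())
        .collect()
}

fn lookup(table: &[(&'static str, &'static str)], key: &str) -> Option<&'static str> {
    table
        .iter()
        .find(|(alias, _)| *alias == key)
        .map(|(_, canonical)| *canonical)
}

/// Models a bare `\p{name}` lookup that consults property names first.
/// Under that (assumed) order, "sc" never reaches the category table.
fn resolve_bare(name: &str) -> Resolved {
    let norm = normalize(name);
    if let Some(prop) = lookup(PROPERTY_ALIASES, &norm) {
        return Resolved::NeedsValue(prop);
    }
    match lookup(GENCAT_ALIASES, &norm) {
        Some(gc) => Resolved::GeneralCategory(gc),
        None => Resolved::NotFound,
    }
}

/// Models `\p{gc=value}`: the property is already known, so only the
/// general category table is consulted.
fn resolve_gc_value(value: &str) -> Resolved {
    match lookup(GENCAT_ALIASES, &normalize(value)) {
        Some(gc) => Resolved::GeneralCategory(gc),
        None => Resolved::NotFound,
    }
}

fn main() {
    println!("bare Sc:              {:?}", resolve_bare("Sc"));
    println!("gc=Sc:                {:?}", resolve_gc_value("Sc"));
    println!("bare Currency_Symbol: {:?}", resolve_bare("Currency_Symbol"));
}
```

In this model the bare abbreviation is captured by the script alias, while the gc= form and the long name both reach the general category, which matches the behavior of the workaround. Whether the real parser fails for the same reason is exactly what remains to be checked.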

BurntSushi, Feb 03 '22 13:02