
Regular expression pattern `[\w]` does not recognize `_` (underscore)

Open fredgan opened this issue 5 years ago • 2 comments

Hi Martin, pyang reports an error for a YANG module in which a leaf's default value should satisfy the leaf's regex pattern restriction.

As far as I know, in common regular expression flavors \w is equivalent to [a-zA-Z0-9_], but it seems that \w does not match _ in pyang. Here is an example:

YANG module:

module mm {
  yang-version 1.1;
  prefix m1;
  namespace "urn:mm";

  leaf le1 {
    type string {
      pattern "[\\w]+";
    }
    default "Tom_and_Jerry";
  }
}

The result:

$ pyang -v
pyang 2.2.1

$ pyang -f yang mm.yang
mm.yang:10: error: the value "Tom_and_Jerry" does not match its base type - pattern mismatch for pattern defined at mm.yang:8
module mm {
  yang-version 1.1;
  prefix m1;
  namespace "urn:mm";

  leaf le1 {
    type string {
      pattern '[\w]+';
    }
    default "Tom_and_Jerry";
  }
}

If I change [\\w]+ to [\\w_]+, the error disappears. I then dug into the code and found that validate_pattern_expr relies on the lxml library.

Refer to XML Schema Regular Expressions.

fredgan avatar Apr 10 '20 08:04 fredgan

See https://www.w3.org/TR/xmlschema-2/#regexs, where \w is defined as:

[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of "punctuation", "separator" and "other" characters)

It seems that libxml2 classifies _ as a punctuation character (see the file xmlunicode.c in libxml2 src).
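To make that definition concrete, here is a rough Python sketch that approximates the XSD meaning of \w using the standard unicodedata module. The helper names is_xsd_word_char and matches_xsd_word_plus are hypothetical, made up for this illustration; this is not how pyang or libxml2 implement the check, only an emulation of the rule quoted above.

```python
import unicodedata


def is_xsd_word_char(ch: str) -> bool:
    """Approximate XSD \\w: any character whose Unicode general category
    is not punctuation (P*), separator (Z*), or other (C*).

    Hypothetical helper for illustration only.
    """
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")


def matches_xsd_word_plus(value: str) -> bool:
    """Emulate the pattern [\\w]+ : one or more XSD word characters."""
    return len(value) > 0 and all(is_xsd_word_char(ch) for ch in value)


# The underscore's general category starts with "P" (punctuation),
# so it falls outside the approximated XSD \w set.
print(unicodedata.category("_"))
print(matches_xsd_word_plus("Tom_and_Jerry"))
print(matches_xsd_word_plus("TomAndJerry"))
```

Under this approximation the default value from the report is rejected only because of the underscores, which is consistent with the error going away when _ is added to the character class explicitly.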

mbj4668 avatar Jul 01 '20 08:07 mbj4668

Interesting. https://www.regular-expressions.info/shorthand.html (which is not definitive but does seem to be quite authoritative) says this:

\w stands for “word character”. It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits. In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren’t digits may or may not be included. XML Schema and XPath even include all symbols in \w. Again, Java, JavaScript, and PCRE match only ASCII characters with \w.

This seems to be acknowledging that underscore is a "connector punctuation" character but nevertheless asserting that it's always included. I guess this is just wrong?

Update: To be clear, "underscore" (U+005F) is "low line" (I didn't know this!). https://www.fileformat.info/info/unicode/char/005f/index.htm
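For anyone who wants to check this locally, a small sketch using only the Python standard library looks up the character's Unicode name and general category, and contrasts that with Python's own re module, whose \w does match the underscore. This says nothing about libxml2 itself; it only illustrates how the Unicode classification and a typical non-XSD regex flavor differ. The variable names are arbitrary.

```python
import re
import unicodedata

underscore = "\u005f"

# Official Unicode character name and general category code for U+005F.
name = unicodedata.name(underscore)
category = unicodedata.category(underscore)

# Python's re flavor treats the underscore as a word character.
python_w_matches = re.fullmatch(r"\w", underscore) is not None

print(name, category, python_w_matches)
```

Here the category code Pc stands for "Punctuation, connector", which is why a definition of \w that subtracts all of \p{P} excludes the underscore, while flavors such as Python's add it back explicitly.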

wlupton avatar Jul 01 '20 09:07 wlupton