
Incorrect Explanation for \W (python)

Open byronharvey opened this issue 3 years ago • 1 comment

Bug Description

For Python 2.7, when I enter r"\W" into the regular expression bar, the explanation field tells me that "\W matches any non-word character (equivalent to [^a-zA-Z0-9_])". This is not true, because é, for example, does not match "\W" but does match "[^a-zA-Z0-9_]".

Reproduction steps

Enter "\W" in the expression field

In the test field enter é

Change the expression field to "[^a-zA-Z0-9_]"

Observe that "\W" does not highlight é and "[^a-zA-Z0-9_]" does.
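For comparison outside the site, here is a minimal sketch of the same check using the standard `re` module. It assumes Python 3 (not the 2.7 flavor selected above), where `str` patterns are Unicode-aware by default; the helper name `matches` is made up for illustration.

```python
import re

# U+00E9 (precomposed é), a letter outside the ASCII range
sample = "\u00e9"

def matches(pattern, text, flags=0):
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text, flags) is not None

# Python 3 str patterns are Unicode-aware by default,
# so é counts as a word character and \W does not match it.
unicode_W = matches(r"\W", sample)

# The explicit ASCII class treats é as a non-word character.
ascii_class = matches(r"[^a-zA-Z0-9_]", sample)

# With re.ASCII, \W falls back to the ASCII-only definition.
ascii_W = matches(r"\W", sample, re.ASCII)

print(unicode_W, ascii_class, ascii_W)  # False True True
```

So whether `\W` is equivalent to `[^a-zA-Z0-9_]` depends on whether Unicode matching is in effect, which is the distinction the explanation text misses.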

Expected Outcome

The explanation should make note that "\W" is not equivalent to "[^a-zA-Z0-9_]" in all cases, particularly those dealing with accented characters commonly found in other languages.

Browser

Chrome latest (89.0)

OS

macOS Big Sur

byronharvey avatar May 12 '21 21:05 byronharvey

You're entirely correct. This happens because the website emulates Python with PCRE. A better 'test' is to use \w, which matches é by default (read: without the /u modifier). This means that Python is always being emulated with PCRE's Unicode switch turned on.

For PCRE, the website's explanation of \W with /u is "\W match any non-word character in any script (equivalent to [^\p{L}\p{N}_])", which is probably correct, but Python 2.7 does not seem to support \p{}. If the class can't be expressed in terms of \p{}, listing every character/code point that \W matches in Unicode mode would not be ideal. I'm not sure about newer versions of Python. Anyway, this can be fixed easily, but note that Python needs some love overall on the website.
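On the \p{} point, a rough sketch for checking a local interpreter. It assumes Python 3 and the standard `re` module; the helper names `supports_property_escapes` and `is_word_char` are hypothetical. As far as I know, the Python 3 docs describe Unicode \w as the alphanumerics per `str.isalnum()` plus the underscore, so that wording (rather than a \p{} class) may be a workable explanation; the sample list below is only a spot check, not a proof.

```python
import re

def supports_property_escapes():
    """Report whether this interpreter's re module accepts \\p{...}."""
    try:
        re.compile(r"\p{L}")
    except re.error:
        return False
    return True

def is_word_char(ch):
    """Assumed definition of Unicode \\w: underscore or str.isalnum()."""
    return ch == "_" or ch.isalnum()

# A few ASCII and non-ASCII characters to spot-check.
samples = ["a", "Z", "7", "_", "\u00e9", "\u00df", "-", " ", "!"]

# True if \w and the assumed definition agree on every sample.
agreement = all(
    (re.fullmatch(r"\w", ch) is not None) == is_word_char(ch)
    for ch in samples
)

print("\\p{} supported:", supports_property_escapes())
print("\\w agrees with isalnum()/underscore on samples:", agreement)
```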

Ouims avatar May 16 '21 21:05 Ouims