Fix PCRE with UTF-8 data on Windows
- regex-pcre has never worked with UTF-8 data due to #141 (and it was never guaranteed).
- Currently it is not working on Windows (at least on AppVeyor), and the Windows UTF-8/PCRE tests have been suspended.
- The current method of fixing up the offsets in regex is hacky and inefficient.
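For context on why offsets need fixing up at all: PCRE reports match offsets in bytes of the UTF-8 encoded subject, while String and Text are indexed by character, so the two disagree as soon as a multi-byte character precedes the match. Below is a minimal sketch of the conversion using only base. The names `utf8Width` and `byteToCharOffset` are made up for illustration and are not taken from regex or regex-pcre.

```haskell
import Data.Char (ord)

-- Number of bytes a character occupies when encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Convert a UTF-8 byte offset into a character offset for the given string.
-- Returns Nothing if the byte offset lands inside a multi-byte character
-- or lies beyond the end of the string.
byteToCharOffset :: String -> Int -> Maybe Int
byteToCharOffset s target = go 0 0 s
  where
    go chars bytes rest
      | bytes == target = Just chars
      | bytes > target  = Nothing
      | otherwise = case rest of
          []     -> Nothing
          (c:cs) -> go (chars + 1) (bytes + utf8Width c) cs

main :: IO ()
main = do
  -- U+FB06 (decimal 64262) is a ligature that takes 3 bytes in UTF-8.
  let s = "a fir\64262 hello to everyone"
  print (byteToCharOffset s 9)  -- Just 7
  print (take 5 (drop 7 s))     -- "hello" (character offset)
  print (take 5 (drop 9 s))     -- "llo t" (byte offset misused as a character offset)
```

Using the byte offset 9 directly as a character index selects "llo t" instead of "hello", which is the kind of shifted capture you would expect from uncorrected offsets. A linear scan per offset like this is simple but costs O(n) for each match, which may be part of why the existing fix-up is considered inefficient.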
BTW, this issue was reported at regex-pcre-builtin.
I ran into issues using PCRE.Text in the presence of Unicode ligatures. The platform is Windows.
*Main Lib Text.RE.PCRE.Text> "a first hello to everyone" *=~/ [ed|$(hello)///"$1"|]
"a first \"hello\" to everyone" -- OK
*Main Lib Text.RE.PCRE.Text> "a firﬆ hello to everyone" *=~/ [ed|$(hello)///"$1"|]
"a fir\64262 \"llo t\" to everyone" -- Uh oh
ByteString sort of works, but it looks like it chews up my ligature:
*Main Lib Text.RE.PCRE.ByteString> "a firﬆ hello to everyone" *=~/ [ed|$(hello)///"$1"|]
"a fir\ACK \"hello\" to everyone"
And String just crashes:
*Main Lib Text.RE.PCRE.String> "a firﬆ hello to everyone" *=~/ [ed|$(hello)///"$1"|]
"*** Exception: utf8_correct_bs: UTF-8 decoding error
CallStack (from HasCallStack):
error, called at .\Text\RE\ZeInternals\Types\Match.lhs:248:13 in regex-1.1.0.0-H1FPxX1khLGKIhuhwowTFL:Text.RE.ZeInternals.Types.Match
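A note on the ByteString result above: the `\ACK` is consistent with the string literal being packed one byte per character, keeping only the low 8 bits of each code point, rather than being UTF-8 encoded. The sketch below reproduces that arithmetic with base only; `truncateTo8Bits` and `utf8Bytes` are illustrative names, not functions from bytestring or regex.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Keep only the low 8 bits of a code point, as a Char8-style pack does.
truncateTo8Bits :: Char -> Word8
truncateTo8Bits = fromIntegral . ord

-- Encode a single character as UTF-8 bytes.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading bits of the code point, placed in the first byte.
    hi k = fromIntegral (n `shiftR` k)
    -- A continuation byte carrying 6 bits of the code point.
    cont k = 0x80 .|. fromIntegral ((n `shiftR` k) .&. 0x3F)

main :: IO ()
main = do
  let lig = '\64262'  -- U+FB06, the ligature from the transcripts
  print (truncateTo8Bits lig)                       -- 6
  print (chr (fromIntegral (truncateTo8Bits lig)))  -- '\ACK'
  print (utf8Bytes lig)                             -- [239,172,134]
```

So the ligature is already lost before PCRE sees the input. Building the ByteString with an explicit UTF-8 encoder (for example `Data.Text.Encoding.encodeUtf8` from the text package) instead of a plain string literal should preserve it, though I have not checked how the ByteString replacement path behaves with such input.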
This does work correctly in the TDFA module; however, my use case requires non-greedy matching, which only appears to be supported by PCRE. My current workaround is to use TDFA where I can and then manual non-regex search and replace where I require non-greedy behavior.
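For anyone wanting a similar stopgap, here is a rough sketch of what a manual non-greedy search can look like when the delimiters are literal strings. It uses only base and works on characters, so ligatures are not a problem. `breakOn` and `shortestBetween` are hypothetical helpers written for this comment, not part of regex.

```haskell
import Data.List (isPrefixOf)

-- Split a string at the first occurrence of a literal needle, returning
-- the text before it and the text after it. Nothing if the needle is absent.
breakOn :: String -> String -> Maybe (String, String)
breakOn needle = go []
  where
    go acc rest
      | needle `isPrefixOf` rest = Just (reverse acc, drop (length needle) rest)
      | otherwise = case rest of
          []     -> Nothing
          (c:cs) -> go (c : acc) cs

-- Text between the first `open` and the nearest following `close`,
-- which is what a non-greedy pattern like open.*?close would capture.
shortestBetween :: String -> String -> String -> Maybe String
shortestBetween open close s = do
  (_, afterOpen) <- breakOn open s
  (inner, _)     <- breakOn close afterOpen
  pure inner

main :: IO ()
main = do
  print (shortestBetween "<" ">" "x <a> y <b> z")         -- Just "a"
  print (shortestBetween "[" "]" "fir\64262 [one] [two]") -- Just "one"
```

A greedy match on the first input would capture `a> y <b` instead of `a`. This only covers literal delimiters, so it is not a general substitute for PCRE's lazy quantifiers.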