
Fix PCRE with UTF-8 data on Windows

Open cdornan opened this issue 8 years ago • 2 comments

  • regex-pcre has never worked with UTF-8 data due to #141 (and it was never guaranteed).

  • Currently it is not working on Windows (at least on AppVeyor) and the Windows UTF-8/PCRE tests have been suspended.

  • The current method of fixing up the offsets in regex is hacky and inefficient.
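PCRE reports match positions as byte offsets into the subject, so a UTF-8 subject needs those offsets translated to character offsets before they can be used on a String or Text. The following is only a minimal sketch of that translation, not the code regex currently uses; `utf8Width` and `byteToCharOffset` are hypothetical names chosen for illustration.

```haskell
import Data.Char (ord)

-- Number of bytes a Char occupies when encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Translate a byte offset into the UTF-8 encoding of a string into a
-- character offset. Returns Nothing when the byte offset is negative,
-- past the end, or falls inside a multi-byte sequence.
byteToCharOffset :: String -> Int -> Maybe Int
byteToCharOffset s target = go 0 0 s
  where
    go chars bytes rest
      | bytes == target = Just chars
      | bytes >  target = Nothing
      | otherwise = case rest of
          []       -> Nothing
          (c : cs) -> go (chars + 1) (bytes + utf8Width c) cs
```

This walks the string once per offset, so translating many offsets this way is linear in the subject length each time; a real fix would probably want to translate all capture offsets in a single pass.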

cdornan avatar Jun 04 '17 12:06 cdornan

BTW, this issue was reported at regex-pcre-builtin.

cdornan avatar Jun 08 '17 15:06 cdornan

I ran into issues using PCRE.Text in the presence of Unicode ligatures. Platform is Windows.

*Main Lib Text.RE.PCRE.Text> "a first hello to everyone"  *=~/ [ed|$(hello)///"$1"|]
"a first \"hello\" to everyone"  -- OK
*Main Lib Text.RE.PCRE.Text> "a firﬆ hello to everyone"  *=~/ [ed|$(hello)///"$1"|]
"a fir\64262 \"llo t\" to everyone"  -- Uh oh
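The `llo t` in that output is what you would get if a byte offset were applied as a character offset. A small sketch that reproduces the slice, assuming the subject contains U+FB06 (the `\64262` shown above) and that the match for `hello` was reported at byte offset 9 with length 5; `subject` and `sliceChars` are hypothetical names used only for this illustration.

```haskell
-- The subject with the "st" ligature (U+FB06), which takes 3 bytes in UTF-8.
subject :: String
subject = "a fir\64262 hello to everyone"

-- Take len characters starting at character offset off.
sliceChars :: Int -> Int -> String -> String
sliceChars off len = take len . drop off

-- "hello" starts at character offset 7, but at byte offset 9 in UTF-8.
correctSlice :: String
correctSlice = sliceChars 7 5 subject

-- Using the byte offset as if it were a character offset lands 2 too far.
shiftedSlice :: String
shiftedSlice = sliceChars 9 5 subject
```

Here `correctSlice` is `"hello"` and `shiftedSlice` is `"llo t"`, which matches the two-position shift seen in the output.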

ByteString sort of works, but it looks like it chews up my ligature:

*Main Lib Text.RE.PCRE.ByteString> "a firﬆ hello to everyone"  *=~/ [ed|$(hello)///"$1"|]
"a fir\ACK \"hello\" to everyone"
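One plausible explanation for the `\ACK` (an assumption, not something confirmed in this thread): packing a Char into a Char8-style ByteString keeps only its low 8 bits, and the low byte of U+FB06 is 0x06, which is ACK. The sketch below mimics that truncation with base only; `truncateToByte` is a hypothetical helper, not a bytestring function.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Keep only the low 8 bits of a Char's code point, the way a
-- Char8-style pack would, and turn the result back into a Char.
truncateToByte :: Char -> Char
truncateToByte c = chr (fromIntegral lowByte)
  where
    lowByte :: Word8
    lowByte = fromIntegral (ord c)
```

With this, `truncateToByte '\64262'` gives `'\ACK'`, while ASCII characters pass through unchanged, so the rest of the string still matches.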

And String just crashes:

*Main Lib Text.RE.PCRE.String> "a firﬆ hello to everyone"  *=~/ [ed|$(hello)///"$1"|]
"*** Exception: utf8_correct_bs: UTF-8 decoding error
CallStack (from HasCallStack):
  error, called at .\Text\RE\ZeInternals\Types\Match.lhs:248:13 in regex-1.1.0.0-H1FPxX1khLGKIhuhwowTFL:Text.RE.ZeInternals.Types.Match

This does work correctly in the TDFA module; however, my use case requires non-greedy matching, which only appears to be supported by PCRE. My current workaround is to use TDFA where I can and manual non-regex search and replace where I require non-greedy behavior.

goertzenator avatar Aug 20 '20 13:08 goertzenator