missing matches in RE alternative cause UTF-8 decode error
If a named capture appears inside an alternative, an error is thrown:
$ ghci
Prelude> import Text.RE.PCRE.String
Prelude PCRE.String> r = [re|foo(A${here}(.*)B|C${there}(.*)D)|]
Prelude PCRE.String> allMatches ("foobar" *=~ r)
[]
Prelude PCRE.String> allMatches ("fooAoneB" *=~ r)
[ Match {matchSource = "fooAoneB", .... *** Exception: utf8_correct_bs: UTF-8 decoding error
CallStack (from HasCallStack):
error, called at ./Text/RE/ZeInternals/Types/Match.lhs:248:13 in regex-1.0.2.0-CuYMcTBVvnH4p7K8LCU2iN:Text.RE.ZeInternals.Types.Match
Prelude PCRE.String> allMatches ("fooCtwoD" *=~ r)
[ Match {matchSource = "fooCtwoD", ... [same error]
This seems to be related to the branch where the match is not found:
PCRE.String> r = [re|foo(A${here}(.*)B|CD)|]
PCRE.String> allMatches ("foobar" *=~ r)
[]
PCRE.String> allMatches ("fooAbarB" *=~ r)
... valid match, no error ...
PCRE.String> allMatches ("fooCD" *=~ r)
... error as above...
It's possible this is an invalid usage on my part, but I would expect a different type of error than a UTF-8 decoding error. Additionally, I originally had the same match name on both alternatives and got the same error, so I should have had a valid match regardless of which alternative matched.
regex version 1.0.2.0
The bug is still present:
> import Text.RE.PCRE.Text
Text.RE.PCRE.Text> urlRegex = [re|^https?:\/\/.+\/(\w+)(?:\.(\w+))?(?:[\?|#].*)?$|]
Text.RE.PCRE.Text> "a" ?=~ urlRegex
Match {matchSource = "a", captureNames = fromList [], matchArray = array (CaptureOrdinal {getCaptureOrdinal = 1},CaptureOrdinal {getCaptureOrdinal = 0}) []}
Text.RE.PCRE.Text> "https://a/b/c/d" ?=~ urlRegex
Match {matchSource = "https://a/b/c/d", captureNames = fromList [], matchArray = array (CaptureOrdinal {getCaptureOrdinal = 0},CaptureOrdinal {getCaptureOrdinal = 2}) [(CaptureOrdinal {getCaptureOrdinal = 0},Capture {captureSource = "https://a/b/c/d", capturedText = "https://a/b/c/d", captureOffset = 0, captureLength = 15}),(CaptureOrdinal {getCaptureOrdinal = 1},Capture {captureSource = "https://a/b/c/d", capturedText = "d", captureOffset = 14, captureLength = 1}),(CaptureOrdinal {getCaptureOrdinal = 2},*** Exception: utf8_correct_bs: UTF-8 decoding error
CallStack (from HasCallStack):
error, called at ./Text/RE/ZeInternals/Types/Match.lhs:248:13 in regex-1.1.0.0-3PHJg3TXXTf4CPb8VxPErs:Text.RE.ZeInternals.Types.Match
package versions:
$ stack ls dependencies | grep regex
regex 1.1.0.0
regex-base 0.94.0.0
regex-pcre-builtin 0.95.1.2.8.43
regex-tdfa 1.3.1.0
regex-with-pcre 1.1.0.0
The same thing seems to be triggered if a group is optional and isn't present in the tested string. In the program below, the optional trailing zero isn't there, so I would expect to get a Nothing from captureTextMaybe.
#!/usr/bin/env stack
{- stack
script
--resolver lts-18.13
-}
{-# LANGUAGE QuasiQuotes #-}
import Text.RE.PCRE.String ((?=~), cp, re)
import Text.RE.Replace (captureTextMaybe)
main =
  mapM_ putStrLn $ captureTextMaybe [cp|1|] ("foo" ?=~ [re|^[a-z]+(0)?$|])
The actual result is
$ ./retest
retest: utf8_correct_bs: UTF-8 decoding error
CallStack (from HasCallStack):
error, called at ./Text/RE/ZeInternals/Types/Match.lhs:248:13 in regex-1.1.0.0-FyuON3BA52j97jnO9rbQpX:Text.RE.ZeInternals.Types.Match
This is still in 1.1.0.0 as supplied by Stackage LTS 18.13
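To make the expected behaviour concrete, here is a minimal sketch using only base. It assumes the regex-base convention that a group which did not participate in the match is reported with offset -1. The name safeCapture is hypothetical and is not part of the regex package; I have not checked how the library itself builds its captures.

```haskell
module Main where

-- | Hypothetical helper, NOT part of the regex package.
-- Extracts the text of a capture from its (offset, length) pair.
-- Assumption: a negative offset means the group did not take part
-- in the match (the regex-base convention for unused groups).
safeCapture :: String -> (Int, Int) -> Maybe String
safeCapture src (off, len)
  | off < 0 || len < 0     = Nothing  -- group absent: no text to return
  | off + len > length src = Nothing  -- range falls outside the source
  | otherwise              = Just (take len (drop off src))

main :: IO ()
main = do
  -- Whole match of ^[a-z]+(0)?$ against "foo".
  print (safeCapture "foo" (0, 3))
  -- The optional group (0)? is absent, so its offset is -1.
  print (safeCapture "foo" (-1, 0))
```

With a guard like this, the absent group yields Nothing instead of an exception, which is what I would expect captureTextMaybe to return.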
I have a similar bug with version 1.1.0.1:
{-# LANGUAGE QuasiQuotes #-}
module Main where
import Text.RE.PCRE.String
import Lib
main :: IO ()
main = print $ "abcd, 1234, EFGH" *=~ [re|\s*([\S]*)(,)*|]
With the result
utf8_correct_bs: UTF-8 decoding error
CallStack (from HasCallStack):
error, called at ./Text/RE/ZeInternals/Types/Match.lhs:249:13 in regex-1.1.0.1-8o4AWE74QTM4hfWvRxPd4N:Text.RE.ZeInternals.Types.Match
CallStack (from -prof):
Text.RE.ZeInternals.Types.Match.utf8_correct_bs.skip (Text/RE/ZeInternals/Types/Match.lhs:(248,5)-(255,39))
Text.RE.ZeInternals.Types.Match.utf8_correct_bs (Text/RE/ZeInternals/Types/Match.lhs:(244,1)-(264,49))
Text.RE.ZeInternals.Types.Match.CAF:lvl27_rjx8 (<no location info>)
It most likely comes from the library trying to fetch the Char after the H in order to compare it with the ,
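I have not read utf8_correct_bs, so the following is only a guess at the kind of guard that would avoid the crash. It assumes the function maps a byte offset in UTF-8 encoded input to a character offset. The name charOffset is hypothetical, and the sketch uses only base: an offset that is negative or past the end of the input returns Nothing instead of calling error.

```haskell
module Main where

import Data.Bits ((.&.))
import Data.Word (Word8)

-- | Hypothetical sketch, NOT the library's implementation.
-- Converts a byte offset into UTF-8 encoded input to a character offset
-- by counting the bytes before it that start a character, i.e. every
-- byte that is not a continuation byte of the form 10xxxxxx.
charOffset :: [Word8] -> Int -> Maybe Int
charOffset bytes n
  | n < 0            = Nothing  -- e.g. an unmatched group reported as -1
  | n > length bytes = Nothing  -- offset past the end of the input
  | otherwise        = Just (length (filter startsChar (take n bytes)))
  where
    startsChar b = b .&. 0xC0 /= 0x80

main :: IO ()
main = do
  let input = [0x61, 0xC3, 0xA9, 0x62]  -- UTF-8 bytes of "aéb"
  print (charOffset input 3)     -- byte 3 is 'b', the third character
  print (charOffset input (-1))  -- negative offset is rejected
  print (charOffset input 5)     -- past the end is rejected
```

Whether the failing offset is negative or past the end of the input, a total function of this shape would turn the exception into a value the caller can handle.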