missing matches in RE alternative cause UTF-8 decode error

Open kquick opened this issue 6 years ago • 3 comments

If a match target appears inside an alternative, an error is thrown:

$ ghci
Prelude> import Text.RE.PCRE.String
Prelude PCRE.String> r = [re|foo(A${here}(.*)B|C${there}(.*)D)|]
Prelude PCRE.String> allMatches ("foobar" *=~ r)
[]
Prelude PCRE.String> allMatches ("fooAoneB" *=~ r)
[ Match {matchSource = "fooAoneB", .... *** Exception: utf8_correct_bs: UTF-8 decoding error
CallStack (from HasCallStack):
  error, called at ./Text/RE/ZeInternals/Types/Match.lhs:248:13 in regex-1.0.2.0-CuYMcTBVvnH4p7K8LCU2iN:Text.RE.ZeInternals.Types.Match
Prelude PCRE.String> allMatches ("fooCtwoD" *=~ r)
[ Match {matchSource = "fooCtwoD", ... [same error]

The error seems to occur when the alternative that contains the capture is not the one that matches:

PCRE.String> r = [re|foo(A${here}(.*)B|CD)|]
PCRE.String> allMatches ("foobar" *=~ r)
[]
PCRE.String> allMatches ("fooAbarB" *=~ r)
... valid match, no error ...
PCRE.String> allMatches ("fooCD" *=~ r)
... error as above...

It's possible this is an invalid usage on my part, but I would expect a different type of error than a UTF-8 decoding error. Additionally, I originally had the same match name on both alternatives and got the same error, so I should have had a valid match regardless of which alternative matched.

regex version 1.0.2.0

kquick, Oct 20 '19 06:10

The bug is still present:

> import Text.RE.PCRE.Text

Text.RE.PCRE.Text> urlRegex = [re|^https?:\/\/.+\/(\w+)(?:\.(\w+))?(?:[\?|#].*)?$|]

Text.RE.PCRE.Text> "a" ?=~ urlRegex
Match {matchSource = "a", captureNames = fromList [], matchArray = array (CaptureOrdinal {getCaptureOrdinal = 1},CaptureOrdinal {getCaptureOrdinal = 0}) []}

Text.RE.PCRE.Text> "https://a/b/c/d" ?=~ urlRegex
Match {matchSource = "https://a/b/c/d", captureNames = fromList [], matchArray = array (CaptureOrdinal {getCaptureOrdinal = 0},CaptureOrdinal {getCaptureOrdinal = 2}) [(CaptureOrdinal {getCaptureOrdinal = 0},Capture {captureSource = "https://a/b/c/d", capturedText = "https://a/b/c/d", captureOffset = 0, captureLength = 15}),(CaptureOrdinal {getCaptureOrdinal = 1},Capture {captureSource = "https://a/b/c/d", capturedText = "d", captureOffset = 14, captureLength = 1}),(CaptureOrdinal {getCaptureOrdinal = 2},*** Exception: utf8_correct_bs: UTF-8 decoding error
CallStack (from HasCallStack):
  error, called at ./Text/RE/ZeInternals/Types/Match.lhs:248:13 in regex-1.1.0.0-3PHJg3TXXTf4CPb8VxPErs:Text.RE.ZeInternals.Types.Match

package versions:

$ stack ls dependencies | grep regex
regex 1.1.0.0
regex-base 0.94.0.0
regex-pcre-builtin 0.95.1.2.8.43
regex-tdfa 1.3.1.0
regex-with-pcre 1.1.0.0

mnn, Oct 10 '20 13:10

The same thing seems to be triggered if a group is optional and isn't present in the tested string. In the program below, the optional trailing zero isn't there, so I would expect to get a Nothing from captureTextMaybe.

#!/usr/bin/env stack
{- stack
   script
   --resolver lts-18.13
-}
{-# LANGUAGE QuasiQuotes #-}

import Text.RE.PCRE.String ((?=~), cp, re)
import Text.RE.Replace (captureTextMaybe)

main =
  mapM_ putStrLn $ captureTextMaybe [cp|1|] ("foo" ?=~ [re|^[a-z]+(0)?$|])

The actual result is

$ ./retest
retest: utf8_correct_bs: UTF-8 decoding error
CallStack (from HasCallStack):
  error, called at ./Text/RE/ZeInternals/Types/Match.lhs:248:13 in regex-1.1.0.0-FyuON3BA52j97jnO9rbQpX:Text.RE.ZeInternals.Types.Match

This still occurs in regex 1.1.0.0, as supplied by Stackage LTS 18.13.
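
The expected result can be modeled without the library. The sketch below uses only base and assumes the regex-base convention that a group which did not take part in the match is reported with offset -1 and length 0. The names RawCapture and captureMaybe are hypothetical and are not part of regex; this is an illustration of the expected Nothing, not the library's implementation.

```haskell
-- Hypothetical model, not the regex library's code.
-- Assumption: an unmatched group is reported as (offset, length) = (-1, 0),
-- following the regex-base convention.

-- (offset, length) of a capture group within the source string.
type RawCapture = (Int, Int)

-- Return the captured text, or Nothing when the group did not participate
-- in the match or its offsets fall outside the source string.
captureMaybe :: String -> RawCapture -> Maybe String
captureMaybe src (off, len)
  | off < 0 || len < 0     = Nothing
  | off + len > length src = Nothing
  | otherwise              = Just (take len (drop off src))

main :: IO ()
main = do
  let src = "foo"
  print (captureMaybe src (0, 3))   -- whole match: Just "foo"
  print (captureMaybe src (-1, 0))  -- absent optional group: Nothing
```

Under that assumption, the absent optional group maps to Nothing instead of raising an exception, which is the result expected from captureTextMaybe in the script above.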

jbash, Oct 23 '21 19:10

I have a similar bug with version 1.1.0.1:

{-# LANGUAGE QuasiQuotes #-}
module Main where

import Text.RE.PCRE.String

import Lib

main :: IO ()
main = print $ "abcd, 1234, EFGH" *=~ [re|\s*([\S]*)(,)*|]

With the result

utf8_correct_bs: UTF-8 decoding error
CallStack (from HasCallStack):
  error, called at ./Text/RE/ZeInternals/Types/Match.lhs:249:13 in regex-1.1.0.1-8o4AWE74QTM4hfWvRxPd4N:Text.RE.ZeInternals.Types.Match
CallStack (from -prof):
  Text.RE.ZeInternals.Types.Match.utf8_correct_bs.skip (Text/RE/ZeInternals/Types/Match.lhs:(248,5)-(255,39))
  Text.RE.ZeInternals.Types.Match.utf8_correct_bs (Text/RE/ZeInternals/Types/Match.lhs:(244,1)-(264,49))
  Text.RE.ZeInternals.Types.Match.CAF:lvl27_rjx8 (<no location info>)

It surely comes from the fact that the library tries to fetch the Char after the H to compare it to the ','.
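
Another reading, consistent with the earlier reports in this thread, is that the failing capture is a group that did not match at all. The sketch below is a hypothetical model of how such an error could arise; it is not the code in Match.lhs. It assumes that byte offsets are converted to character offsets by walking forward from byte 0, and that an unmatched group arrives with offset -1. The names utf8Width and charOffset are made up for illustration.

```haskell
-- Hypothetical model, not the regex library's code.
-- Assumptions: byte offsets are converted to character offsets by walking
-- forward from byte 0, and an unmatched group arrives as offset -1.
import Data.Word (Word8)

-- Width in bytes of the UTF-8 sequence that starts with this lead byte.
utf8Width :: Word8 -> Maybe Int
utf8Width w
  | w < 0x80  = Just 1
  | w < 0xC0  = Nothing   -- continuation byte, not a valid lead byte
  | w < 0xE0  = Just 2
  | w < 0xF0  = Just 3
  | w < 0xF8  = Just 4
  | otherwise = Nothing

-- Convert a byte offset into a character offset by walking forward from 0.
charOffset :: [Word8] -> Int -> Either String Int
charOffset bytes target = go 0 0 bytes
  where
    go byteIx charIx rest = case compare byteIx target of
      EQ -> Right charIx
      GT -> Left "overshot target: UTF-8 decoding error"
      LT -> case rest of
        []      -> Left "target beyond end of input"
        (w : _) -> case utf8Width w of
          Nothing -> Left "invalid lead byte"
          Just n  -> go (byteIx + n) (charIx + 1) (drop n rest)

main :: IO ()
main = do
  let bytes = [0x45, 0x46, 0x47, 0x48]  -- "EFGH" encoded as UTF-8
  print (charOffset bytes 4)     -- Right 4
  print (charOffset bytes (-1))  -- Left "overshot target: UTF-8 decoding error"
```

In this model a forward walk starting at 0 has already passed a target of -1 on its first step, so it reports a decoding-style error even though the input is plain ASCII. If the library behaves similarly, the (,)* group that is absent after EFGH would be enough to trigger the exception.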

Trajjan, Mar 04 '22 07:03