php-unstructured-text-parser

Using regex throws error

Open ouija opened this issue 1 year ago • 2 comments

Hi there,

Just realized that you released an update a few years ago that allows you to use a regex when targeting data to parse, however, when I try to utilize this, the script appears to be throwing an error:

 Got error 'PHP message: PHP Warning:  preg_match(): Compilation failed: quantifier does not follow a repeatable item at offset 249 in /vendor/aymanrb/php-unstructured-text-parser/src/TextParser.php on line 68
PHP message: PHP Fatal error:  Uncaught TypeError: array_keys() expects parameter 1 to be array, null given in /vendor/aymanrb/php-unstructured-text-parser/src/TextParser.php

I am using $parser->parseText($message)->getParsedRawData(); in conjunction with this, if that helps.

I am simply testing the extraction of a phone number from the text, something like +17785542644, using a variable with a regex such as {%customer_phone:^\+\d{1,15}$%}

Using a plain variable such as {%customer_phone%} has no issue, only when I attempt to use a regular expression.

Let me know if you have any insights! Thank you.

ouija, Aug 04 '23 09:08

Hey,

Not really sure. It seems like there is a bug in there, so I will have to look into it and debug it. I will try to find some time for that.

The whole template is already compiled into a single regex string of named variables, which is then used to extract the values, so having things like ^, + or $ in the middle doesn't play well. It also seems my implementation doesn't respect the escaped characters passed in the custom regex.
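To illustrate why anchors in the middle are a problem, here is a minimal sketch. It assumes the custom regex ends up inside a named group within a larger pattern; the literal "Phone: " prefix and the pattern layout are made up for illustration and are not the regex the library actually generates.

```php
<?php
// Hypothetical sample text and template layout (assumptions for illustration).
$text = 'Phone: +17785542644';

// Anchored custom regex embedded mid-pattern: without the multiline flag,
// ^ only matches at the very start of the subject, so it cannot succeed
// after the literal "Phone: " prefix.
$anchored = '/Phone: (?<customer_phone>^\+\d{1,15}$)/';
$anchoredResult = preg_match($anchored, $text, $anchoredMatches);

// The same custom regex with the anchors removed.
$unanchored = '/Phone: (?<customer_phone>\+\d{1,15})/';
$unanchoredResult = preg_match($unanchored, $text, $unanchoredMatches);

var_dump($anchoredResult);   // int(0): no match
var_dump($unanchoredResult); // int(1): match
echo $unanchoredMatches['customer_phone'], PHP_EOL; // +17785542644
```

So even once escaping is handled correctly, a custom regex written for a standalone string may need its ^ and $ dropped before it is used inside a template variable.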

Maybe you can simplify it to a character-set regex; for example, [+0-9]{1,15} should work. (I know it's not exactly what you were trying to achieve with your regex, but unless you really need a strict match, this should work for the meantime.)
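A rough sketch of how that workaround behaves, again assuming a made-up "Phone: " prefix and a named group around the custom regex (not the library's real generated pattern). The character class contains no backslashes or anchors, so there is nothing in it that needs escaping.

```php
<?php
// Hypothetical sample text and template layout (assumptions for illustration).
$text = 'Phone: +17785542644';

// Workaround pattern: a plain character class with a length limit.
$pattern = '/Phone: (?<customer_phone>[+0-9]{1,15})/';
$matched = preg_match($pattern, $text, $matches);
$phone = $matched === 1 ? $matches['customer_phone'] : null;

echo $phone, PHP_EOL; // +17785542644

// Trade-off: the class accepts "+" in any position, so it is looser than
// the original ^\+\d{1,15}$ and also accepts strings such as "12+34".
$looseMatch = preg_match('/^[+0-9]{1,15}$/', '12+34');
var_dump($looseMatch); // int(1)
```

If the input is known to contain well-formed numbers, the looser match is usually harmless; otherwise the extracted value can be validated separately after parsing.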

aymanrb, Aug 05 '23 21:08

There is an issue within the code where it inadvertently eliminates all backslashes from the given pattern. Consequently, this prevents us from successfully identifying and extracting specific characters, such as the "+" sign in a country code within a phone number, as illustrated in your example.
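The effect can be reproduced in isolation with a small sketch. The str_replace call below is only a stand-in for the backslash removal described above; it is an assumption for illustration, not the code in TextParser.php.

```php
<?php
// The custom regex from the report, as written in the template variable.
$customRegex = '^\+\d{1,15}$';

// Stand-in for the bug (assumed): every backslash is removed.
$stripped = str_replace('\\', '', $customRegex);
echo $stripped, PHP_EOL; // ^+d{1,15}$

// With the backslash gone, "+" becomes a quantifier applied to "^", which
// PCRE rejects at compile time. preg_match() then emits a warning
// (suppressed here with @) and returns false.
$strippedResult = @preg_match('/' . $stripped . '/', '+17785542644');
var_dump($strippedResult); // bool(false)

// The intact pattern compiles and matches the sample number.
$intactResult = preg_match('/' . $customRegex . '/', '+17785542644');
var_dump($intactResult); // int(1)
```

This lines up with the "quantifier does not follow a repeatable item" warning in the report, where the offset points into the combined template regex rather than the short pattern used here.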

The resolution for this is provided in #38.

I have also modified the t_8.txt template file to reflect this case. @ouija, would you kindly check whether these adjustments align with your intended use case?

aymanrb, Aug 07 '23 08:08