
Highlight regex hits in substrings properly

Open pauloadilson opened this issue 2 years ago • 5 comments

I tried to add regexes for IMEI and licence plates to conf\RegexConfig.txt, but they did not work after indexing. I added these lines:

IMEI = \b[0-9]{14}[-/][0-9]{1}\b|\b[0-9]{14,15}\b|\b[0-9]{6}[-/][0-9]{2}[-/][0-9]{6}[-/][0-9]{1}\b
PLACA_MERCOSUL = \b[A-Z]{3}[0-9]{1}[A-Z0-9]{1}[0-9]{2}\b|\b[A-Z]{3}[-\s][0-9]{1}[A-Z0-9]{1}[0-9]{2}\b

First of all, the regexes did not find some test data I had prepared that should match the patterns above (I validated them at https://regex101.com/).

A search result in the METADATA section: [image]

This is the output: [image]

As one can see, the output does not correspond to the regex.

The REGEX for IMEI and Licence plates can be very helpful.

pauloadilson avatar Aug 18 '22 12:08 pauloadilson

Hi,

(1) Unfortunately, \b (word boundary) is supported only at the beginning and at the end of our regex expressions, not in the middle. I'll update the RegexConfig.txt comments to warn about that limitation. You can change your regex to:

IMEI = \b([0-9]{14}[-/][0-9]{1}|[0-9]{14,15}|[0-9]{6}[-/][0-9]{2}[-/][0-9]{6}[-/][0-9]{1})\b

(2) Second, the highlighter was changed some months ago because it was not highlighting regex hits in substrings (#760). As a quick and dirty fix, we implemented a far-from-ideal solution that highlights the whole surrounding string. As you can see, the selected IMEI is a substring of the highlighted sequence. This could certainly be improved to highlight just the hit.

(3) Lastly, we chose to always highlight substrings, even when the user did not explicitly search for them (your case, since you used word boundaries). We thought this could be helpful, since many users don't know what a word boundary or a substring is. But this applies only to the text highlighter: the returned items/results must contain the exact search pattern. So I guess the selected IMEI string itself (not the substring inside the highlighted sequence) appears somewhere else in the file whose snippet was printed above, right?
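
To illustrate point (1): in a standard engine such as Python's re module, the original pattern (with \b inside every alternative) and the rewritten one (a single \b at each end) return the same hits, which is consistent with the original form validating fine on regex101. The sketch below only demonstrates that equivalence on a few made-up sample strings. It does not use IPED's regex library, and the helper name `hits` is hypothetical.

```python
import re

# Original form from the report: \b repeated inside each alternative.
ORIGINAL = (
    r"\b[0-9]{14}[-/][0-9]{1}\b"
    r"|\b[0-9]{14,15}\b"
    r"|\b[0-9]{6}[-/][0-9]{2}[-/][0-9]{6}[-/][0-9]{1}\b"
)

# Suggested form: one \b at each end, alternatives wrapped in parentheses.
REWRITTEN = (
    r"\b([0-9]{14}[-/][0-9]{1}"
    r"|[0-9]{14,15}"
    r"|[0-9]{6}[-/][0-9]{2}[-/][0-9]{6}[-/][0-9]{1})\b"
)

# Made-up sample strings (assumptions, not data from the issue).
SAMPLES = [
    "IMEI: 490154203237518",
    "IMEI 49015420323751-8 found",
    "490154/20/323751/8",
    "x4901542032375180000y",  # digits embedded in a longer token
]


def hits(pattern, text):
    """Return every full match of pattern in text (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]


for sample in SAMPLES:
    # Both forms should agree in a standard regex engine.
    assert hits(ORIGINAL, sample) == hits(REWRITTEN, sample)
    print(sample, "->", hits(REWRITTEN, sample))
```

The last sample has its digit run embedded in a longer token, so the word boundaries reject it in both forms.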

lfcnassif avatar Aug 18 '22 17:08 lfcnassif

What do other devs think about changing behavior (3) above? Should we stop highlighting substrings if the user did not search for them explicitly? If so, that is not entirely straightforward, since the regex hits found (those listed in the metadata filter panel) keep no information about their original regex pattern (whether word boundaries were used or not). Maybe taking the Regex Name selected in the metadata panel and looking up its original pattern in RegexConfig.txt could help.

PS: (2) should be improved for sure.
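
A minimal sketch of the lookup idea for behavior (3), assuming RegexConfig.txt entries are shaped like `NAME = pattern` (as in the lines posted in this issue) and that `#` starts a comment line (an assumption). The function names `parse_regex_config` and `uses_word_boundaries` are hypothetical and are not part of IPED. The boundary check is a plain string heuristic, so it would misjudge a pattern ending in an escaped backslash followed by `b`.

```python
def parse_regex_config(text):
    """Map regex name -> pattern for lines shaped like 'NAME = pattern'."""
    patterns = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and (assumed) '#' comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only; the pattern itself may contain '='.
        name, sep, pattern = line.partition("=")
        if sep:
            patterns[name.strip()] = pattern.strip()
    return patterns


def uses_word_boundaries(pattern):
    """Heuristic: True if the pattern starts and ends with \\b."""
    return pattern.startswith(r"\b") and pattern.endswith(r"\b")


if __name__ == "__main__":
    sample_config = "\n".join([
        "# hypothetical sample entries",
        r"IMEI = \b([0-9]{14,15})\b",
        r"LOOSE_DIGITS = [0-9]{14,15}",
    ])
    for name, pattern in parse_regex_config(sample_config).items():
        print(name, "explicit word boundaries:", uses_word_boundaries(pattern))
```

With something like this, the highlighter could highlight substrings only for names whose pattern does not ask for word boundaries.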

lfcnassif avatar Aug 18 '22 17:08 lfcnassif

Hi. Actually, when I read the other regex entries I could see my errors in the IMEI regex syntax, including the one you pointed out. I had not considered the grouping syntax. So I added the following lines, which returned at least the information I wanted to see (and more, but it's not a problem):

MERCOSUL_CAR_PLATE = \b((([A-Z]{3})(()|-)([0-9]{1})([0-9A-Z]{1})([0-9]{2}))|(Placas?( |:|: )([A-Za-z]{3})( )([0-9]{1})([0-9A-Z]{1})([0-9]{2})))\b
IMEI = \b((([0-9]{14})(-|/|())([0-9]{1}))|([0-9]{14})|(([0-9]{6})(-|/)([0-9]{2})(-|/)([0-9]{6})(-|/|())([0-9]{1})))\b

Thank you for the quick response.

Paulo Adilson da Silva

pauloadilson avatar Aug 19 '22 16:08 pauloadilson

Great, but note that capturing groups, back references and lookahead are not supported by the library we use, according to the comments: https://github.com/sepinf-inc/IPED/blob/73c7def868ae9db4a0703c66a21a022179f37f85/iped-app/resources/config/conf/RegexConfig.txt#L8

lfcnassif avatar Aug 19 '22 16:08 lfcnassif

> and more, but it's not a problem

If IMEI or PLACA_MERCOSUL has some kind of checksum/validation digit, it is possible to write regex validators (take a look at the conf/regex_validators folder) to decrease the number of false positives.
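
For IMEI specifically, the 15th digit is a Luhn check digit, so a validator can discard 15-digit hits whose checksum fails. Below is a standalone Python sketch of that check. It is not written in the format that the scripts in conf/regex_validators expect, and the function names `luhn_valid` and `is_valid_imei` are hypothetical. Note that 14-digit hits carry no check digit, so this sketch rejects them; a real validator might prefer to let them through.

```python
import re


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def is_valid_imei(hit):
    """Strip the '-' and '/' separators, then require 15 Luhn-valid digits."""
    digits = re.sub(r"[-/]", "", hit)
    if not re.fullmatch(r"[0-9]{15}", digits):
        return False
    return luhn_valid(digits)


if __name__ == "__main__":
    # Commonly cited sample IMEI plus a variant with a wrong check digit.
    for candidate in ["490154203237518", "49015420323751-8", "490154203237519"]:
        print(candidate, is_valid_imei(candidate))
```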

lfcnassif avatar Aug 19 '22 20:08 lfcnassif