
Numeric regex normalization could be made more robust

Open ivakegg opened this issue 2 years ago • 1 comments

The numeric regex normalization currently attempts a normal regex normalization against the value and uses the value as a quoted literal regex (e.g. \Q...\E). Hence if the value contains any wildcards, the normalization fails and the regex becomes evaluation-only (see #1558). Also, for a numeric regex such as '3.2', the current normalizer produces '\Q+ae3.2\E', which treats the '.' wildcard as a literal. This is not quite correct.
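
To illustrate the '.' problem, here is a small Python sketch (not DataWave code). Python's `re` module has no `\Q...\E`, so `re.escape` stands in for the quoting here, and the sample values are made up for the illustration.

```python
import re

pattern = "3.2"  # user-supplied regex; the '.' is unescaped, so it is a wildcard
values = ["3.2", "312", "3x2"]  # hypothetical field values

# As a regex, '3.2' accepts any single character between the 3 and the 2.
as_regex = [v for v in values if re.fullmatch(pattern, v)]

# Quoting the whole value turns the '.' into a literal decimal point,
# which narrows what the user actually asked for.
quoted = re.escape(pattern)
as_literal = [v for v in values if re.fullmatch(quoted, v)]

print(as_regex)
print(as_literal)
```

The regex form accepts all three values, while the quoted form accepts only the first.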

I propose that we clean this up a little and enable normalization of numbers as long as we know the complete mantissa. We should also require that '.' be escaped in the regex for it to match the decimal point. Here are some examples:

(FIELD =~ '3.2') should fail numeric normalization (currently this succeeds incorrectly)
(FIELD =~ '3\.2') should become (FIELD =~ '\+ae3\.2')
(FIELD =~ '32\.2.*') should become (FIELD =~ '\+be3\.22.*')
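
The three examples above can be sketched in Python roughly as follows. This is only an illustration of the proposed behavior, not the actual normalizer: the function name `normalize_numeric_regex` is hypothetical, and the encoding rule (a `+`, then the letter `a` plus the decimal exponent, then `e` and the mantissa) is extrapolated from the two encoded examples above. It only covers positive values of at least 1 and a single trailing `.*`.

```python
import re
from typing import Optional

# Accepted shape: digits, optionally an ESCAPED decimal point plus digits,
# optionally a trailing '.*'. An unescaped '.' inside the number is rejected.
_SHAPE = re.compile(r"(\d+)(?:\\\.(\d+))?(\.\*)?")


def normalize_numeric_regex(pattern: str) -> Optional[str]:
    """Return the normalized regex, or None if normalization should fail."""
    m = _SHAPE.fullmatch(pattern)
    if m is None:
        return None
    int_part = m.group(1).lstrip("0")
    frac_part = m.group(2) or ""
    tail = m.group(3) or ""
    # Without an escaped decimal point, a trailing wildcard could add more
    # integer digits, so the exponent would be unknown.
    if tail and m.group(2) is None:
        return None
    # Values below 1 need a negative exponent, which this sketch skips.
    if not int_part:
        return None
    exponent = len(int_part) - 1
    if exponent > 25:
        return None
    digits = int_part + frac_part
    lead, rest = digits[0], digits[1:]
    if tail:
        # Trailing zeros before a wildcard are ambiguous; decline them.
        if rest.endswith("0"):
            return None
    else:
        rest = rest.rstrip("0")
    letter = chr(ord("a") + exponent)  # assumption: 'a' = exponent 0
    mantissa = lead + ("\\." + rest if rest else "")
    return "\\+" + letter + "e" + mantissa + tail


print(normalize_numeric_regex(r"3.2"))      # unescaped '.', fails
print(normalize_numeric_regex(r"3\.2"))
print(normalize_numeric_regex(r"32\.2.*"))
```

In this sketch the trailing `.*` is only accepted after an escaped decimal point, since otherwise the wildcard could change the exponent.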

ivakegg avatar May 19 '22 17:05 ivakegg

PR created here: https://github.com/NationalSecurityAgency/datawave-type-utils/pull/11

lbschanno avatar Aug 30 '22 15:08 lbschanno

Some regex forms that I am seeing:

1111.*
1111.*?
1111\d*
.*?1111
.*1111
.*1111.*
^11[23].*
1111[0-9]{5}
.*1111\..*

Note that the ? after .* is redundant (it only makes the match lazy and does not change which values match), but apparently it is used often. Also note that a numeric range for a digit or set of digits is not uncommon. That might be a nice one to attack next.
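
A quick Python check of the redundancy, using made-up sample values. Greedy and lazy quantifiers differ in which match is preferred, not in whether a match exists, so both forms accept the same values.

```python
import re

samples = ["1111", "11115", "1111.25", "21111", "111"]  # hypothetical values


def accepted(pattern: str) -> list:
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]


greedy = accepted(r"1111.*")
lazy = accepted(r"1111.*?")
print(greedy == lazy)
```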

ivakegg avatar Oct 26 '22 13:10 ivakegg

I see many combinations of the above. I have also seen .+ used instead of .*, and I have seen a series of the patterns above, each in parentheses, separated by | characters.
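
One way to approach the alternation case would be to split on top-level | characters and handle each branch separately. A rough Python sketch, assuming the `(p1)|(p2)|...` form; `split_alternatives` is a hypothetical helper, not existing DataWave code:

```python
from typing import List


def split_alternatives(pattern: str) -> List[str]:
    """Split a regex on '|' characters that sit outside any group,
    character class, or escape sequence."""
    parts, current = [], []
    depth, in_class = 0, False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Keep an escaped character together with its backslash.
            current.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "|" and depth == 0:
            # Top-level separator: close the current branch.
            parts.append("".join(current))
            current = []
            i += 1
            continue
        current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


print(split_alternatives(r"(1111.*)|(.*1111)|(^11[23].*)"))
```

Each returned branch still carries its own parentheses, so a caller would need to strip those before trying numeric normalization on the branch.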

ivakegg avatar Oct 31 '22 15:10 ivakegg