datawave Numeric regex normalization could be made more robust

Open ivakegg opened this issue 2 years ago • 1 comments

The numeric regex normalization currently attempts to normalize the regex as if it were a plain value and uses the result as a quoted literal regex (e.g. \Q\E). Hence, if the regex contains any wildcards, the normalization fails and the regex becomes evaluation only (see #1558). Also, for a numeric regex such as '3.2', the current normalizer produces '\Q+ae3.2\E', which treats the '.' wildcard as a literal. This is not quite correct.
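To illustrate why treating '.' as a literal changes what the regex matches, here is a minimal Python sketch. It is not the datawave code: Python has no \Q\E syntax, so `re.escape` stands in for the quoted literal, and the helper name `matches` is made up for this example.

```python
import re


def matches(regex: str, value: str) -> bool:
    """Return True if the whole value matches the regex."""
    return re.fullmatch(regex, value) is not None


pattern = "3.2"

# As a regex, the unescaped '.' is a wildcard and accepts any single character.
print(matches(pattern, "3.2"))  # True
print(matches(pattern, "302"))  # True

# Quoted as a literal (the stand-in for \Q...\E), only the exact text matches.
print(matches(re.escape(pattern), "3.2"))  # True
print(matches(re.escape(pattern), "302"))  # False
```

The quoted form silently narrows the set of values the original regex would have accepted.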

I propose that we clean this up a little and enable normalization of numbers as long as we know the complete mantissa. We should also assume that '.' must be escaped in the regex to match the decimal point. Here are some examples:

(FIELD =~ '3.2') should fail numeric normalization (currently this succeeds incorrectly)
(FIELD =~ '3\.2') should become (FIELD =~ '\+ae3\.2')
(FIELD =~ '32\.2.*') should become (FIELD =~ '\+be3\.22.*')
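
The three examples above could be sketched roughly as follows. This is an illustration in Python, not the datawave-type-utils implementation: the function name `normalize_numeric_regex` is hypothetical, and the encoding (a letter for the exponent, 'a' meaning 0, followed by 'e' and the mantissa) is only inferred from the examples in this issue. The sketch handles positive numbers with a nonzero integer part and an escaped decimal point, and gives up on everything else.

```python
import re
from typing import Optional

# Hypothetical accepted shape: digits, an escaped decimal point, digits,
# and an optional trailing ".*" wildcard.
_SIMPLE_DECIMAL = re.compile(r"([0-9]+)\\\.([0-9]+)(\.\*)?")


def normalize_numeric_regex(regex: str) -> Optional[str]:
    """Return a normalized regex, or None if normalization should fail."""
    m = _SIMPLE_DECIMAL.fullmatch(regex)
    if m is None:
        # Covers '3.2': an unescaped '.' is a wildcard, not a decimal point.
        return None
    int_part = m.group(1).lstrip("0")
    frac_part = m.group(2)
    wildcard = m.group(3) or ""
    # Assumption: stay conservative and skip values below 1 and trailing
    # zeros, which would need extra canonicalization.
    if not int_part or frac_part.endswith("0"):
        return None
    exponent = len(int_part) - 1
    if exponent > 25:
        return None  # this sketch only has letters 'a'..'z' for exponents
    digits = int_part + frac_part
    mantissa = digits[0] + r"\." + digits[1:]
    return r"\+" + chr(ord("a") + exponent) + "e" + mantissa + wildcard


print(normalize_numeric_regex("3.2"))       # None
print(normalize_numeric_regex(r"3\.2"))     # \+ae3\.2
print(normalize_numeric_regex(r"32\.2.*"))  # \+be3\.22.*
```

Requiring the escaped decimal point means the integer part is fully known, so the exponent is fixed even when a trailing wildcard leaves the rest of the mantissa open.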

May 19 '22 17:05 ivakegg

PR created here: https://github.com/NationalSecurityAgency/datawave-type-utils/pull/11

Aug 30 '22 15:08 lbschanno

Some regex forms that I am seeing:

1111.*
1111.*?
1111\d*
.*?1111
.*1111
.*1111.*
^11[23].*
1111[0-9]{5}
.*1111\..*

Note that the ? is redundant after .* but apparently that is used often. Also note that a numeric range for a digit or set of digits is not uncommon. That might be a nice one to attack next.
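
One possible way to approach the digit-set case is to expand a small character class into its literal alternatives, each of which could then be normalized on its own. The Python sketch below is only an illustration: the helper `expand_digit_classes` is hypothetical, and it handles just bracketed digit lists such as `[23]` and single ranges such as `[0-9]`, with no quantifiers or wildcards.

```python
import re
from itertools import product
from typing import List

# Matches a class of listed digits like [23] or a single range like [0-9].
_DIGIT_CLASS = re.compile(r"\[([0-9]+|[0-9]-[0-9])\]")


def expand_digit_classes(prefix: str) -> List[str]:
    """Expand digit classes in a literal prefix into all alternatives."""
    parts: List[List[str]] = []
    pos = 0
    for m in _DIGIT_CLASS.finditer(prefix):
        parts.append([prefix[pos:m.start()]])  # literal text before the class
        body = m.group(1)
        if "-" in body:
            lo, hi = int(body[0]), int(body[2])
            parts.append([str(d) for d in range(lo, hi + 1)])
        else:
            parts.append(list(body))
        pos = m.end()
    parts.append([prefix[pos:]])  # literal text after the last class
    return ["".join(combo) for combo in product(*parts)]


print(expand_digit_classes("11[23]"))  # ['112', '113']
print(expand_digit_classes("1[0-2]"))  # ['10', '11', '12']
```

Expansion grows quickly, so a form like `1111[0-9]{5}` would need a different strategy than enumerating every alternative.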

Oct 26 '22 13:10 ivakegg

I see many combinations of the above. I have also seen .+ instead of .*. I have also seen a series of the patterns above, all in parentheses and separated by | characters.

Oct 31 '22 15:10 ivakegg
