datawave
Numeric regex normalization could be made more robust
The numeric regex normalization currently attempts a normal regex normalization against the value, and uses it as a quoted literal regex (e.g. \Q
I propose that we can clean this up a little and enable normalization of numbers as long as we know the complete mantissa. Also we should assume that '.' must be escaped in the regex to have something that matches the decimal point. So here are some examples:
(FIELD =~ '3.2') should fail numeric normalization (currently this succeeds incorrectly)
(FIELD =~ '3\.2') should become (FIELD =~ '\+ae3\.2')
(FIELD =~ '32\.2.*') should become (FIELD =~ '\+be3\.22.*')
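To make the proposed rule concrete, here is a rough Python sketch. It is not the datawave implementation (see the PR below for that). The function name `normalize_numeric_regex` is hypothetical, and the encoding is inferred only from the examples above: `+`, then a letter for the exponent (`a` = 0, `b` = 1, ...), then `e`, then the mantissa. It handles only positive values with a nonzero leading digit and an optional trailing `.*`, and it treats the mantissa as complete only when the integer part is fully known.

```python
import re

# Assumed shape: literal digits, an optional ESCAPED decimal point followed by
# more digits, and an optional trailing ".*" wildcard.
_SHAPE = re.compile(r"([1-9][0-9]*)(?:\\\.([0-9]*))?(\.\*)?")


def normalize_numeric_regex(pattern):
    """Return the normalized regex, or None if numeric normalization should fail."""
    m = _SHAPE.fullmatch(pattern)
    if m is None:
        # Covers '3.2': an unescaped '.' is a wildcard, not a decimal point.
        return None
    int_part, frac_part, tail = m.group(1), m.group(2), m.group(3)
    if tail and frac_part is None:
        # e.g. '1111.*': the wildcard could add integer digits, so the
        # exponent (and therefore the complete mantissa) is unknown.
        return None
    exponent = len(int_part) - 1
    if exponent > 25:
        return None  # outside the single-letter range assumed in this sketch
    digits = int_part + (frac_part or "")
    lead, rest = digits[0], digits[1:]
    if not tail:
        # For an exact value, trailing zeros do not belong in the mantissa.
        rest = rest.rstrip("0")
    mantissa = lead + ("\\." + rest if rest else "")
    return "\\+" + chr(ord("a") + exponent) + "e" + mantissa + (tail or "")


if __name__ == "__main__":
    for p in (r"3.2", r"3\.2", r"32\.2.*"):
        print(p, "->", normalize_numeric_regex(p))
```

Under these assumptions the three examples above come out as described: the first is rejected, and the other two are rewritten with the decimal point moved behind the first digit.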
PR created here: https://github.com/NationalSecurityAgency/datawave-type-utils/pull/11
Some regex forms that I am seeing:
1111.*
1111.*?
1111\d*
.*?1111
.*1111
.*1111.*
^11[23].*
1111[0-9]{5}
.*1111\..*
Note that the ? is redundant after .* but apparently that is used often. Also note that a numeric range for a digit or set of digits is not uncommon. That might be a nice one to attack next.
I see many combinations of the above. I have also seen .+ instead of .*, as well as a series of the patterns above, all in parentheses and separated by | characters.
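Since the redundant `?` shows up so often, one possible pre-processing step would be to drop it before attempting normalization. The sketch below is only an illustration in Python, not datawave code. The helper name `strip_lazy_markers` is hypothetical, and it assumes the regex is matched against the whole field value, which is the case where `.*?` and `.*` accept the same strings. It skips escaped characters and simple character classes, and it does not handle a literal `]` placed first inside a class.

```python
import re


def strip_lazy_markers(pattern):
    """Rewrite '.*?' as '.*' and '.+?' as '.+', leaving everything else alone."""
    out = []
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            # Escaped character (e.g. '\.'): copy both characters untouched.
            out.append(pattern[i:i + 2])
            i += 2
        elif ch == "[":
            # Character class: copy through the closing ']' untouched.
            j = i + 1
            while j < n and pattern[j] != "]":
                j += 2 if pattern[j] == "\\" else 1
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch == "." and pattern[i + 1:i + 3] in ("*?", "+?"):
            # Wildcard with a lazy marker: keep '.*' or '.+', drop the '?'.
            out.append(pattern[i:i + 2])
            i += 3
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for p in ("1111.*?", ".*?1111", r".*1111\..*", "1111.+?"):
        print(p, "->", strip_lazy_markers(p))
    # Whole-value matching gives the same answer with or without the '?'.
    for value in ("1111", "11112", "21111"):
        assert bool(re.fullmatch("1111.*?", value)) == bool(re.fullmatch("1111.*", value))
```

Collapsing these variants first would leave fewer distinct forms for the numeric normalization itself to recognize.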