Recognizers-Text icon indicating copy to clipboard operation
Recognizers-Text copied to clipboard

[Java] Regex matcher wrongly handles negative lookbehinds

Open aitelint opened this issue 5 years ago • 11 comments

When dealing with negative lookbehinds, the method getMatches in RegExpUtility first looks for matches of the negative lookbehind group and if found, it searches for matches of the following regex group without taking into account its position in the full regex.

So, for example, the regex (?< month >\d{2})/(?<!,\s)(?< year >\d{2}) fails to match the input "So, 12/20" because "12" matches the "year" group regex and it follows a comma (but in the full regex it is a match of the "month" group regex).

aitelint avatar Dec 16 '20 15:12 aitelint

Hi @LionbridgeCSII, sorry for the delay. We weren't able to reproduce it using the attached example.

We have tried the RegExpUtility.getMatches method, and also in C#; and in all cases the value 12/20 was matched.

We have some questions:

  1. Do you have any repro steps for the issue?
  2. Do you have other examples or cases where this could also happen?
  3. What version of Java are you using?

The result using Java Recognizers image

The result using C# Recognizers image

VictorGrycuk avatar Jan 18 '21 19:01 VictorGrycuk

hi @VictorGrycuk, no problem and thanks for the reply.

  1. To reproduce the issue use for example the Java SimpleConsole (no need to define a new regex, the problem already occurs because of the negative lookbehind in DateYearRegex). When entering the text "So 12/20" two entities are extracted, a number and a date, but when entering the text "So, 12/20" only the number entity is extracted.
  2. Another example is "I'll be back at 3:32, 04/23/2016", without comma "3:32 04/23/2016" is extracted as a single datetime entity, with the comma instead only the time "3:32" and the year "2016" are extracted (in .Net "3:32, 04/23/2016" is still extracted as datetime).
  3. I am using openjdk version 15.

When I was examining the issue, I traced the problem to line 125 of RegExpUtility, where the regex group following a negative lookbehind is matched on the input string without taking into account its position in the whole regex. A possible solution to the problem could maybe be to let Java handle the regex (since recent versions support lookbehinds, the machinery introduced in RegExpUtility to deal with them seems unnecessary).

aitelint avatar Jan 19 '21 04:01 aitelint

Hi @VictorGrycuk and @LionbridgeCSII, the ideal solution would be to have it work in any Java version from 8 onwards. There's another issue about the code that decides to trigger some mitigation or use newer native support, which could apply here. Also, as there never was an official release of the Java packages, we may decide to drop support for older Java versions. @VictorGrycuk are you aware which Java versions the BotFramework SDK targets? We can follow that. In any case, we should try to validate that whatever solution here works for different Java SDKs for a given version, if possible.

tellarin avatar Jan 19 '21 04:01 tellarin

@LionbridgeCSII, we used the examples you attached following your repro steps, and the examples were correctly recognized as date.

Using the "So, 12/20" example, in C# and Java, the 12 is recognized as day and 20 is recognized as month.

We tested this in Java 11 and 15, and also in C#. Using Java 8 throws an exception related to the issue #1786.

@tellarin, BotBuilder-Java SDK requires Java 8.

Testing So, 12/20 (Left: with comma; right: without comma) image image

Testing I'll be back at 3:32, 04/23/2016 (Left: with comma; right: without comma) image image

VictorGrycuk avatar Jan 19 '21 20:01 VictorGrycuk

@VictorGrycuk, can you take a look at PR #2383? I believe the relevant spec cases were marked as

"Comment": "Java does not correctly handle lookbehinds.",

In English and Spanish. The issue seems related to that fallback regex processing method that exists for Java 8 (and is currently also being run for other versions).

tellarin avatar Jan 20 '21 02:01 tellarin

@VictorGrycuk, as BotBuilder requires at least Java 8, the recognizers should also support it, plus 11 (which is in LTS), and possibly 12 - 15. Thanks for checking.

tellarin avatar Jan 20 '21 02:01 tellarin

@LionbridgeCSII, we used the examples you attached following your repro steps, and the examples were correctly recognized as date.

@VictorGrycuk, interesting, it seems there is something wrong on my side then. Thanks for checking.

aitelint avatar Jan 20 '21 07:01 aitelint

@LionbridgeCSII Thanks for the update. If you are okay, you can close this issue, but feel free to re-open it if you need further assistance or open a new issue.

VictorGrycuk avatar Jan 20 '21 12:01 VictorGrycuk

@VictorGrycuk, as I mentioned above, please re-enable the Java tests disabled by @LionbridgeCSII in PR #2383 as related to this problem to confirm all works correctly. If a PR with those can pass the build, that’s the first step before closing. Also, we can’t close this issue until there is a fix that works for Java 8 too.

tellarin avatar Jan 20 '21 19:01 tellarin

Ok @tellarin, we will be reviewing the PR #2383 in order to confirm that everything is working as expected. We will let you know the updates in order to close the issue!

VictorGrycuk avatar Jan 20 '21 19:01 VictorGrycuk

Hi @tellarin, sorry for the delay. We re-enabled the specs that were disabled in the PR #2383 and all of them are failing only in Java with the current version of master branch.

We identified the RegEx (DateExtractor4) which wasn't matching with the date 04/23/2016 that was present in those specs.

The RegEx is composed by:

We tested rollbacking those changes in the DateYearRegex, also we brought the changes made in the PR #2439, and all the tests passed successfully for both cultures and using Java 8, Java 11 and also Java 15.

We will analyze those changes and if it's really necessary to rollback the changes and why, reviewing the PR #2430 🙂

Tests passing using Java 8/11/15 image

VictorGrycuk avatar Jan 29 '21 19:01 VictorGrycuk