[Java] Regex matcher wrongly handles negative lookbehinds
When dealing with negative lookbehinds, the method getMatches in RegExpUtility first looks for matches of the negative lookbehind group and if found, it searches for matches of the following regex group without taking into account its position in the full regex.
So, for example, the regex (?<month>\d{2})/(?<!,\s)(?<year>\d{2}) fails to match the input "So, 12/20" because "12" matches the "year" group regex and it follows a comma (but in the full regex it is a match of the "month" group regex).
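For comparison, here is a minimal sketch of what the JDK's own java.util.regex engine does with the same regex and inputs. It does not go through RegExpUtility at all; the class name LookbehindDemo and the helper firstMatch are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the Recognizers code base.
final class LookbehindDemo {
    // The regex from this report, compiled with the JDK's built-in engine.
    static final Pattern DATE =
            Pattern.compile("(?<month>\\d{2})/(?<!,\\s)(?<year>\\d{2})");

    // Returns the first full match in the input, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = DATE.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The lookbehind is evaluated right before "20", where the two
        // preceding characters are "2/", so it does not block the match.
        System.out.println(firstMatch("So, 12/20")); // prints 12/20
        System.out.println(firstMatch("So 12/20"));  // prints 12/20
    }
}
```

With the native engine the comma before "12" is irrelevant, because the lookbehind is only checked at the position where the "year" group starts.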
Hi @LionbridgeCSII, sorry for the delay. We weren't able to reproduce it using the attached example.
We tried it with the RegExpUtility.getMatches method in Java and also in C#; in all cases the value 12/20 was matched.
We have some questions:
- Do you have any repro steps for the issue?
- Do you have other examples or cases where this could also happen?
- What version of Java are you using?
The result using Java Recognizers

The result using C# Recognizers

hi @VictorGrycuk, no problem and thanks for the reply.
- To reproduce the issue use for example the Java SimpleConsole (no need to define a new regex, the problem already occurs because of the negative lookbehind in DateYearRegex). When entering the text "So 12/20" two entities are extracted, a number and a date, but when entering the text "So, 12/20" only the number entity is extracted.
- Another example is "I'll be back at 3:32, 04/23/2016". Without the comma, "3:32 04/23/2016" is extracted as a single datetime entity; with the comma, only the time "3:32" and the year "2016" are extracted (in .NET, "3:32, 04/23/2016" is still extracted as a datetime).
- I am using openjdk version 15.
When I was examining the issue, I traced the problem to line 125 of RegExpUtility, where the regex group following a negative lookbehind is matched against the input string without taking into account its position in the whole regex. A possible solution might be to let Java handle the regex (since recent versions support lookbehinds, the machinery introduced in RegExpUtility to deal with them seems unnecessary).
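To make the failure mode concrete, here is a rough sketch of how a position-unaware check can go wrong. This is not the actual RegExpUtility code; naiveMatches is a deliberately simplified, hypothetical stand-in for the behavior described above, and nativeMatches simply hands the full regex to java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; names and logic are assumptions,
// not a copy of the RegExpUtility implementation.
final class NaiveLookbehindSketch {
    // Lookbehind content followed by the next group, searched anywhere in the input.
    static final Pattern BLOCKER = Pattern.compile(",\\s\\d{2}");
    // The full regex with the lookbehind stripped out.
    static final Pattern STRIPPED = Pattern.compile("\\d{2}/\\d{2}");
    // The full regex as written, evaluated by the JDK engine.
    static final Pattern NATIVE =
            Pattern.compile("(?<month>\\d{2})/(?<!,\\s)(?<year>\\d{2})");

    // Position-unaware emulation: any ", dd" in the input vetoes the match,
    // even when "dd" is really the month and not the year.
    static boolean naiveMatches(String input) {
        if (BLOCKER.matcher(input).find()) {
            return false;
        }
        return STRIPPED.matcher(input).find();
    }

    // Native handling: the lookbehind is checked only where the year group starts.
    static boolean nativeMatches(String input) {
        return NATIVE.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(naiveMatches("So, 12/20"));  // false: ", 12" vetoes the match
        System.out.println(nativeMatches("So, 12/20")); // true
        System.out.println(naiveMatches("So 12/20"));   // true
    }
}
```

The difference between the two methods on "So, 12/20" mirrors the symptom reported in SimpleConsole: the date is dropped only when a comma precedes the month.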
Hi @VictorGrycuk and @LionbridgeCSII, the ideal solution would be to have it work in any Java version from 8 onwards. There's another issue about the code that decides to trigger some mitigation or use newer native support, which could apply here. Also, as there never was an official release of the Java packages, we may decide to drop support for older Java versions. @VictorGrycuk are you aware which Java versions the BotFramework SDK targets? We can follow that. In any case, we should try to validate that whatever solution here works for different Java SDKs for a given version, if possible.
@LionbridgeCSII, we used the examples you attached following your repro steps, and the examples were correctly recognized as date.
Using the "So, 12/20" example, in C# and Java, the 12 is recognized as day and 20 is recognized as month.
We tested this in Java 11 and 15, and also in C#. Java 8 throws an exception related to issue #1786.
@tellarin, BotBuilder-Java SDK requires Java 8.
Testing So, 12/20 (Left: with comma; right: without comma)

Testing I'll be back at 3:32, 04/23/2016 (Left: with comma; right: without comma)

@VictorGrycuk, can you take a look at PR #2383? I believe the relevant spec cases, in English and Spanish, were marked with
"Comment": "Java does not correctly handle lookbehinds.",
The issue seems related to the fallback regex processing method that exists for Java 8 (and is currently also being run for other versions).
@VictorGrycuk, as BotBuilder requires at least Java 8, the recognizers should also support it, plus 11 (which is in LTS), and possibly 12 - 15. Thanks for checking.
> @LionbridgeCSII, we used the examples you attached following your repro steps, and the examples were correctly recognized as date.
@VictorGrycuk, interesting, it seems there is something wrong on my side then. Thanks for checking.
@LionbridgeCSII Thanks for the update. If you are okay, you can close this issue, but feel free to re-open it if you need further assistance or open a new issue.
@VictorGrycuk, as I mentioned above, please re-enable the Java tests disabled by @LionbridgeCSII in PR #2383 as related to this problem to confirm all works correctly. If a PR with those can pass the build, that’s the first step before closing. Also, we can’t close this issue until there is a fix that works for Java 8 too.
Ok @tellarin, we will be reviewing the PR #2383 in order to confirm that everything is working as expected. We will let you know the updates in order to close the issue!
Hi @tellarin, sorry for the delay. We re-enabled the specs that were disabled in PR #2383, and all of them fail, only in Java, on the current master branch.
We identified the regex (DateExtractor4) that wasn't matching the date 04/23/2016 present in those specs.
The regex is composed of:
- MonthNumRegex
- DayRegex
- DateYearRegex: recently modified in the PR #2430 (see the changes for the Spanish file and for the English File).
We tested rolling back those changes to DateYearRegex and also brought in the changes made in PR #2439; with both in place, all the tests passed for both cultures on Java 8, Java 11, and Java 15.
We will review PR #2430 to determine whether rolling back those changes is really necessary, and why 🙂
Tests passing using Java 8/11/15
