re2j

\b does not behave like it does with java.util.regex.Pattern

Open mykeul opened this issue 4 years ago • 6 comments

Word boundaries should use \p{L}, not just A-Za-z, to behave like the default regex in Java. I added some tests showing the issue and fixed it in this PR: https://github.com/google/re2j/pull/100 (but I had to disable a large unit test that I don't know how to adapt to support this change).

mykeul avatar Jun 10 '20 15:06 mykeul

I saw that PR #100 was closed, but java.util.regex.Pattern behaves as it should here. Take the French word "été": it is a real word, so word boundaries should match around it, shouldn't they? Current re2j doesn't match them. The wiki page should be updated to reflect this, not used as a reason to refuse improvements. PR #100 should be applied (the new unit tests show the behaviour mismatches).

mykeul avatar Jun 23 '20 12:06 mykeul

As we describe in the package documentation, RE2/J implements the behavior specified by https://github.com/google/re2/wiki/Syntax. As noted on the GitHub page, RE2/J is not a drop-in replacement for java.util.regex.Pattern, for this and a host of other reasons.

I raised https://groups.google.com/u/1/g/re2-dev/c/nyGkxcJKExY with re2-dev to see why RE2 does not implement word boundary matching in this way.

I'm not in a position to document every way in which RE2/J differs from java.util.regex. Some of the differences are noted on the GitHub page; others will be described in the RE2 syntax document (e.g. \b unambiguously implements ASCII word boundary matching, which is different from java.util.regex).

sjamesr avatar Jun 26 '20 15:06 sjamesr
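
To make the ASCII vs. Unicode distinction discussed above concrete, here is a minimal sketch that computes word boundaries by hand under both rules. It does not call RE2/J or java.util.regex. `isAsciiWord` follows the `[0-9A-Za-z_]` definition of `\w` from the RE2 syntax page, while `isUnicodeWord` (letters, digits, underscore) is only an assumed stand-in for a Unicode-aware rule. All class and method names are hypothetical.

```java
import java.util.function.IntPredicate;

class WordBoundaryDemo {
    // ASCII word character as listed on the RE2 syntax page: [0-9A-Za-z_].
    static boolean isAsciiWord(int cp) {
        return (cp >= '0' && cp <= '9')
            || (cp >= 'A' && cp <= 'Z')
            || (cp >= 'a' && cp <= 'z')
            || cp == '_';
    }

    // Assumed Unicode-aware rule for illustration: any letter, digit, or underscore.
    static boolean isUnicodeWord(int cp) {
        return Character.isLetterOrDigit(cp) || cp == '_';
    }

    // A boundary exists at index i when exactly one neighbour is a word character.
    static boolean isBoundary(String s, int i, IntPredicate isWord) {
        boolean before = i > 0 && isWord.test(s.codePointBefore(i));
        boolean after = i < s.length() && isWord.test(s.codePointAt(i));
        return before != after;
    }

    public static void main(String[] args) {
        String word = "\u00e9t\u00e9"; // "été", written with escapes to avoid encoding issues
        for (int i = 0; i <= word.length(); i++) {
            System.out.printf("index %d: ascii=%b unicode=%b%n", i,
                isBoundary(word, i, WordBoundaryDemo::isAsciiWord),
                isBoundary(word, i, WordBoundaryDemo::isUnicodeWord));
        }
    }
}
```

Under the ASCII rule the only boundaries in "été" fall on either side of the "t" (indices 1 and 2), so a pattern like `\bété\b` cannot match the whole word. Under the Unicode-aware rule the boundaries fall at the start and end of the word (indices 0 and 3).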

I guess/hope the wiki page documents what the code does, but maybe it should not limit what the code should do. IMHO the page should be updated. Word boundaries for French words, and for many other languages, are mandatory for my usage, and accents are part of that. I moved to re2j because I need longest matches, and that was easier to implement/patch there than with Java's regex. Why not try to make both libraries almost equivalent? Why not make re2j the optimal regex library with both functionalities? (US users will not notice. This is the same "fight" again and again: ASCII vs Unicode, the same fight that brought us "code pages", which we would both like, I hope, to forget forever.)

mykeul avatar Jun 29 '20 15:06 mykeul

I left a comment on the closed original PR: a hack for people who want the PR. I still need it :-/

mykeul avatar Jun 15 '21 17:06 mykeul

We ran into this problem in Velox when matching German strings:

SELECT REGEXP_LIKE('Insidern auch als Grenzenüberschreiter bekannt', '(?i)(\b)Grenzen(\b)')

This query returns 'true' while we (and our users) expect 'false'.

Is there any workaround?

CC: @zacw7

mbasmanova avatar Jan 23 '24 01:01 mbasmanova

Same issue exists in RE2 as well: https://github.com/google/re2/issues/344

mbasmanova avatar Jan 23 '24 01:01 mbasmanova
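
On the workaround question above, one possible approach is to replace `\b` with an explicit "start of input or non-word character" alternation on each side, built from Unicode property classes. This needs no lookaround, which RE2 does not support. The sketch below is a hedged illustration, not a tested RE2/J recipe. It runs the pattern with java.util.regex so that it is self-contained, and it assumes, without verification here, that RE2/J accepts the same syntax. The class name, the `containsWord` helper, and the choice of `\p{L}\p{N}_` as the word-character set are all assumptions.

```java
import java.util.regex.Pattern;

class UnicodeBoundaryWorkaround {
    // Anything that is not a Unicode letter, number, or underscore.
    static final String NOT_WORD = "[^\\p{L}\\p{N}_]";

    // Hypothetical helper: 'word' is assumed to contain no regex metacharacters.
    static boolean containsWord(String text, String word) {
        Pattern p = Pattern.compile(
            "(?i)(?:^|" + NOT_WORD + ")" + word + "(?:$|" + NOT_WORD + ")");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // "Grenzenüberschreiter": "Grenzen" is followed by the letter ü, so no match.
        String german = "Insidern auch als Grenzen\u00fcberschreiter bekannt";
        System.out.println(containsWord(german, "Grenzen"));
        // Here "Grenzen" stands alone between spaces, so it matches.
        System.out.println(containsWord("die Grenzen sind offen", "Grenzen"));
    }
}
```

Unlike `\b`, the alternation consumes the neighbouring character, so the matched span can include one extra character on each side. That does not matter for a boolean check in the style of REGEXP_LIKE, but it does matter when extracting or replacing the match, where a capture group around the word would be needed.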