re2j icon indicating copy to clipboard operation
re2j copied to clipboard

Fixed word boundary patterns: ...

Open mykeul opened this issue 5 years ago • 11 comments

... they were ascii-only while java.util.regexp is unicode compliant

mykeul avatar Dec 05 '19 14:12 mykeul

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

googlebot avatar Dec 05 '19 14:12 googlebot

@googlebot I signed it!

mykeul avatar Dec 05 '19 15:12 mykeul

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

googlebot avatar Dec 05 '19 15:12 googlebot

Pull Request Test Coverage Report for Build 281

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 15 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.09%) to 93.957%

Files with Coverage Reduction New Missed Lines %
/home/travis/build/google/re2j/java/com/google/re2j/Regexp.java 1 94.58%
/home/travis/build/google/re2j/java/com/google/re2j/Simplify.java 1 93.75%
/home/travis/build/google/re2j/java/com/google/re2j/Unicode.java 6 84.31%
/home/travis/build/google/re2j/java/com/google/re2j/Parser.java 7 94.51%
<!-- Total: 15
Totals Coverage Status
Change from base Build 280: -0.09%
Covered Lines: 2985
Relevant Lines: 3177

💛 - Coveralls

coveralls avatar Dec 05 '19 16:12 coveralls

Sorry, I had to @Ignore a large test I don't know how to fix.

mykeul avatar Dec 05 '19 16:12 mykeul

RE2/J tries to implement RE2's syntax, which specifies that \b match ASCII word boundaries only. I think this behavior is working as intended.

sjamesr avatar Jun 22 '20 02:06 sjamesr

I still do not understand why such PR making life easier for non-US devs is refused : this just makes transitions to re2j easier in many more use-cases, by reducing differences with jdk's Regexp. For people like me who needs both https://github.com/google/re2j/pull/97 and not just "ascii regexps", here is the hack I'm using for more than one year (just need to call one of the static patch() method depending on the app context but asap, just with javassist in classpath) : https://gist.github.com/mykeul/2d69f72bcb73ff7a72f167a86329d0a8

mykeul avatar Jun 15 '21 17:06 mykeul

I was clear as to why I refused the PR: because it caused RE2J to differ from RE2 in behavior. It has nothing to do with making life more difficult for non-US devs.

The question should be whether RE2J should try sticking to RE2 syntax and semantics, or whether it should favor similarity to java.util.regex when possible. I'm inclined to think that RE2J should behave more like java.util.regex, since that's what Java developers will most likely be comparing it to.

This PR probably breaks tests because the tests are derived from Go's regexp package, which implements RE2's ASCII-only word boundaries. I'll see what is necessary to fix the tests.

sjamesr avatar Jun 15 '21 18:06 sjamesr

Codecov Report

Merging #100 (e49d00b) into master (b3bce7b) will decrease coverage by 0.22%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #100      +/-   ##
==========================================
- Coverage   89.25%   89.02%   -0.23%     
==========================================
  Files          19       19              
  Lines        3025     3026       +1     
  Branches      607      607              
==========================================
- Hits         2700     2694       -6     
- Misses        187      190       +3     
- Partials      138      142       +4     
Impacted Files Coverage Δ
java/com/google/re2j/Unicode.java 81.08% <100.00%> (+17.19%) :arrow_up:
java/com/google/re2j/Utils.java 83.33% <100.00%> (ø)
java/com/google/re2j/Simplify.java 84.37% <0.00%> (-6.25%) :arrow_down:
java/com/google/re2j/Regexp.java 90.64% <0.00%> (-0.99%) :arrow_down:
java/com/google/re2j/Parser.java 87.21% <0.00%> (-0.89%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update b3bce7b...e49d00b. Read the comment docs.

codecov-commenter avatar Jun 15 '21 18:06 codecov-commenter

Hi James, many thanks, (Sorry, I'm french, I'm not an exception to the world wide known "préjugé" : my english is bad, my terms may be inapropriated) I got the point. Was just trying to share my experiments, my eventual fixes, and potential workarounds for this not accepted PR. Maybe RE2 would benefit from this too ? I came to re2j for the potential longest matches that seemed impossible with java.util.regex, and proposed https://github.com/google/re2j/pull/97 and a few accepeted optimisations. I need this gist in multiple projects now, just tried to share it, but really don't want to fork to an "unicode-re2j". I'll try this/an evening (UTC+2) to merge current code into the initial PR to make an eventual merge easier. Best regards, Mickaël (please note the "ë" ;-) )

mykeul avatar Jun 16 '21 12:06 mykeul

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

google-cla[bot] avatar Jun 17 '21 17:06 google-cla[bot]