re2j
re2j copied to clipboard
Fixed word boundary patterns: ...
... they were ascii-only while java.util.regexp is unicode compliant
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
:memo: Please visit https://cla.developers.google.com/ to sign.
Once you've signed (or fixed any issues), please reply here with @googlebot I signed it!
and we'll verify it.
What to do if you already signed the CLA
Individual signers
- It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.
Corporate signers
- Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
- The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
- The email used to register you as an authorized contributor must also be attached to your GitHub account.
ℹ️ Googlers: Go here for more info.
@googlebot I signed it!
Pull Request Test Coverage Report for Build 281
- 0 of 0 changed or added relevant lines in 0 files are covered.
- 15 unchanged lines in 4 files lost coverage.
- Overall coverage decreased (-0.09%) to 93.957%
Files with Coverage Reduction | New Missed Lines | % |
---|---|---|
/home/travis/build/google/re2j/java/com/google/re2j/Regexp.java | 1 | 94.58% |
/home/travis/build/google/re2j/java/com/google/re2j/Simplify.java | 1 | 93.75% |
/home/travis/build/google/re2j/java/com/google/re2j/Unicode.java | 6 | 84.31% |
/home/travis/build/google/re2j/java/com/google/re2j/Parser.java | 7 | 94.51% |
<!-- | Total: | 15 |
Totals | |
---|---|
Change from base Build 280: | -0.09% |
Covered Lines: | 2985 |
Relevant Lines: | 3177 |
💛 - Coveralls
Sorry, I had to @Ignore a large test I don't know how to fix.
RE2/J tries to implement RE2's syntax, which specifies that \b
match ASCII word boundaries only. I think this behavior is working as intended.
I still do not understand why such PR making life easier for non-US devs is refused : this just makes transitions to re2j easier in many more use-cases, by reducing differences with jdk's Regexp. For people like me who needs both https://github.com/google/re2j/pull/97 and not just "ascii regexps", here is the hack I'm using for more than one year (just need to call one of the static patch() method depending on the app context but asap, just with javassist in classpath) : https://gist.github.com/mykeul/2d69f72bcb73ff7a72f167a86329d0a8
I was clear as to why I refused the PR: because it caused RE2J to differ from RE2 in behavior. It has nothing to do with making life more difficult for non-US devs.
The question should be whether RE2J should try sticking to RE2 syntax and semantics, or whether it should favor similarity to java.util.regex when possible. I'm inclined to think that RE2J should behave more like java.util.regex, since that's what Java developers will most likely be comparing it to.
This PR probably breaks tests because the tests are derived from Go's regexp package, which implements RE2's ASCII-only word boundaries. I'll see what is necessary to fix the tests.
Codecov Report
Merging #100 (e49d00b) into master (b3bce7b) will decrease coverage by
0.22%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## master #100 +/- ##
==========================================
- Coverage 89.25% 89.02% -0.23%
==========================================
Files 19 19
Lines 3025 3026 +1
Branches 607 607
==========================================
- Hits 2700 2694 -6
- Misses 187 190 +3
- Partials 138 142 +4
Impacted Files | Coverage Δ | |
---|---|---|
java/com/google/re2j/Unicode.java | 81.08% <100.00%> (+17.19%) |
:arrow_up: |
java/com/google/re2j/Utils.java | 83.33% <100.00%> (ø) |
|
java/com/google/re2j/Simplify.java | 84.37% <0.00%> (-6.25%) |
:arrow_down: |
java/com/google/re2j/Regexp.java | 90.64% <0.00%> (-0.99%) |
:arrow_down: |
java/com/google/re2j/Parser.java | 87.21% <0.00%> (-0.89%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update b3bce7b...e49d00b. Read the comment docs.
Hi James, many thanks, (Sorry, I'm french, I'm not an exception to the world wide known "préjugé" : my english is bad, my terms may be inapropriated) I got the point. Was just trying to share my experiments, my eventual fixes, and potential workarounds for this not accepted PR. Maybe RE2 would benefit from this too ? I came to re2j for the potential longest matches that seemed impossible with java.util.regex, and proposed https://github.com/google/re2j/pull/97 and a few accepeted optimisations. I need this gist in multiple projects now, just tried to share it, but really don't want to fork to an "unicode-re2j". I'll try this/an evening (UTC+2) to merge current code into the initial PR to make an eventual merge easier. Best regards, Mickaël (please note the "ë" ;-) )
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.
. If the bot doesn't comment, it means it doesn't think anything has changed.
ℹ️ Googlers: Go here for more info.