re2r
Track R-GSOC-2016 Progress
Qin,
Congratulations, as you probably already know, the RE2 project has been accepted!
Project Status Report
May 19, 2016
Changes during Community Bonding
1. Set up continuous integration and code coverage tests
The package is now checked by CI on Mac, Linux, and Windows, and code coverage is tracked by codecov.io.
2. More docs and tests
Add more docs and test cases for both new and existing functions.
3. Documentation Pages
Initial work on the documentation pages: https://qinwf.github.io/re2r_doc/
4. Parallel Support
All pattern matching routines have been implemented to work in parallel with RcppParallel.
5. Add split and locate functions
Add `split` and `locate` methods for pattern matching.
6. Add regular expression visualization with the regexper library
Add a `show_regex` function to visualize RE2 regular expressions.
7. Improve Performance
Use Google Performance Tools to profile the compiled C++ code. Rewrite some critical code using the raw R C API to avoid the overhead of `Rcpp_PreserveObject` and other Rcpp helper functions.
Issue Status
#3 Solaris build
There will be changes now and then. We can test Solaris in the future.
#4 Long Vector Tests
Initial test cases were added.
#5 Match failure when LC_COLLATE is not UTF-8
Use `stringi::stri_enc_toutf8` to convert input strings and pattern strings to UTF-8. Changes have landed.
Initial test cases were added.
#6 Question: argument order
Change the argument order from `(pattern, string)` to `(string, pattern)`. Changes have landed.
#7 Using SET_STRING_ELT and Rf_mkCharLenCE to handle output string encoding
Changes have landed.
One case still needs care: Rcpp exception strings use the native encoding instead of UTF-8, so if a pattern cannot be parsed, the error message raised from Rcpp may contain garbled characters. To fix this, we can remove the Rcpp dependency in the near future.
Most of the code is now independent of Rcpp, so this should be easy to fix.
#8 Handle NA_STRING
All pattern matching routines have been implemented, including `match`, `replace`, `detect`, `extract`, `split`, `locate`, and `quote`.
Initial test cases were added.
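As an illustration of what handling `NA_STRING` can look like on the C++ side, here is a minimal sketch. It assumes the convention common to R string functions that a missing input yields a missing result; the report does not spell out re2r's exact behaviour. `detect_with_na`, `MaybeString`, and `MaybeBool` are hypothetical names, `std::nullopt` stands in for `NA`, and `std::regex` stands in for RE2 so the example compiles on its own (C++17).

```cpp
#include <optional>
#include <regex>
#include <string>
#include <vector>

// std::nullopt plays the role of R's NA_STRING / NA_LOGICAL in this sketch.
using MaybeString = std::optional<std::string>;
using MaybeBool = std::optional<bool>;

// Hypothetical helper: report whether `pattern` occurs anywhere in each input.
// std::regex is only a stand-in here; re2r itself is built on RE2.
std::vector<MaybeBool> detect_with_na(const std::vector<MaybeString>& inputs,
                                      const std::string& pattern) {
    const std::regex re(pattern);  // compile the pattern once
    std::vector<MaybeBool> out;
    out.reserve(inputs.size());
    for (const MaybeString& s : inputs) {
        if (!s) {
            out.push_back(std::nullopt);  // assumed rule: missing in, missing out
        } else {
            out.push_back(std::regex_search(*s, re));
        }
    }
    return out;
}
```

Checking for the missing value before touching the string keeps the matcher from ever seeing a placeholder such as the literal text "NA", which would otherwise be matched like any other input.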
Future Plan
1. Follow the timeline in the proposal
See the proposal.
2. Make functions vectorized
Make functions accept multiple patterns with multiple strings.
3. Add more test cases and close existing issues
Add more test cases and improve the test coverage ratio.
4. Possibly explore new ideas and refine the APIs
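For item 2, one possible reading of "multiple patterns with multiple strings" is element-wise pairing with R-style recycling of the shorter argument. The sketch below assumes that reading; it is not taken from the re2r sources. `detect_recycled` is a hypothetical name, and `std::regex` again stands in for RE2.

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: pair strings with patterns element-wise, recycling
// the shorter vector (assumed semantics, modelled on R's recycling rule).
// std::regex is a stand-in; re2r itself is built on RE2.
std::vector<bool> detect_recycled(const std::vector<std::string>& strings,
                                  const std::vector<std::string>& patterns) {
    // A zero-length argument gives a zero-length result.
    if (strings.empty() || patterns.empty()) return {};

    // Compile each pattern once, not once per string.
    std::vector<std::regex> compiled;
    compiled.reserve(patterns.size());
    for (const std::string& p : patterns) compiled.emplace_back(p);

    const std::size_t n = std::max(strings.size(), patterns.size());
    std::vector<bool> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::string& s = strings[i % strings.size()];
        const std::regex& re = compiled[i % compiled.size()];
        out[i] = std::regex_search(s, re);
    }
    return out;
}
```

With a single pattern this reduces to applying one regex to every string, so the common case costs one compilation regardless of how many strings are passed.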
Thanks for any help and advice!
@gagolews @tdhock
You're way ahead of the timeline! Theoretically, you should now "Look for examples of how regular expressions are used in existing R packages." :stuck_out_tongue: Congrats!
About vectorizing: I think it is mainly necessary to vectorize the subject (not the pattern), since the typical usage is "apply this single regex to this set of subjects".
On the other hand, @qinwf could make the API as similar to stringi (and hence stringr) as possible. Who knows, maybe re2r will some day be wrapped by stringr too.