Track R-GSOC-2016 Progress

Open qinwf opened this issue 8 years ago • 5 comments

qinwf avatar Feb 10 '16 14:02 qinwf

Qin,

Congratulations, as you probably already know, the RE2 project has been accepted!

gagolews avatar Apr 23 '16 08:04 gagolews

Project Status Report

May 19 - 2016

Changes during Community Bonding

1. Set up continuous integration and code coverage tests

The package is now checked by CI on Mac, Linux, and Windows, and code coverage is tracked with codecov.io.

2. More docs and tests

Add more docs and test cases for both new and existing functions.

3. Documentation Pages

Initial work on the documentation pages: https://qinwf.github.io/re2r_doc/

4. Parallel Support

All pattern matching routines have been implemented to work in parallel with RcppParallel.

5. Add split and locate functions

Add split and locate methods for pattern matching.

6. Add regular expression visualization with the regexper library

Add a show_regex function to visualize RE2 regular expressions.

7. Improve Performance

Use Google Performance Tools to profile the compiled C++ code. Rewrite some critical code using the raw R C API to avoid the overhead of Rcpp_PreserveObject and other Rcpp helper functions.

Issue Status

#3 Solaris build

The code is still changing from time to time, so Solaris testing can wait until later.

#4 Long Vector Tests

Initial test cases were added.

#5 Match failure when LC_COLLATE is not UTF-8

Use stringi::stri_enc_toutf8 to convert input strings and pattern strings to UTF-8. The changes have landed.

Initial test cases were added.

#6 Question: argument order

Change the argument order from (pattern, string) to (string, pattern). The changes have landed.

#7 Using SET_STRING_ELT and Rf_mkCharLenCE to handle output string encoding

The changes have landed.

One case still needs attention: Rcpp exception strings use the native encoding instead of UTF-8, so if a pattern cannot be parsed, the error message raised from Rcpp may contain garbled characters. To fix this, we can remove the Rcpp dependency in the near future.

Most of the code is now independent of Rcpp, so this should be easy to fix.

#8 Handle NA_STRING

All pattern matching routines have been implemented, including match, replace, detect, extract, split, locate, and quote.

Initial test cases were added.
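
To make the NA_STRING handling in #8 concrete, here is a minimal C++17 sketch of missing-value propagation for a detect-style routine. It is not re2r's code: the names `MaybeString` and `detect_with_na` are hypothetical, `std::nullopt` stands in for NA_STRING and NA_LOGICAL, and `std::regex` stands in for RE2 so the example has no external dependency. It assumes the stringi-style convention that a missing input produces a missing output.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for an R character element that may be NA_STRING.
using MaybeString = std::optional<std::string>;

// Hypothetical helper: test each string against one pattern.
// Assumption: a missing input yields a missing output (stringi-style).
std::vector<std::optional<bool>> detect_with_na(
        const std::vector<MaybeString>& strings, const std::string& pattern) {
    const std::regex re(pattern);  // compile the pattern once
    std::vector<std::optional<bool>> out;
    out.reserve(strings.size());
    for (const auto& s : strings) {
        if (!s) {
            out.push_back(std::nullopt);  // missing in, missing out
        } else {
            out.push_back(std::regex_search(*s, re));
        }
    }
    assert(out.size() == strings.size());  // one result per input element
    return out;
}
```

With the inputs "abc", a missing value, and "xyz" and the pattern "b", the result under this convention is true, missing, false.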

Future Plan

1. Follow the timeline in the proposal

See the proposal.

2. Make functions vectorized

Make functions accept multiple patterns with multiple strings.

3. Add more test cases and close existing issues

Add more test cases and improve the test coverage ratio.

4. Possibly explore some new ideas and refine the APIs
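
For plan item 2, one possible shape for "multiple patterns with multiple strings" is R-style recycling of the shorter vector. The C++ sketch below is an illustration under that assumption, not the planned implementation: `detect_vectorized` is a hypothetical name, and `std::regex` stands in for RE2 to keep the example dependency-free.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: pair strings[i] with patterns[i], recycling the
// shorter vector. Assumption: zero-length input gives zero-length output.
std::vector<bool> detect_vectorized(const std::vector<std::string>& strings,
                                    const std::vector<std::string>& patterns) {
    std::vector<bool> out;
    if (strings.empty() || patterns.empty()) return out;

    // Compile each pattern once rather than once per string.
    std::vector<std::regex> compiled;
    compiled.reserve(patterns.size());
    for (const auto& p : patterns) compiled.emplace_back(p);

    const std::size_t n = std::max(strings.size(), patterns.size());
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::string& s = strings[i % strings.size()];
        const std::regex& re = compiled[i % compiled.size()];
        out.push_back(std::regex_search(s, re));
    }
    assert(out.size() == n);  // output length is the longer input length
    return out;
}
```

The common case of a single pattern applied to many subjects falls out of the same loop, because a length-one pattern vector is reused for every string.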

Thanks for any help and advice!

@gagolews @tdhock

qinwf avatar May 18 '16 14:05 qinwf

You're way ahead of the timeline! Theoretically, you should now "Look for examples of how regular expressions are used in existing R packages." :stuck_out_tongue: Congrats!

gagolews avatar May 18 '16 14:05 gagolews

About vectorizing: I think it is mainly necessary to vectorize the subject (not the pattern), since the typical usage is "apply this single regex to this set of subjects".

tdhock avatar May 19 '16 07:05 tdhock

On the other hand, @qinwf could make the API as similar to stringi (and hence stringr) as possible. Who knows, maybe re2r will some day be wrapped by stringr too.

gagolews avatar May 19 '16 09:05 gagolews