
Add regex ASCII flag support for matching builtin character classes

Open davidwendt opened this issue 2 years ago • 1 comments

Description

Adds an ASCII flag to the libcudf regex_flags for use with the builtin character classes: \w, \W, \s, \S, \d, \D. This is somewhat equivalent to Python's re.ASCII (https://docs.python.org/3/library/re.html#re.ASCII). Strictly, the flag modifies matching for these classes as follows:

  • \w = [a-zA-Z0-9_] (alphanumeric characters plus underscore)
  • \W = [^\w] (basically not \w)
  • \s = [\t- ] (tab through space in the ASCII table)
  • \S = [^\s] (basically not \s)
  • \d = [0-9] (digit characters)
  • \D = [^\d] (basically not \d)
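
To illustrate the kind of behavior change an ASCII flag produces, here is a minimal sketch that uses Python's own re.ASCII as a stand-in. It does not call libcudf, and the variable names are made up for the example. It also spells out the tab-through-space range listed above for \s as an explicit character class, which shows one place where the two are only "somewhat" equivalent: Python's ASCII \s is limited to [ \t\n\r\f\v], while the range 0x09 through 0x20 also covers control characters such as 0x1C.

```python
import re

# Python's re.ASCII is used here only as an analogy for the libcudf flag.
ascii_word = re.compile(r"\w", re.ASCII)
ascii_digit = re.compile(r"\d", re.ASCII)

e_acute = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

# Default (Unicode) matching accepts non-ASCII letters and digits.
unicode_hits = [
    bool(re.fullmatch(r"\w", e_acute)),
    bool(re.fullmatch(r"\d", arabic_three)),
]

# With the ASCII flag, the same characters no longer match.
ascii_hits = [
    bool(ascii_word.fullmatch(e_acute)),
    bool(ascii_digit.fullmatch(arabic_three)),
]

# The \s range from the list above, written as an explicit class:
# tab (0x09) through space (0x20).
range_space = re.compile(r"[\t- ]")
python_ascii_space = re.compile(r"\s", re.ASCII)

file_separator = "\x1c"  # control character inside 0x09..0x20
range_hit = bool(range_space.fullmatch(file_separator))
python_hit = bool(python_ascii_space.fullmatch(file_separator))

print(unicode_hits, ascii_hits, range_hit, python_hit)
```

The explicit range is only an emulation of the description in this PR; the actual libcudf matching is implemented in C++ and should be checked against the gtests.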

Additional gtests are included for this flag with these classes. This will be exposed through Python/Cython in a follow-up PR.

Closes #10894

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

davidwendt avatar Jul 29 '22 19:07 davidwendt

Codecov Report

No coverage uploaded for pull request base (branch-22.10@0df6178). The diff coverage is n/a.

Current head 428bf8b differs from pull request most recent head f20a82e. Consider uploading reports for the commit f20a82e to get more accurate results.

@@               Coverage Diff               @@
##             branch-22.10   #11404   +/-   ##
===============================================
  Coverage                ?   86.35%           
===============================================
  Files                   ?      145           
  Lines                   ?    22945           
  Branches                ?        0           
===============================================
  Hits                    ?    19815           
  Misses                  ?     3130           
  Partials                ?        0           


codecov[bot] avatar Aug 01 '22 23:08 codecov[bot]

Does nobody use the regex engine from the C++ side? I am just surprised we don't have any API docs that explain the regex keywords we support.

Spark uses the C++ regex code. We have some good libcudf documentation here: https://docs.rapids.ai/api/libcudf/stable/md_regex.html

Looks like I should update it for this new flag.

davidwendt avatar Aug 12 '22 13:08 davidwendt

@gpucibot merge

davidwendt avatar Aug 15 '22 13:08 davidwendt