cudf
Add regex ASCII flag support for matching builtin character classes
Description
Adds an ASCII flag to the libcudf `regex_flags` to support the builtin character classes `\w`, `\W`, `\s`, `\S`, `\d`, `\D`.
This is somewhat equivalent to https://docs.python.org/3/library/re.html#re.ASCII
Strictly, though, the flag modifies matching for these classes as follows:
- `\w` = `[a-zA-Z_]` (alphabetic characters plus underline)
- `\W` = `[^\w]` (basically not `\w`)
- `\s` = `[\t- ]` (tab through space in the ASCII table)
- `\S` = `[^\s]` (basically not `\s`)
- `\d` = `[0-9]` (digit characters)
- `\D` = `[^\d]` (basically not `\d`)
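
To make the "somewhat equivalent" caveat concrete, here is a minimal sketch that uses only Python's standard `re` module, not libcudf. It assumes the explicit character classes exactly as written in the list above and compares them with Python's own `\w`, `\s`, `\d` with and without `re.ASCII`. The `LISTED` table and the `matches` helper are hypothetical names introduced for illustration. Note that Python's `re.ASCII` treats `\w` as `[a-zA-Z0-9_]` (digits included) and `\s` as only the six ASCII whitespace characters, so the two definitions are not identical.

```python
import re

# Explicit classes copied from the list above. These are stand-ins written
# as plain Python regex character classes; they are not libcudf calls.
LISTED = {
    r"\w": r"[a-zA-Z_]",
    r"\s": r"[\t- ]",  # range from tab (0x09) through space (0x20)
    r"\d": r"[0-9]",
}


def matches(pattern, ch, flags=0):
    """Return True if the single character ch fully matches pattern."""
    return re.fullmatch(pattern, ch, flags) is not None


# Characters chosen to probe the edges of each class.
samples = ["a", "_", "5", "é", " ", "\n", "\x1f", "\u00a0", "٣"]

for ch in samples:
    for cls, explicit in LISTED.items():
        print(
            f"{ch!r:>8} {cls}: "
            f"listed={matches(explicit, ch)} "
            f"py_ascii={matches(cls, ch, re.ASCII)} "
            f"py_unicode={matches(cls, ch)}"
        )
```

In this sketch the columns diverge in a few places: `"5"` matches Python's ASCII `\w` but not the listed `[a-zA-Z_]`, and the control character `"\x1f"` falls inside the listed `[\t- ]` range but is not matched by Python's ASCII `\s`. Non-ASCII input such as `"é"` or `"٣"` is rejected by both the listed classes and Python's ASCII mode.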
Additional gtests are included for this flag with these classes. This will be exposed through Python/Cython in a follow-up PR.
Closes #10894
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Codecov Report
:exclamation: No coverage uploaded for pull request base (branch-22.10@0df6178). The diff coverage is n/a.
:exclamation: Current head 428bf8b differs from pull request most recent head f20a82e. Consider uploading reports for the commit f20a82e to get more accurate results.
@@ Coverage Diff @@
## branch-22.10 #11404 +/- ##
===============================================
Coverage ? 86.35%
===============================================
Files ? 145
Lines ? 22945
Branches ? 0
===============================================
Hits ? 19815
Misses ? 3130
Partials ? 0
Does nobody use the regex engine from the C++ side? I am just surprised we don't have any API docs that explain the regex keywords we support.
Spark uses the C++ regex code. We have some good libcudf documentation here: https://docs.rapids.ai/api/libcudf/stable/md_regex.html
Looks like I should update it for this new flag.
@gpucibot merge