cudf
cudf copied to clipboard
Add read-only functions on string dtypes to `DataFrame.apply` and `Series.apply`
This PR provides initial support for string data inside UDFs passed to DataFrame.apply
and Series.apply
. The allowed APIs are based on python's str
class. It aims to implement python string semantics as closely as possible starting with APIs that return numeric data only. These are the following 21 functions:
-
str.count
-
str.startswith
-
str.endswith
-
str.find
-
str.rfind
-
str.isalnum
-
str.isdecimal
-
str.isdigit
-
str.islower
-
str.isupper
-
str.isalpha
-
str.istitle
-
str.isspace
-
==
,!=
,>=
,<=
,>
,<
(between two strings) -
len
-
__contains__
The following 3 functions are not included due to having no libcudf equivalent code available to back them (due to them referring to python concepts)
-
str.isascii
-
str.isidentifier
-
str.isprintable
This works by creating a library of __device__
functions based on libcudf which perform the above functions for one single string. The rest of the code is a library of numba extensions that replace a python UDF with a chain of those __device__
functions and creates a kernel that calls the result across a grid of threads, taking a full column of strings as input.
cc @davidwendt @gmarkall
rerun tests
rerun tests
rerun tests
rerun tests
This should be ready for review.
Part way through reviewing I realized that there are a number of files in the strings_udf/cpp directory that are just copied from libcudf. Where do we run into problems with using the original files directly?
This should not be the case. I created several libcudf PRs to move functions around to avoid duplication. The last one just merged a few days ago which means we can replace the strip_utils.cuh
here with the one from https://github.com/rapidsai/strings_udf/blob/main/cpp/include/cudf/strings/detail/strip_utils.cuh
Also, I mentioned in a different comment that the integers.cuh
probably can be deleted since it is not used in this PR.
If you find any others, we should open an issue to fix them in follow up PRs.
Codecov Report
Base: 86.39% // Head: 85.97% // Decreases project coverage by -0.42%
:warning:
Coverage data is based on head (
036eeec
) compared to base (0ba4675
). Patch coverage: 66.86% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## branch-22.10 #11319 +/- ##
================================================
- Coverage 86.39% 85.97% -0.43%
================================================
Files 145 151 +6
Lines 23021 23477 +456
================================================
+ Hits 19890 20185 +295
- Misses 3131 3292 +161
Impacted Files | Coverage Δ | |
---|---|---|
python/cudf/cudf/core/udf/strings_lowering.py | 0.00% <0.00%> (ø) |
|
python/cudf/cudf/core/udf/strings_typing.py | 0.00% <0.00%> (ø) |
|
python/cudf/cudf/core/udf/__init__.py | 50.00% <50.00%> (-50.00%) |
:arrow_down: |
python/strings_udf/strings_udf/lowering.py | 84.67% <84.67%> (ø) |
|
python/cudf/cudf/core/udf/masked_typing.py | 96.52% <91.17%> (ø) |
|
python/cudf/cudf/core/udf/utils.py | 97.00% <94.11%> (-1.64%) |
:arrow_down: |
python/strings_udf/strings_udf/_typing.py | 95.78% <95.78%> (ø) |
|
python/strings_udf/strings_udf/__init__.py | 96.87% <96.87%> (ø) |
|
python/cudf/cudf/core/indexed_frame.py | 92.12% <100.00%> (-0.07%) |
:arrow_down: |
python/cudf/cudf/core/udf/masked_lowering.py | 100.00% <100.00%> (ø) |
|
... and 9 more |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
rerun tests
Can you remerge with 22.10? There are bunch of extra files in the PR.
Can you remerge with 22.10? There are bunch of extra files in the PR.
This should be resolved with the change of base.
@gpucibot merge