cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Add read-only functions on string dtypes to `DataFrame.apply` and `Series.apply`

Open brandon-b-miller opened this issue 2 years ago • 3 comments

This PR provides initial support for string data inside UDFs passed to DataFrame.apply and Series.apply. The allowed APIs are based on python's str class. It aims to implement python string semantics as closely as possible starting with APIs that return numeric data only. These are the following 21 functions:

  • str.count
  • str.startswith
  • str.endswith
  • str.find
  • str.rfind
  • str.isalnum
  • str.isdecimal
  • str.isdigit
  • str.islower
  • str.isupper
  • str.isalpha
  • str.istitle
  • str.isspace
  • ==, !=, >=, <=, >, < (between two strings)
  • len
  • __contains__

The following 3 functions are not included due to having no libcudf equivalent code available to back them (due to them referring to python concepts)

  • str.isascii
  • str.isidentifier
  • str.isprintable

This works by creating a library of __device__ functions based on libcudf which perform the above functions for one single string. The rest of the code is a library of numba extensions that replace a python UDF with a chain of those __device__ functions and creates a kernel that calls the result across a grid of threads, taking a full column of strings as input.

cc @davidwendt @gmarkall

brandon-b-miller avatar Jul 21 '22 13:07 brandon-b-miller

rerun tests

brandon-b-miller avatar Aug 08 '22 18:08 brandon-b-miller

rerun tests

brandon-b-miller avatar Aug 08 '22 19:08 brandon-b-miller

rerun tests

raydouglass avatar Aug 08 '22 19:08 raydouglass

rerun tests

brandon-b-miller avatar Aug 22 '22 19:08 brandon-b-miller

This should be ready for review.

brandon-b-miller avatar Sep 02 '22 18:09 brandon-b-miller

Part way through reviewing I realized that there are a number of files in the strings_udf/cpp directory that are just copied from libcudf. Where do we run into problems with using the original files directly?

This should not be the case. I created several libcudf PRs to move functions around to avoid duplication. The last one just merged a few days ago which means we can replace the strip_utils.cuh here with the one from https://github.com/rapidsai/strings_udf/blob/main/cpp/include/cudf/strings/detail/strip_utils.cuh

Also, I mentioned in a different comment that the integers.cuh probably can be deleted since it is not used in this PR.

If you find any others, we should open an issue to fix them in follow up PRs.

davidwendt avatar Sep 09 '22 02:09 davidwendt

Codecov Report

Base: 86.39% // Head: 85.97% // Decreases project coverage by -0.42% :warning:

Coverage data is based on head (036eeec) compared to base (0ba4675). Patch coverage: 66.86% of modified lines in pull request are covered.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.10   #11319      +/-   ##
================================================
- Coverage         86.39%   85.97%   -0.43%     
================================================
  Files               145      151       +6     
  Lines             23021    23477     +456     
================================================
+ Hits              19890    20185     +295     
- Misses             3131     3292     +161     
Impacted Files Coverage Δ
python/cudf/cudf/core/udf/strings_lowering.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/udf/strings_typing.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/udf/__init__.py 50.00% <50.00%> (-50.00%) :arrow_down:
python/strings_udf/strings_udf/lowering.py 84.67% <84.67%> (ø)
python/cudf/cudf/core/udf/masked_typing.py 96.52% <91.17%> (ø)
python/cudf/cudf/core/udf/utils.py 97.00% <94.11%> (-1.64%) :arrow_down:
python/strings_udf/strings_udf/_typing.py 95.78% <95.78%> (ø)
python/strings_udf/strings_udf/__init__.py 96.87% <96.87%> (ø)
python/cudf/cudf/core/indexed_frame.py 92.12% <100.00%> (-0.07%) :arrow_down:
python/cudf/cudf/core/udf/masked_lowering.py 100.00% <100.00%> (ø)
... and 9 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Sep 13 '22 00:09 codecov[bot]

rerun tests

brandon-b-miller avatar Sep 14 '22 23:09 brandon-b-miller

Can you remerge with 22.10? There are bunch of extra files in the PR.

davidwendt avatar Sep 20 '22 14:09 davidwendt

Can you remerge with 22.10? There are bunch of extra files in the PR.

This should be resolved with the change of base.

brandon-b-miller avatar Sep 20 '22 14:09 brandon-b-miller

@gpucibot merge

brandon-b-miller avatar Sep 20 '22 18:09 brandon-b-miller