
Fix large strings handling in nvtext::character_tokenize

Open · davidwendt opened this issue 9 months ago · 0 comments

Description

Fix the logic in nvtext::character_tokenize to handle large strings input. Because the output turns each character into its own row, an input strings column larger than 2GB will likely overflow the size_type row count, as expected. The thrust::count_if call is replaced with a raw kernel that produces a count wide enough to be checked against the maximum row size. The API is also changed to no longer accept null rows, since the code does not check for them and can return invalid results for inputs containing unsanitized null rows.
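For illustration, here is a minimal sketch of the raw-kernel counting approach described above. This is not the actual cudf implementation: the kernel and helper names (count_characters_kernel, compute_output_count) are hypothetical, and it assumes the input is the raw UTF-8 chars buffer of a strings column with no nulls. The key point is accumulating the character count into a 64-bit value so it can be checked against the size_type maximum before any output column is built.

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical kernel: counts UTF-8 character starts across the entire
// chars buffer. Continuation bytes have the form 10xxxxxx and are skipped,
// so each multi-byte character is counted exactly once.
__global__ void count_characters_kernel(char const* d_chars,
                                        int64_t num_bytes,
                                        unsigned long long* d_count)
{
  auto const idx =
    static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx >= num_bytes) return;
  auto const byte = static_cast<unsigned char>(d_chars[idx]);
  if ((byte & 0xC0) != 0x80) { atomicAdd(d_count, 1ULL); }
}

// Hypothetical host helper: returns the total character count, throwing if
// the result would overflow a 32-bit size_type row count. In cudf proper,
// the API would also reject null rows up front (e.g. with CUDF_EXPECTS)
// since the kernel does not check for them.
int64_t compute_output_count(char const* d_chars, int64_t num_bytes)
{
  unsigned long long* d_count{};
  cudaMalloc(&d_count, sizeof(unsigned long long));
  cudaMemset(d_count, 0, sizeof(unsigned long long));

  auto const threads = 256;
  auto const blocks =
    static_cast<int>((num_bytes + threads - 1) / threads);
  count_characters_kernel<<<blocks, threads>>>(d_chars, num_bytes, d_count);

  unsigned long long h_count{};
  cudaMemcpy(&h_count, d_count, sizeof(h_count), cudaMemcpyDeviceToHost);
  cudaFree(d_count);

  // Each character becomes one output row, so reject any result that
  // exceeds the maximum size_type value (int32_t in cudf).
  if (h_count > static_cast<unsigned long long>(
                  std::numeric_limits<int32_t>::max())) {
    throw std::overflow_error(
      "output exceeds column size limit in character_tokenize");
  }
  return static_cast<int64_t>(h_count);
}
```

A thrust::count_if over a chars buffer longer than 2GB is exactly where a 32-bit count can silently wrap; doing the count in a raw kernel with a 64-bit accumulator makes the overflow detectable instead of undefined.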

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

davidwendt · May 22 '24 21:05