Fix large strings handling in nvtext::character_tokenize
Description
Fixes the logic in nvtext::character_tokenize
to handle large strings input. Because the output turns each character into a row, an input strings column larger than 2GB will likely overflow the size_type
row count. The thrust::count_if
call is replaced with a raw kernel that produces a count wide enough to be checked against the maximum row count.
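The sketch below illustrates the general idea of counting characters with a raw kernel that accumulates into a 64-bit total, so the count cannot overflow before it is validated. The kernel name, launch shape, and the `is_char_begin` helper are illustrative assumptions, not the PR's actual code; a production kernel would likely use a block-level reduction rather than one atomic per byte.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// A UTF-8 byte begins a character unless it is a continuation
// byte of the form 10xxxxxx.
__device__ inline bool is_char_begin(uint8_t byte)
{
  return (byte & 0xC0) != 0x80;
}

// Hypothetical raw kernel: each thread inspects one byte of the
// chars data and atomically adds to an int64_t total, avoiding
// any intermediate size_type overflow. For brevity this uses one
// atomic per character-start byte.
__global__ void count_characters_kernel(uint8_t const* d_chars,
                                        int64_t chars_size,
                                        int64_t* d_count)
{
  auto const idx =
    static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx >= chars_size) { return; }
  if (is_char_begin(d_chars[idx])) {
    atomicAdd(reinterpret_cast<unsigned long long*>(d_count), 1ull);
  }
}
```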
Also changes the API to reject null rows, since the code does not check for them and can return invalid results for inputs containing unsanitized null rows.
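A minimal sketch of the kind of precondition checks this change implies, assuming host-side variables `input` (the input strings column view) and `char_count` (the 64-bit count from the kernel above); the exact messages and call sites are assumptions:

```cpp
#include <cudf/utilities/error.hpp>
#include <limits>

// Hypothetical checks at the top of character_tokenize: reject null
// rows up front, and verify the character count fits in size_type
// before building the output column.
CUDF_EXPECTS(input.null_count() == 0,
             "input must not contain null rows");
CUDF_EXPECTS(char_count <= static_cast<int64_t>(
               std::numeric_limits<cudf::size_type>::max()),
             "output exceeds the column size limit");
```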
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.