
Fix large strings handling in nvtext::character_tokenize

Open · davidwendt opened this issue 9 months ago · 0 comments

Description

Fix the logic in nvtext::character_tokenize to handle large strings input. Because the output turns each character into its own row, an input strings column larger than 2GB will likely overflow the size_type row count, as expected. The thrust::count_if call is replaced with a raw kernel that produces a count wide enough to be checked against the maximum row size. The API is also changed to no longer accept null rows, since the code does not check for them and can return invalid results for inputs containing unsanitized null rows.
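For illustration, here is a minimal sketch of the raw-kernel counting approach described above. This is not the actual cudf implementation: the kernel and helper names (count_characters_kernel, compute_output_count) are hypothetical, and it assumes the input is the raw UTF-8 chars buffer of a strings column with no nulls. The key point is accumulating the character count into a 64-bit value so it can be checked against the size_type maximum before any output column is built.

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical kernel: counts UTF-8 character starts across the entire
// chars buffer. Continuation bytes have the form 10xxxxxx and are skipped,
// so each multi-byte character is counted exactly once.
__global__ void count_characters_kernel(char const* d_chars,
                                        int64_t num_bytes,
                                        unsigned long long* d_count)
{
  auto const idx =
    static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx >= num_bytes) return;
  auto const byte = static_cast<unsigned char>(d_chars[idx]);
  if ((byte & 0xC0) != 0x80) { atomicAdd(d_count, 1ULL); }
}

// Hypothetical host helper: returns the total character count, throwing if
// the result would overflow a 32-bit size_type row count. In cudf proper,
// the API would also reject null rows up front (e.g. with CUDF_EXPECTS)
// since the kernel does not check for them.
int64_t compute_output_count(char const* d_chars, int64_t num_bytes)
{
  unsigned long long* d_count{};
  cudaMalloc(&d_count, sizeof(unsigned long long));
  cudaMemset(d_count, 0, sizeof(unsigned long long));

  auto const threads = 256;
  auto const blocks =
    static_cast<int>((num_bytes + threads - 1) / threads);
  count_characters_kernel<<<blocks, threads>>>(d_chars, num_bytes, d_count);

  unsigned long long h_count{};
  cudaMemcpy(&h_count, d_count, sizeof(h_count), cudaMemcpyDeviceToHost);
  cudaFree(d_count);

  // Each character becomes one output row, so reject any result that
  // exceeds the maximum size_type value (int32_t in cudf).
  if (h_count > static_cast<unsigned long long>(
                  std::numeric_limits<int32_t>::max())) {
    throw std::overflow_error(
      "output exceeds column size limit in character_tokenize");
  }
  return static_cast<int64_t>(h_count);
}
```

A thrust::count_if over a chars buffer longer than 2GB is exactly where a 32-bit count can silently wrap; doing the count in a raw kernel with a 64-bit accumulator makes the overflow detectable instead of undefined.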

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

davidwendt · May 22 '24 21:05