
[FEA] Support read_text using a byte range without scanning the full source file

Open GregoryKimball opened this issue 3 years ago • 2 comments

Is your feature request related to a problem? Please describe. The current implementation of multibyte_split accepts a byte range as input so that only a limited portion of a large file is read. However, even when a byte range is provided, the kernel scans the full source file using the function multibyte_split_scan_full_source. When each worker in a distributed workflow reads the entire source file, the largest file we can process is around 10 GB before the workflow becomes bottlenecked by IO.

Describe the solution you'd like We would like a solution that accelerates the reading of large files. Some possible solutions:

  1. create an API for multibyte_split_scan_full_source so that the user can get all the record offsets, and then modify multibyte_split to accept byte ranges aligned with record offsets and skip calling multibyte_split_scan_full_source.
  2. enable a new function multibyte_split_scan_byte_range that only returns record offsets from the current byte_range, ignoring the possibility of quoted delimiters. Allow users to opt-in to this behavior.
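
To make option 2 concrete, here is a rough pure-Python sketch of scanning only the requested byte range. It does not use cudf, and split_byte_range is a hypothetical name, not an existing API. It assumes a record belongs to the range in which it starts, that a reader may run past the end of its range to finish its last record, that quoted delimiters are ignored, and that the delimiter cannot overlap with itself (for example "|" or "\r\n").

```python
def split_byte_range(data: bytes, delimiter: bytes, start: int, size: int) -> list[bytes]:
    """Return the records that START inside [start, start + size).

    Hypothetical helper for illustration only. Quoted delimiters are ignored.
    """
    end = min(start + size, len(data))
    if start == 0:
        pos = 0  # the first record of the file always starts at offset 0
    else:
        # A record start at or after `start` follows a delimiter that begins
        # no earlier than start - len(delimiter).
        found = data.find(delimiter, max(0, start - len(delimiter)))
        if found == -1:
            return []  # no record starts in this range
        pos = found + len(delimiter)

    records = []
    while pos < end:
        found = data.find(delimiter, pos)
        if found == -1:
            records.append(data[pos:])  # last record of the file
            break
        # May read past `end` to finish a record that started inside the range.
        records.append(data[pos:found])
        pos = found + len(delimiter)
    return records


data = b"alpha|beta|gamma|delta"
pieces = []
for range_start in range(0, len(data), 5):
    pieces.extend(split_byte_range(data, b"|", range_start, 5))
```

With adjacent ranges that cover the whole input, every record is produced by exactly one range, and no reader scans more than its own range plus the tail of its last record.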

Describe alternatives you've considered We could break the user files into smaller pieces before reading. Some of the files are a few TB and this would create an unnecessary burden.

Additional context TBD

GregoryKimball avatar Jul 05 '22 02:07 GregoryKimball

enable a new function multibyte_split_scan_byte_range that only returns record offsets from the current byte_range, ignoring the possibility of quoted delimiters. Allow users to opt-in to this behavior.

AFAIK, this could also be an option of multibyte_split.

vuule avatar Jul 05 '22 20:07 vuule

In a distributed scenario where each node processes a byte range, all byte ranges are adjacent and contiguous w.r.t. one another, and the byte ranges cover the entirety of the file, there is a third option: each node scans only its own subsection of the file and shares intermediate state with the other nodes.
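
One way to picture this third option is the pure-Python sketch below. It is not how multibyte_split tracks state. It assumes the only state that crosses a range boundary is "inside or outside a quoted field", that the delimiter is a single byte, and that quotes are plain toggles. ChunkSummary, scan_chunk, and resolve are hypothetical names. Each node records the delimiter offsets under both possible starting states, plus whether its range flips the quote state. A cheap pass over the per-range summaries then picks the right hypothesis for each range.

```python
from dataclasses import dataclass


@dataclass
class ChunkSummary:
    offsets_if_outside: list[int]  # delimiters, if the range starts outside quotes
    offsets_if_inside: list[int]   # delimiters, if the range starts inside quotes
    flips_quote_state: bool        # True if the range has an odd number of quotes


def scan_chunk(data: bytes, start: int, end: int,
               delimiter: int = ord("\n"), quote: int = ord('"')) -> ChunkSummary:
    """Scan only data[start:end], without knowing the incoming quote state."""
    outside, inside = [], []
    in_quote = False  # hypothesis: the range starts outside quotes
    for pos in range(start, end):
        byte = data[pos]
        if byte == quote:
            in_quote = not in_quote
        elif byte == delimiter:
            # A delimiter seen as quoted under the "outside" hypothesis is an
            # unquoted delimiter under the "inside" hypothesis, and vice versa.
            (inside if in_quote else outside).append(pos)
    return ChunkSummary(outside, inside, in_quote)


def resolve(summaries: list[ChunkSummary]) -> list[int]:
    """Combine per-range summaries, in file order, into record delimiter offsets."""
    offsets = []
    in_quote = False  # the file itself starts outside quotes
    for summary in summaries:
        offsets.extend(summary.offsets_if_inside if in_quote
                       else summary.offsets_if_outside)
        in_quote ^= summary.flips_quote_state
    return offsets


data = b'a,"x\ny"\nb\n"p\nq"\nc\n'
summaries = [scan_chunk(data, s, min(s + 4, len(data)))
             for s in range(0, len(data), 4)]
record_delimiters = resolve(summaries)
```

Only the small summaries need to be exchanged between nodes. The file contents are read once in total rather than once per node.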

cwharris avatar Jul 26 '22 14:07 cwharris