[FEA] Support read_text using a byte range without scanning the full source file
Is your feature request related to a problem? Please describe.
The current implementation of multibyte_split accepts a byte range so that callers can read a limited portion of a large file. However, even when a byte range is provided, the kernel scans the full source file using the function multibyte_split_scan_full_source. When each worker in a distributed workflow reads the entire source file, the largest file we can process is around 10GB before the workflow becomes bottlenecked by IO.
Describe the solution you'd like
We would like a solution that accelerates the reading of large files. Some possible solutions:
- create an API for multibyte_split_scan_full_source so that the user can get all the record offsets, and then modify multibyte_split to accept byte ranges aligned with record offsets and skip calling multibyte_split_scan_full_source.
- enable a new function multibyte_split_scan_byte_range that only returns record offsets from the current byte_range, ignoring the possibility of quoted delimiters. Allow users to opt-in to this behavior.
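To make the second option concrete, here is a rough host-side sketch in plain Python of what "only return record offsets from the current byte_range" could mean. The function name `record_offsets_in_range` and its signature are hypothetical and not part of cuDF. It assumes a record belongs to the range in which it starts, that the delimiter does not overlap with itself, and (as proposed) that quoted delimiters are ignored.

```python
import io
from typing import BinaryIO


def record_offsets_in_range(
    source: BinaryIO, delimiter: bytes, start: int, size: int
) -> list[int]:
    """Absolute offsets of records that begin inside [start, start + size).

    Hypothetical helper: only bytes in or just before the range are read,
    and quoted delimiters are NOT recognized.
    """
    end = start + size
    # Back up by one delimiter length so a delimiter that ends exactly at
    # `start` (or straddles it) is still seen.
    read_from = max(0, start - len(delimiter))
    source.seek(read_from)
    window = source.read(end - read_from)

    # The beginning of the file is always the start of the first record.
    offsets = [0] if start == 0 and size > 0 else []

    pos = window.find(delimiter)
    while pos != -1:
        record_start = read_from + pos + len(delimiter)
        if start <= record_start < end:
            offsets.append(record_start)
        pos = window.find(delimiter, pos + len(delimiter))
    return offsets


# Three workers, each scanning only its own slice of a 14-byte source.
source = io.BytesIO(b"aa::bbb::c::dd")
print(record_offsets_in_range(source, b"::", 0, 5))   # [0, 4]
print(record_offsets_in_range(source, b"::", 5, 5))   # [9]
print(record_offsets_in_range(source, b"::", 10, 4))  # [12]
```

Each worker touches only `size + len(delimiter)` bytes here. To materialize its last record, a worker would still have to read past the end of its range until the next delimiter, which this sketch leaves out.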
Describe alternatives you've considered
We could break the user files into smaller pieces before reading. Some of the files are a few TB and this would create an unnecessary burden.
Additional context
TBD
> enable a new function multibyte_split_scan_byte_range that only returns record offsets from the current byte_range, ignoring the possibility of quoted delimiters. Allow users to opt-in to this behavior.
AFAIK, could also be an option of multibyte_split.
In a distributed scenario where each node processes a byte range, all byte ranges are adjacent and contiguous w.r.t. one another, and the byte ranges cover the entirety of the file, there is a third option: each node scans only its own subsection of the file and shares intermediate state with the other nodes.
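A toy illustration of that third option, in plain Python rather than anything cuDF-specific. It assumes the only intermediate state is whether a chunk starts inside a quoted field, and it uses a single-byte delimiter to avoid boundary handling; the state multibyte_split actually tracks may well differ. The names `summarize_chunk` and `delimiter_offsets` are made up for this sketch. Each node scans its chunk once without knowing its starting state, records the delimiters under both hypotheses, and reports whether the chunk flips the quote state. A prefix XOR over those one-bit flags then tells every node which hypothesis was the right one.

```python
QUOTE = ord('"')


def summarize_chunk(chunk: bytes, base: int, delimiter: int):
    """Scan one chunk without knowing whether it starts inside quotes.

    Returns (offsets if the chunk starts outside quotes,
             offsets if the chunk starts inside quotes,
             whether the chunk contains an odd number of quotes).
    """
    if_outside, if_inside = [], []
    in_quotes = False  # hypothesis: the chunk starts outside quotes
    for i, byte in enumerate(chunk):
        if byte == QUOTE:
            in_quotes = not in_quotes
        elif byte == delimiter:
            # A delimiter seen "in quotes" under this hypothesis is a real
            # delimiter under the opposite hypothesis, and vice versa.
            (if_inside if in_quotes else if_outside).append(base + i)
    return if_outside, if_inside, in_quotes


def delimiter_offsets(data: bytes, chunk_size: int, delimiter: int = ord(",")):
    """Simulate the nodes locally: one summary per contiguous chunk."""
    summaries = [
        summarize_chunk(data[base:base + chunk_size], base, delimiter)
        for base in range(0, len(data), chunk_size)
    ]
    offsets, starts_in_quotes = [], False
    for if_outside, if_inside, flips in summaries:
        # Only the one-bit `flips` flag has to cross node boundaries.
        offsets.extend(if_inside if starts_in_quotes else if_outside)
        starts_in_quotes ^= flips
    return offsets


data = b'a,"b,c",d,"e,f",g'
print(delimiter_offsets(data, chunk_size=4))  # [1, 7, 9, 15]
```

The commas at offsets 4 and 12 sit inside quotes and are dropped even though the chunks that contain them start mid-field. In a real distributed run the final loop would be an exclusive scan across nodes, so no node ever reads another node's bytes.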