[FEA] Increase reader throughput by pipelining IO and compute
-- this is a draft, please do not comment yet --
The end-to-end throughput of a file reader is limited by the sequential read speed of the underlying data source. We can use "pipelining" to overlap processing data on the device with reading data from the data source. Pipelining works by processing the data in chunks, so that the previous chunk can be processed while the next chunk is being read. Pipelined readers show higher end-to-end throughput when the time saved by overlapping reading and processing exceeds the overhead added by processing smaller chunks.
In cuIO, multibyte_split uses a pipelined design that reads text data in ~33 MB chunks (2^25 bytes) into a pinned host buffer, copies the data to the device, and then generates the offsets data. Here's a profile reading "Common Crawl" document data with cudf.read_text from a 410 MB file:
Note how the get_next_chunk function includes the OS read and the Memcpy HtoD, and how the Memcpy HtoD overlaps with the next OS read. Stream-ordered kernel launches also overlap with the next OS read. For each 10 ms OS read, there is 1.5 ms of overlapping copy/compute work, plus 0.2 ms of overhead between consecutive OS reads.
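To make the pattern concrete, here is a minimal CUDA C++ sketch of the same double-buffered idea: while chunk N is copied to the device and processed on a stream, chunk N+1 is read from the file on the host. The `process_on_device` stub, the plain `fread` data source, and the omitted error checking are illustrative stand-ins, not the actual multibyte_split implementation.

```cpp
#include <cuda_runtime.h>

#include <cstddef>
#include <cstdio>

constexpr std::size_t chunk_bytes = std::size_t{1} << 25;  // ~33 MB chunks

// Placeholder for the device-side work (e.g. offset generation); the real
// kernels would be enqueued on `stream`.
void process_on_device(char* /*d_data*/, std::size_t /*size*/, cudaStream_t /*stream*/)
{
  // kernel launches go here
}

void pipelined_read(char const* path)
{
  std::FILE* file = std::fopen(path, "rb");

  // Two pinned host staging buffers: the next OS read fills one buffer while
  // the other is still being copied to the device.
  char* h_buf[2];
  cudaHostAlloc(reinterpret_cast<void**>(&h_buf[0]), chunk_bytes, cudaHostAllocDefault);
  cudaHostAlloc(reinterpret_cast<void**>(&h_buf[1]), chunk_bytes, cudaHostAllocDefault);

  char* d_buf = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&d_buf), chunk_bytes);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Prime the pipeline with the first OS read.
  std::size_t bytes = std::fread(h_buf[0], 1, chunk_bytes, file);
  int cur           = 0;

  while (bytes > 0) {
    // Stage chunk N on the device and process it asynchronously on `stream`.
    cudaMemcpyAsync(d_buf, h_buf[cur], bytes, cudaMemcpyHostToDevice, stream);
    process_on_device(d_buf, bytes, stream);

    // Overlap: read chunk N+1 into the other pinned buffer on the host.
    std::size_t const next_bytes = std::fread(h_buf[cur ^ 1], 1, chunk_bytes, file);

    // Wait for chunk N before its host and device buffers are reused.
    cudaStreamSynchronize(stream);

    bytes = next_bytes;
    cur ^= 1;
  }

  cudaStreamDestroy(stream);
  cudaFree(d_buf);
  cudaFreeHost(h_buf[0]);
  cudaFreeHost(h_buf[1]);
  std::fclose(file);
}
```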
We can apply pipelining to the Parquet reader as well. Parquet reading includes several major steps: raw IO, header decoding, decompression, and data decoding. The runtime of each step varies based on the properties of the data in the file, including the data types, encoding efficiency, and compression efficiency. Furthermore, Parquet files have an internal row group and page structure that restricts how the file can be split. Here is an example profile reading the same "Common Crawl" data as above, but from a 240 MB Snappy-compressed Parquet file:
Note how 90 ms is spent in the OS read of the file and ~20 ms is spent processing, with decompression taking most (11.5 ms) of the processing time. Also note the GPU utilization data during the read_parquet function, with zero GPU utilization during the copy, followed by good SM utilization and moderate warp utilization during the compute.
We've completed prototyping work in #12358, experimenting with several approaches for pipelining the Parquet reader. Here are some performance analysis ideas for the next time we tackle this feature:
- Curate a library of real-world (not generated) data files and use it to evaluate the performance of each pipelining approach (a minimal timing harness is sketched after this list)
- Analyze the copying, decompression, and decoding times across the curated library, and track which files show the biggest benefit from pipelining
- Consider setting a floor (such as 200 MB of compressed data) before pipelining kicks in, to make sure we aren't accruing too much overhead
- Evaluate network-attached storage in addition to local NVMe data sources
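As a starting point for the first two items, here is a minimal timing harness using libcudf's public cudf::io::read_parquet API. The hard-coded file list, the ~200 MB floor, and the cudaDeviceSynchronize before stopping the clock are illustrative assumptions, not part of this proposal.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table.hpp>

#include <cuda_runtime.h>

#include <chrono>
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

int main()
{
  // Stand-in for the curated library of real-world files.
  std::vector<std::string> const files = {"common_crawl_snappy.parquet"};

  // Example floor (~200 MB) below which pipelining might be skipped.
  std::size_t const pipeline_floor_bytes = std::size_t{200} << 20;

  for (auto const& path : files) {
    auto const file_bytes = std::filesystem::file_size(path);

    auto const opts =
      cudf::io::parquet_reader_options::builder(cudf::io::source_info{path}).build();

    auto const start  = std::chrono::steady_clock::now();
    auto const result = cudf::io::read_parquet(opts);
    cudaDeviceSynchronize();  // ensure device work has finished before stopping the clock
    auto const stop = std::chrono::steady_clock::now();

    double const seconds = std::chrono::duration<double>(stop - start).count();
    std::cout << path << ": " << result.tbl->num_rows() << " rows, "
              << (file_bytes / 1.0e6) / seconds << " MB/s"
              << (file_bytes < pipeline_floor_bytes ? " (below example floor)" : "") << "\n";
  }
  return 0;
}
```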
As for pipelining approaches, here are some options to consider:
| Stream usage | Chunking pattern | Notes |
|---|---|---|
| entire read per stream | row group | tbd |
| decompression stream and decoding stream | row group | tbd |
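For the second row of the table, here is a hypothetical sketch of the event choreography that would let decompression of row group N+1 overlap with decoding of row group N on separate streams; decompress_row_group and decode_row_group are placeholders for the real cuIO kernels, not existing functions.

```cpp
#include <cuda_runtime.h>

#include <vector>

// Placeholders: the real cuIO decompression and page-decode kernels would be
// enqueued on the given stream.
void decompress_row_group(int /*row_group*/, cudaStream_t /*stream*/) {}
void decode_row_group(int /*row_group*/, cudaStream_t /*stream*/) {}

void pipelined_parquet_decode(int num_row_groups)
{
  cudaStream_t decomp_stream, decode_stream;
  cudaStreamCreate(&decomp_stream);
  cudaStreamCreate(&decode_stream);

  std::vector<cudaEvent_t> decompressed(num_row_groups);
  for (auto& e : decompressed) {
    cudaEventCreateWithFlags(&e, cudaEventDisableTiming);
  }

  for (int rg = 0; rg < num_row_groups; ++rg) {
    // Decompress row group rg on its own stream and record completion.
    decompress_row_group(rg, decomp_stream);
    cudaEventRecord(decompressed[rg], decomp_stream);

    // Decoding of row group rg waits only on its own decompression, so
    // decompression of row group rg+1 can start while rg is still decoding.
    cudaStreamWaitEvent(decode_stream, decompressed[rg], 0);
    decode_row_group(rg, decode_stream);
  }

  cudaStreamSynchronize(decode_stream);
  for (auto& e : decompressed) { cudaEventDestroy(e); }
  cudaStreamDestroy(decomp_stream);
  cudaStreamDestroy(decode_stream);
}
```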
-- this is a draft, please do not comment yet --