Parallel fetching of column chunks when reading parquet files
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
See https://github.com/apache/arrow-rs/issues/2110 for the background.
Some object stores are able to efficiently parallelize reads so it would be great to take advantage of that in DataFusion.
https://github.com/apache/arrow-rs/pull/2115 should add the plumbing in arrow-rs
to allow DataFusion to decide how to best fetch multiple byte ranges (in this case, column chunks from a parquet file).
**Describe the solution you'd like**
I think there are two possible solutions:
1. Make `ParquetFileReader` parallelize range requests by default, perhaps with some configurable options for max parallelism, etc.
2. Push this back to the `ObjectStore` crate: add a `get_ranges(&self, location: &Path, ranges: &[Range<usize>])` method and let each object store implementation decide the most efficient way to fetch multiple byte ranges.
I think 2 is the better solution, but it involves changing a public trait in another crate (for the moment at least), which makes things logistically more complicated.
**Describe alternatives you've considered**
We could do nothing and keep fetching ranges sequentially.
**Additional context**