Improve performance of extracting statistics from parquet files
Is your feature request related to a problem or challenge?
Part of https://github.com/apache/datafusion/issues/10453
@Lordworms added a benchmark for extracting statistics from parquet files in https://github.com/apache/datafusion/pull/10610
Since this code is used to extract statistics from parquet files, we would like to make sure it is efficient, especially if we are going to extract statistics for many files at once.
The idea here is to improve the speed of the statistics extraction
Describe the solution you'd like
Make this benchmark run faster:
cargo bench --bench parquet_statistic
Describe alternatives you've considered
I did some brief profiling:
I think the key would be to change these loops so they build the required Arrow arrays directly from primitive values rather than from ScalarValue:
https://github.com/apache/datafusion/blob/1bf7112171fd820c101e325822dc4d44dd65b2ff/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L183-L189
Additional context
No response
I was thinking about this last night -- what I would suggest is to make a function like this for each ValueStatistics type (https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html):
/// Returns an iterator over min values stored in `ValueStatistics<i32>`
fn extract_i32_mins<'a>(stats: impl IntoIterator<Item = &'a Statistics>) -> impl Iterator<Item = Option<i32>> {
...
}
Then, with those iterators, we can build the arrays directly, something like:
let int32_array_mins = Int32Array::from_iter(extract_i32_mins(stats));
I think that would be both simple and fast.