
Ballista should serialize Parquet statistics

Open • andygrove opened this issue 3 years ago • 1 comment

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When the Ballista scheduler or executor deserializes a ParquetExec, it collects the statistics again, which is redundant. We should serialize the statistics to avoid this extra work.

Describe the solution you'd like
Add Parquet statistics to the serde module.

Describe alternatives you've considered
N/A

Additional context
N/A

andygrove, Aug 13 '21 02:08

In apache/arrow-datafusion#962 I am considering making the statistics part of the ExecutionPlan trait (and removing them from TableProvider). But I think that not all nodes will have a cached version of the statistics, only those nodes for which fetching them is expensive and that know that they will not change.

We will probably not need the statistics on the executor, because I doubt that any re-optimization will take place there. So it might be an optimization further down the road to optionally leave them out of the serialization in that case.

rdettai, Aug 31 '21 15:08
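
To make the two ideas in this thread concrete (carrying statistics along with a serialized Parquet scan, and optionally leaving them out when the receiver will not re-optimize), here is a minimal sketch. Everything in it is an assumption made for illustration: `Statistics` and `ParquetScanNode` are simplified stand-ins, not the DataFusion or Ballista types, and the hand-rolled byte layout only stands in for whatever format the serde module actually uses.

```rust
/// Hypothetical, simplified stand-in for a statistics type.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Statistics {
    pub num_rows: Option<u64>,
    pub total_byte_size: Option<u64>,
}

/// Hypothetical serialized form of a Parquet scan node.
#[derive(Debug, Clone, PartialEq)]
pub struct ParquetScanNode {
    pub path: String,
    /// `None` means the receiver must collect statistics itself if it needs them.
    pub statistics: Option<Statistics>,
}

// Write a presence tag (0 or 1) followed by the little-endian value, if any.
fn put_opt_u64(buf: &mut Vec<u8>, v: Option<u64>) {
    match v {
        Some(x) => {
            buf.push(1);
            buf.extend_from_slice(&x.to_le_bytes());
        }
        None => buf.push(0),
    }
}

// Read a value written by `put_opt_u64`; the outer `None` signals malformed input.
fn take_opt_u64(buf: &[u8], pos: &mut usize) -> Option<Option<u64>> {
    let tag = *buf.get(*pos)?;
    *pos += 1;
    match tag {
        0 => Some(None),
        1 => {
            let bytes = buf.get(*pos..*pos + 8)?;
            let mut arr = [0u8; 8];
            arr.copy_from_slice(bytes);
            *pos += 8;
            Some(Some(u64::from_le_bytes(arr)))
        }
        _ => None,
    }
}

/// Encode the node. With `include_stats == false` the statistics are dropped,
/// e.g. when sending a plan to an executor that will not re-optimize it.
pub fn encode(node: &ParquetScanNode, include_stats: bool) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&(node.path.len() as u32).to_le_bytes());
    buf.extend_from_slice(node.path.as_bytes());
    match (&node.statistics, include_stats) {
        (Some(s), true) => {
            buf.push(1);
            put_opt_u64(&mut buf, s.num_rows);
            put_opt_u64(&mut buf, s.total_byte_size);
        }
        _ => buf.push(0),
    }
    buf
}

/// Decode a node; returns `None` on truncated or malformed input.
pub fn decode(buf: &[u8]) -> Option<ParquetScanNode> {
    let mut len_arr = [0u8; 4];
    len_arr.copy_from_slice(buf.get(0..4)?);
    let len = u32::from_le_bytes(len_arr) as usize;
    let mut pos = 4usize;
    let path = String::from_utf8(buf.get(pos..pos + len)?.to_vec()).ok()?;
    pos += len;
    let tag = *buf.get(pos)?;
    pos += 1;
    let statistics = match tag {
        0 => None,
        1 => Some(Statistics {
            num_rows: take_opt_u64(buf, &mut pos)?,
            total_byte_size: take_opt_u64(buf, &mut pos)?,
        }),
        _ => return None,
    };
    Some(ParquetScanNode { path, statistics })
}
```

In this sketch, a receiver that decodes a node with `statistics: Some(..)` can reuse the cached values instead of reading the Parquet metadata again, while `encode(&node, false)` produces the smaller payload suggested for the executor case.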