
Support `DictionaryArray` Parquet Data Page Statistics

Open alamb opened this issue 1 year ago • 1 comments

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/10922

We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into ArrayRefs -- which will make it significantly easier to use this information for pruning and other tasks.

Describe the solution you'd like

Add support to StatisticsConverter::min_page_statistics and StatisticsConverter::max_page_statistics for `DictionaryArray` (the type named in the title)

https://github.com/apache/datafusion/blob/a923c659cf932f6369f2d5257e5b99128b67091a/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L637-L656

Describe alternatives you've considered

You can follow the model from @Weijun-H in https://github.com/apache/datafusion/pull/10931

  1. Update the test for the listed data types (I think it is test_binary) following the model of test_int64 https://github.com/apache/datafusion/blob/a923c659cf932f6369f2d5257e5b99128b67091a/datafusion/core/tests/parquet/arrow_statistics.rs#L506-L529

  2. Add any required implementation in https://github.com/apache/datafusion/blob/2f4347647172f6997448b2e24d322b50c856f3a0/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L575-L586 (follow the model of the row counts, https://github.com/apache/datafusion/blob/2f4347647172f6997448b2e24d322b50c856f3a0/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L90)
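One way to think about step 2 for dictionary columns: the min/max values stored in Parquet statistics describe the underlying values, not the dictionary keys, so a converter could plausibly look through the dictionary type and reuse the logic for its value type. The sketch below only illustrates that idea. It uses a toy `DataType` enum rather than arrow's real type, and `statistics_type` is a hypothetical helper name, not a function in DataFusion.

```rust
// Toy stand-in for arrow's DataType (an assumption for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    // (key type, value type), mirroring the shape of a dictionary-encoded type.
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Hypothetical helper: returns the type whose min/max statistics would be
/// extracted. Dictionary types are unwrapped (recursively) to their value
/// type; every other type is returned unchanged.
fn statistics_type(data_type: &DataType) -> &DataType {
    match data_type {
        DataType::Dictionary(_key_type, value_type) => statistics_type(value_type),
        other => other,
    }
}
```

With this shape, a dictionary column with `Int32` keys and `Utf8` values would be handled by whatever code path already handles `Utf8`, and the result could then be compared against the expected arrays in the test from step 1.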

Additional context

No response

alamb avatar Jun 30 '24 14:06 alamb

take

dharanad avatar Jun 30 '24 14:06 dharanad

@dharanad - I have a bit of time today and could pick up the data page stats for this one and/or the FixedSizeByteArray stats to unblock the remaining tasks in this epic but wouldn't bother if you're actively working on them.

efredine avatar Jul 01 '24 14:07 efredine

Hello @efredine, sure thing. I will unassign myself from this issue so you can pick it up. I will continue my work on FixedSizeByteArray.

dharanad avatar Jul 01 '24 14:07 dharanad

take

efredine avatar Jul 01 '24 14:07 efredine