Support `Time` Parquet Data Page Statistics

Open alamb opened this issue 1 year ago • 6 comments

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/10922

We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into ArrayRefs -- which will make it significantly easier to use this information for pruning and other tasks.
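
To illustrate why per-page min/max values help with pruning, here is a minimal sketch in plain Rust. It is not the DataFusion API: `PageStats` and `pages_to_scan_gt` are hypothetical names, and each page's statistics are modeled as optional integers rather than Arrow arrays.

```rust
/// Hypothetical per-page statistics for one column.
/// `None` means the page index recorded no value for that page.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PageStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// For a predicate `col > threshold`, returns one flag per page:
/// `true` if the page may contain a matching row and must be read,
/// `false` if its max proves that no row can match.
/// Pages without statistics are always kept, since nothing can be proven.
pub fn pages_to_scan_gt(pages: &[PageStats], threshold: i64) -> Vec<bool> {
    pages
        .iter()
        .map(|p| match p.max {
            Some(max) => max > threshold,
            None => true,
        })
        .collect()
}
```

With one value per page held in a single column, a check like this becomes a vectorized comparison over the statistics instead of a walk through Parquet metadata structures.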

Describe the solution you'd like

Add support to StatisticsConverter::min_page_statistics and StatisticsConverter::max_page_statistics for the `Time` types named in the title

https://github.com/apache/datafusion/blob/a923c659cf932f6369f2d5257e5b99128b67091a/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L637-L656
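
As a rough sketch of the shape of this work, the code below turns per-page minimum and maximum values into one nullable column per statistic. It relies only on the facts that Parquet stores millisecond times as 32-bit integers and microsecond times as 64-bit integers. Every name here (`PageIndexEntry`, `TimeColumn`, `min_pages_time32_millis`, `max_pages_time64_micros`) is hypothetical and stands in for the real Parquet and Arrow types; it is not the implementation in statistics.rs.

```rust
/// Simplified stand-in for one page's entry in a Parquet column index.
#[derive(Debug, Clone, Copy)]
pub struct PageIndexEntry<T> {
    pub min: Option<T>,
    pub max: Option<T>,
}

/// Simplified stand-in for an Arrow time array: one optional value per page.
#[derive(Debug, PartialEq)]
pub enum TimeColumn {
    Time32Millisecond(Vec<Option<i32>>),
    Time64Microsecond(Vec<Option<i64>>),
}

/// Collects the per-page minimums of an INT32-backed time column.
/// A page with no recorded minimum becomes a null (`None`) entry.
pub fn min_pages_time32_millis(pages: &[PageIndexEntry<i32>]) -> TimeColumn {
    TimeColumn::Time32Millisecond(pages.iter().map(|p| p.min).collect())
}

/// Collects the per-page maximums of an INT64-backed time column.
pub fn max_pages_time64_micros(pages: &[PageIndexEntry<i64>]) -> TimeColumn {
    TimeColumn::Time64Microsecond(pages.iter().map(|p| p.max).collect())
}
```

The output always has exactly one entry per data page, so missing statistics show up as nulls rather than shortening the column.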

Describe alternatives you've considered

You can follow the model from @Weijun-H in https://github.com/apache/datafusion/pull/10931

  1. Update the test for the listed data types to be Check::Both, following the model of test_int64 https://github.com/apache/datafusion/blob/a923c659cf932f6369f2d5257e5b99128b67091a/datafusion/core/tests/parquet/arrow_statistics.rs#L506-L529
  2. Add any required implementation in get_datapage_statistics: https://github.com/apache/datafusion/blob/459afbb3a180d31e7cdefffb46f033069aa47408/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L624 (follow the model of the row counts, https://github.com/apache/datafusion/blob/2f4347647172f6997448b2e24d322b50c856f3a0/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L90)

Typically the change to the test looks like

- check: Check::RowGroup, 
+ check: Check::Both, 
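
The effect of that one-line change can be modeled with a small enum. This is a simplified sketch and not a copy of the test harness: the `DataPage` variant and the `row_group` / `data_page` helper names are assumptions made for illustration.

```rust
/// Which level of Parquet statistics a test verifies (simplified model).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Check {
    /// Verify only row-group-level statistics.
    RowGroup,
    /// Verify only data-page-level statistics (assumed variant).
    DataPage,
    /// Verify both levels.
    Both,
}

impl Check {
    /// True if row group statistics should be verified.
    pub fn row_group(self) -> bool {
        matches!(self, Check::RowGroup | Check::Both)
    }

    /// True if data page statistics should be verified.
    pub fn data_page(self) -> bool {
        matches!(self, Check::DataPage | Check::Both)
    }
}
```

Under this model, switching a test from `Check::RowGroup` to `Check::Both` keeps the existing row group assertions and additionally exercises the data page path, which is what exposes any type that is not yet supported there.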

Additional context

No response

alamb avatar Jun 24 '24 21:06 alamb

Can I work on this issue? Kindly assign it to me!

myeunee avatar Jun 25 '24 04:06 myeunee

Hi @myeunee, you can just comment "take" and it will be automatically assigned to you.

MohamedAbdeen21 avatar Jun 25 '24 17:06 MohamedAbdeen21

> Can I work on this issue? Kindly assign it to me!

Also in general, feel free to work on any issue -- https://datafusion.apache.org/contributor-guide/index.html#finding-and-creating-issues-to-work-on 🚀

alamb avatar Jun 25 '24 23:06 alamb

It looks like this issue is one of the last needed to complete the data page statistics extraction feature 🤔

alamb avatar Jun 30 '24 14:06 alamb

@alamb Since this is the only thing pending, I took the liberty of quickly jumping in and raising a PR. @myeunee Please excuse me.

dharanad avatar Jun 30 '24 18:06 dharanad

Thank you @dharanad

alamb avatar Jun 30 '24 19:06 alamb