Proof-of-concept Parquet GEOMETRY logical type implementation
Rationale for this change
This is a continuation of https://github.com/apache/arrow/pull/43196
In https://github.com/apache/parquet-format/pull/240 a GEOMETRY logical type for Parquet is proposed with a proof-of-concept Java implementation ( https://github.com/apache/parquet-java/pull/1379 ). This is a PR to explore what an implementation would look like in C++.
What changes are included in this PR?
We are still in progress of completing all necessary changes to integrate geometry logical type support to the C++ implementation.
- [DONE] Adding geometry logical type
- [DONE] Adding geometry column statistics
- [DONE] Support reading/writing parquet files containing geometry columns
Are these changes tested?
The tests added only cover very basic use cases. Comprehensive tests will be added in future commits.
Are there any user-facing changes?
Yes! (And will eventually be documented)
Thanks for opening a pull request!
If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename the pull request title in the following format?
GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
In the case of PARQUET issues on JIRA the title also supports:
PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
See also:
Please ping me when ready for review again. Thanks!
Please ping me when ready for review again. Thanks!
@wgtmac I have added ColumnIndex and covering support. Please help review this PR when you have time, thank you.
I just finished reviewing it for the 2nd pass. Thanks for the great work!
My main concern is the difference with Java PoC, which generates min/max values in the statistics and page index as if the GEOMETRY column is a pure BYTE_ARRAY column. Otherwise we need to revise the spec to add a lot of exceptions for geometry type. WDYT?
I've changed the min/max statistics of geometry columns to be the WKB representation of lower-left and upper-right corners in the last commit according to https://github.com/apache/iceberg/pull/10981#discussion_r1766076050.