arrow icon indicating copy to clipboard operation
arrow copied to clipboard

Proof-of-concept Parquet GEOMETRY logical type implementation

Open Kontinuation opened this issue 1 year ago • 4 comments

Rationale for this change

This is a continuation of https://github.com/apache/arrow/pull/43196

In https://github.com/apache/parquet-format/pull/240 a GEOMETRY logical type for Parquet is proposed with a proof-of-concept Java implementation ( https://github.com/apache/parquet-java/pull/1379 ). This is a PR to explore what an implementation would look like in C++.

What changes are included in this PR?

We are still in progress of completing all necessary changes to integrate geometry logical type support to the C++ implementation.

  • [DONE] Adding geometry logical type
  • [DONE] Adding geometry column statistics
  • [DONE] Support reading/writing parquet files containing geometry columns

Are these changes tested?

The tests added only cover very basic use cases. Comprehensive tests will be added in future commits.

Are there any user-facing changes?

Yes! (And will eventually be documented)

Kontinuation avatar Sep 06 '24 02:09 Kontinuation

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar Sep 06 '24 02:09 github-actions[bot]

Please ping me when ready for review again. Thanks!

wgtmac avatar Sep 12 '24 02:09 wgtmac

Please ping me when ready for review again. Thanks!

@wgtmac I have added ColumnIndex and covering support. Please help review this PR when you have time, thank you.

Kontinuation avatar Sep 17 '24 00:09 Kontinuation

I just finished reviewing it for the 2nd pass. Thanks for the great work!

My main concern is the difference with Java PoC, which generates min/max values in the statistics and page index as if the GEOMETRY column is a pure BYTE_ARRAY column. Otherwise we need to revise the spec to add a lot of exceptions for geometry type. WDYT?

I've changed the min/max statistics of geometry columns to be the WKB representation of lower-left and upper-right corners in the last commit according to https://github.com/apache/iceberg/pull/10981#discussion_r1766076050.

Kontinuation avatar Sep 19 '24 13:09 Kontinuation