iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Parquet, Arrow: Refactor vectorized reader

Open wgtmac opened this issue 1 year ago • 9 comments

This patch is the preparation work of https://github.com/apache/iceberg/issues/7162 and it contains following changes:

  • Add VectorizedValuesReader interface for future extension.
  • Refactor VectorizedPageIterator to use VectorizedValuesReader.
  • Consolidate duplicate code in the VectorizedPageIterator.

wgtmac avatar Feb 22 '24 06:02 wgtmac

I plan to resolve https://github.com/apache/iceberg/issues/7162 by adding vectorized readers for all v2 encodings. This is the 1st patch. Would you mind taking a look? @rdblue @nastra @Fokko @jackye1995

wgtmac avatar Feb 22 '24 15:02 wgtmac

@wgtmac thanks so much for taking this issue on! I am just a user of iceberg (I have never contributed), but I was curious--is it also possible to include a resolution for https://github.com/apache/iceberg/issues/521 as an extension of this work? It would be a huge boon for iceberg performance since I often use complex/nested types, and I'm sure others do as well.

anthonysgro avatar Mar 11 '24 19:03 anthonysgro

@anthonysgro Yes, that is on my radar. I will work on it once v2 encodings of all primitive types have been supported.

wgtmac avatar Mar 12 '24 01:03 wgtmac

I plan to resolve https://github.com/apache/iceberg/issues/7162 by adding vectorized readers for all v2 encodings. This is the 1st patch.

@wgtmac do you by any chance have all other changes that are required for v2 encodings? I'd like to see and understand how the changes in this PR fit into the bigger picture (It's been a while since I touched/reviewed the vectorized code paths)

nastra avatar Mar 15 '24 14:03 nastra

Not yet. My rough plan is to do following things:

  1. add a new VectorizedValuesReader base class to supporting different encodings. This is similar to what spark does but reading into arrow field vector: :https://github.com/apache/spark/blob/b7aa9740249b50ad9db254626c530ff5bc33d385/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedValuesReader.java#L30
  2. extend VectorizedValuesReader to add v2 encodings one by one.
  3. support vectorized readers for nested types.

This patch is the step 1 above and only added vectorized reading interfaces for float/double/int32/int64 physical types. It already shows the big picture that how following steps will be done. More interfaces will be added progressively for reading other physical types and logical types into arrow field vectors in order to have a better review experience.

Does this make sense to you? @nastra

wgtmac avatar Mar 15 '24 15:03 wgtmac

Since this feature has been requested in the community but the PR is not reviewed for a long time. Do you think this is the right direction here? Or should I directly support reading arrow vectors from the parquet-java library? @nastra @aokolnychyi @Fokko @szehon-ho

wgtmac avatar Jul 24 '24 14:07 wgtmac

hey @wgtmac sorry for the long wait here but I don't have any spare cycles atm to review this.

nastra avatar Jul 24 '24 15:07 nastra