iceberg
iceberg copied to clipboard
Parquet, Arrow: Refactor vectorized reader
This patch is the preparation work of https://github.com/apache/iceberg/issues/7162 and it contains following changes:
- Add VectorizedValuesReader interface for future extension.
- Refactor VectorizedPageIterator to use VectorizedValuesReader.
- Consolidate duplicate code in the VectorizedPageIterator.
I plan to resolve https://github.com/apache/iceberg/issues/7162 by adding vectorized readers for all v2 encodings. This is the 1st patch. Would you mind taking a look? @rdblue @nastra @Fokko @jackye1995
@wgtmac thanks so much for taking this issue on! I am just a user of iceberg (I have never contributed), but I was curious--is it also possible to include a resolution for https://github.com/apache/iceberg/issues/521 as an extension of this work? It would be a huge boon for iceberg performance since I often use complex/nested types, and I'm sure others do as well.
@anthonysgro Yes, that is on my radar. I will work on it once v2 encodings of all primitive types have been supported.
I plan to resolve https://github.com/apache/iceberg/issues/7162 by adding vectorized readers for all v2 encodings. This is the 1st patch.
@wgtmac do you by any chance have all other changes that are required for v2 encodings? I'd like to see and understand how the changes in this PR fit into the bigger picture (It's been a while since I touched/reviewed the vectorized code paths)
Not yet. My rough plan is to do following things:
- add a new VectorizedValuesReader base class to supporting different encodings. This is similar to what spark does but reading into arrow field vector: :https://github.com/apache/spark/blob/b7aa9740249b50ad9db254626c530ff5bc33d385/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedValuesReader.java#L30
- extend VectorizedValuesReader to add v2 encodings one by one.
- support vectorized readers for nested types.
This patch is the step 1 above and only added vectorized reading interfaces for float/double/int32/int64 physical types. It already shows the big picture that how following steps will be done. More interfaces will be added progressively for reading other physical types and logical types into arrow field vectors in order to have a better review experience.
Does this make sense to you? @nastra
Since this feature has been requested in the community but the PR is not reviewed for a long time. Do you think this is the right direction here? Or should I directly support reading arrow vectors from the parquet-java library? @nastra @aokolnychyi @Fokko @szehon-ho
hey @wgtmac sorry for the long wait here but I don't have any spare cycles atm to review this.