[Bug]: Issue with Merging Parquet Files Without Field ID Leading to Misaligned Columns

Open wangmingjin163 opened this issue 6 months ago • 0 comments

What happened?

I ran into a problem with Parquet files in the Amoro project. When Parquet files are written from an Arrow schema that carries no field IDs, later file-merging operations go wrong: the columns in the merged files are misaligned, so projections return incorrect data.

(Screenshots attached: 2024-08-20 at 16 41 11, 2024-08-20 at 16 42 46)

Affects Versions

0.7.0

What table formats are you seeing the problem on?

Iceberg

What engines are you seeing the problem on?

Optimizer

How to reproduce

1. Create Parquet files using an Iceberg schema without including field IDs.
2. Attempt to merge these Parquet files using Iceberg's rewriteDataFiles method.
3. Observe that the columns in the merged files are misaligned.

Relevant log output

No response

Anything else

Proposed Solution: I added a check that applies a NameMapping when Parquet files are read. Fields are then mapped by name to their corresponding IDs, which prevents misalignment during merging.

The key change is calling withNameMapping(NameMappingParser.fromJson(nameMapping)) on the Parquet.ReadBuilder when opening Parquet files, so that the schema mapping is resolved correctly even when the files carry no field IDs.
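To illustrate why mapping by name helps, here is a minimal, self-contained Java sketch. It does not use Iceberg or Parquet classes: the class NameMappingSketch, its sample columns, and both read methods are hypothetical stand-ins. It assumes a reader that, when a file carries no field IDs, falls back to treating the n-th file column as field ID n, and compares that with resolving columns through a name-to-ID mapping.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Iceberg or Amoro code.
class NameMappingSketch {

    // Stand-in for a Parquet file written without field IDs:
    // column names in the writer's order, plus one row of values.
    static final List<String> FILE_COLUMNS = List.of("name", "id", "city");
    static final List<String> FILE_ROW = List.of("alice", "1", "paris");

    // Table schema: field ID -> column name, in table order.
    static Map<Integer, String> tableSchema() {
        Map<Integer, String> schema = new LinkedHashMap<>();
        schema.put(1, "id");
        schema.put(2, "name");
        schema.put(3, "city");
        return schema;
    }

    // Assumed fallback: field ID n is taken to be the n-th column in the file.
    static Map<String, String> readByPosition() {
        Map<String, String> row = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> field : tableSchema().entrySet()) {
            int index = field.getKey() - 1;
            row.put(field.getValue(), index < FILE_ROW.size() ? FILE_ROW.get(index) : null);
        }
        return row;
    }

    // Name mapping: each file column is first resolved to a field ID by name,
    // then the table schema is projected by ID.
    static Map<String, String> readByNameMapping(Map<String, Integer> nameMapping) {
        Map<Integer, String> valueById = new HashMap<>();
        for (int i = 0; i < FILE_COLUMNS.size(); i++) {
            Integer id = nameMapping.get(FILE_COLUMNS.get(i));
            if (id != null) {
                valueById.put(id, FILE_ROW.get(i));
            }
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> field : tableSchema().entrySet()) {
            row.put(field.getValue(), valueById.get(field.getKey()));
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Integer> nameMapping = Map.of("id", 1, "name", 2, "city", 3);
        System.out.println("positional:  " + readByPosition());
        System.out.println("name-mapped: " + readByNameMapping(nameMapping));
    }
}
```

With this sample data, the positional read places "alice" under id and "1" under name, while the name-mapped read returns each value under its own column. In the actual fix, this role is played by the mapping passed to withNameMapping on the Parquet.ReadBuilder.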

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

wangmingjin163 · Aug 20 '24 08:08