Add files support for parquet field_ids

Open MrDerecho opened this issue 6 months ago • 4 comments

Feature Request / Improvement

Would it be possible to support Parquet field IDs in the add_files method? Parquet field IDs are required for backwards compatibility with Spark-based tools, and files created by those systems already have field IDs present in the Parquet files. Supporting them would let pyiceberg handle mass migrations of Spark-generated Iceberg objects without SerDe (reprocessing the Parquet files).

MrDerecho avatar Jun 20 '25 21:06 MrDerecho

allow for parquet field_id support in the add_files method?

This is already done automatically based on the table metadata. https://github.com/apache/iceberg-python/blob/89e71c36f26d1f3da48090ddfa137a698e2a06fc/pyiceberg/table/__init__.py#L855-L858

You can also specify your own name mapping by updating the table properties.
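As a rough illustration of what such a name mapping looks like, the sketch below builds the JSON value for the `schema.name-mapping.default` table property described in the Iceberg spec. The column names and field IDs are hypothetical, and `build_name_mapping` is a helper invented for this example, not part of pyiceberg.

```python
import json

# Hypothetical columns and their Iceberg field IDs, for illustration only.
columns = {"id": 1, "name": 2, "created_at": 3}


def build_name_mapping(columns):
    """Serialize a flat name mapping: one entry per column name and field ID."""
    return json.dumps(
        [{"field-id": field_id, "names": [name]} for name, field_id in columns.items()]
    )


# The property key comes from the Iceberg spec; the value is a JSON string.
properties = {"schema.name-mapping.default": build_name_mapping(columns)}
print(properties["schema.name-mapping.default"])
```

The resulting dictionary would then be applied through pyiceberg's table-properties update (for example inside a `table.transaction()`); the exact call should be checked against the pyiceberg version in use.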

kevinjqliu avatar Jun 22 '25 18:06 kevinjqliu

If you try to add Parquet files that already have field IDs, you get this error.

Erigara avatar Jun 25 '25 10:06 Erigara

I think @MrDerecho wants a feature added so that add_files() ignores field IDs that are present in existing Parquet files.

ForeverAngry avatar Jun 28 '25 00:06 ForeverAngry

I was also affected by this constraint, so I created a PR to relax it and only fail if the field IDs from the file are not compatible with the table's field IDs.
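To make the idea of a relaxed check concrete, here is a minimal sketch of one possible compatibility rule. It is not the logic from the PR: the function name `find_field_id_conflicts`, the plain-dictionary inputs, and matching columns by name are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def find_field_id_conflicts(
    file_field_ids: Dict[str, Optional[int]],
    table_field_ids: Dict[str, int],
) -> List[str]:
    """Return one message per column whose embedded field ID disagrees with the table."""
    conflicts = []
    for name, file_id in file_field_ids.items():
        if file_id is None:
            # No field ID embedded in the file for this column; nothing to conflict with.
            continue
        expected = table_field_ids.get(name)
        if expected is None:
            conflicts.append(f"{name}: not in table schema (file id {file_id})")
        elif expected != file_id:
            conflicts.append(f"{name}: file id {file_id} != table id {expected}")
    return conflicts
```

Under this rule, a file whose embedded IDs match the table (or that carries no IDs at all) yields an empty list, and only a genuine mismatch would cause a failure.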

jeroko avatar Oct 27 '25 15:10 jeroko