iceberg-rust

Add files to add existing Parquet files to a table

Open ZENOTME opened this issue 10 months ago • 6 comments

In #345, we added support for writing new data files and appending them to the table. But we don't yet support appending existing data files, which requires reading those files and generating the corresponding DataFile metadata.

ZENOTME avatar Feb 01 '25 10:02 ZENOTME

I would like to try working on this.

jonathanc-n avatar Feb 02 '25 22:02 jonathanc-n

Thanks @jonathanc-n! Feel free to send the PR for this.

ZENOTME avatar Feb 05 '25 05:02 ZENOTME

@ZENOTME When appending existing data files, should the system load file metadata by reading the current snapshot's manifest lists from an existing Iceberg table, or would you prefer to specify a file path from which the system scans and infers metadata? Depending on the answer, I'm planning to perform a TableScan and have it add the resulting DataFiles with add_data_file.

jonathanc-n avatar Feb 05 '25 21:02 jonathanc-n

Hi @jonathanc-n, I think we can refer to the pyiceberg implementation: https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L669C9-L669C18.

Regarding "should the system load file metadata by reading the current snapshot's manifest lists from an existing Iceberg table, or would you prefer to specify a file path from which the system scans and infers metadata?":

I think the user will add files through the transaction API, so we will already know which table they are being appended to, along with its related metadata.
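
To make the idea concrete, here is a rough, self-contained sketch of that flow. Every type and method name in it (`DataFileMeta`, `Table`, `Transaction`, `add_data_files`, `commit`, `AddFilesError`) is hypothetical and only models the shape of the idea; none of it is the actual iceberg-rust API. The duplicate-path check is also an assumption of this sketch, not something decided in this issue.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the metadata tracked per data file.
#[derive(Debug, Clone, PartialEq)]
pub struct DataFileMeta {
    pub file_path: String,
    pub record_count: u64,
    pub file_size_in_bytes: u64,
}

/// Hypothetical in-memory table: only the data files of its current snapshot.
#[derive(Debug, Default)]
pub struct Table {
    pub data_files: Vec<DataFileMeta>,
}

#[derive(Debug, PartialEq)]
pub enum AddFilesError {
    /// The path is already referenced by the table or by this transaction.
    DuplicateFile(String),
}

/// Hypothetical transaction bound to one table, so the target table and its
/// existing metadata are known at the moment files are added.
pub struct Transaction<'a> {
    table: &'a mut Table,
    pending: Vec<DataFileMeta>,
}

impl<'a> Transaction<'a> {
    pub fn new(table: &'a mut Table) -> Self {
        Self { table, pending: Vec::new() }
    }

    /// Stage files that were already written elsewhere.
    /// Nothing is staged if any path in the batch is a duplicate.
    pub fn add_data_files(&mut self, files: Vec<DataFileMeta>) -> Result<(), AddFilesError> {
        // Collect every path already known to the table or staged earlier.
        let mut seen: HashSet<String> = self
            .table
            .data_files
            .iter()
            .chain(self.pending.iter())
            .map(|f| f.file_path.clone())
            .collect();

        // `insert` returns false when the path was already present.
        for file in &files {
            if !seen.insert(file.file_path.clone()) {
                return Err(AddFilesError::DuplicateFile(file.file_path.clone()));
            }
        }

        self.pending.extend(files);
        Ok(())
    }

    /// Append the staged files to the table and return how many were added.
    pub fn commit(self) -> usize {
        let Transaction { table, pending } = self;
        let added = pending.len();
        table.data_files.extend(pending);
        added
    }
}
```

In this sketch the caller never names the table when adding files: the transaction already holds it, which is why no separate path scan is needed to find the table's metadata. How the per-file statistics would be read from the Parquet footer is left out here.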

ZENOTME avatar Feb 06 '25 06:02 ZENOTME

@liurenjie1024 @jonathanc-n should this be closed now that https://github.com/apache/iceberg-rust/pull/960 is in?

mkarbo avatar Mar 10 '25 12:03 mkarbo

I don't believe so; there are a bunch of follow-up PRs that should be done before this is closed.

jonathanc-n avatar Mar 10 '25 17:03 jonathanc-n

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Sep 13 '25 00:09 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Sep 28 '25 00:09 github-actions[bot]