Prevent `add_files` from adding a file that's already referenced by the Iceberg Table

Open sungwy opened this issue 1 year ago • 3 comments

Feature Request / Improvement

Currently `add_files` has no check to prevent adding a file that's already referenced by the Iceberg Table.

We should add checks that prevent an already referenced data file from being added again as a new manifest entry.

We could do this by running the following two checks before the file addition:

  1. First check that the list of file_paths is unique
  2. Check that all the files in the file_paths aren't referenced by any of the manifests in the current snapshot of the Iceberg Table.
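
As a rough illustration of the two checks above, here is a minimal sketch in plain Python. The function name `validate_new_file_paths` and the `referenced_paths` argument are hypothetical, not part of pyiceberg; the sketch assumes the paths referenced by the current snapshot's manifests have already been collected into an iterable of strings, and it leaves open how that collection is done.

```python
from collections import Counter
from typing import Iterable, List


def validate_new_file_paths(file_paths: List[str], referenced_paths: Iterable[str]) -> None:
    """Hypothetical helper: raise ValueError if either check fails.

    `referenced_paths` is assumed to hold the data file paths already
    referenced by the table's current snapshot.
    """
    # Check 1: the list of file_paths itself must be unique.
    duplicates = sorted(path for path, count in Counter(file_paths).items() if count > 1)
    if duplicates:
        raise ValueError(f"File paths must be unique, duplicates: {duplicates}")

    # Check 2: none of the new paths may already be referenced by the table.
    already_referenced = sorted(set(file_paths) & set(referenced_paths))
    if already_referenced:
        raise ValueError(f"Files already referenced by table: {already_referenced}")
```

Running both checks before any manifest entry is written means a failed validation leaves the table untouched. Comparing paths as exact strings is a simplification; two different strings that point to the same object would not be caught by this sketch.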

sungwy avatar Aug 05 '24 16:08 sungwy

Hey, I'm new to pyiceberg but would love to take a crack at this.

amitgilad3 avatar Aug 05 '24 18:08 amitgilad3

Hi @amitgilad3 sounds great! I'll get this assigned to you. Please let me know if you'd like some pointers :)

sungwy avatar Aug 05 '24 21:08 sungwy

Hey @sungwy - just created my first PR, #1036. I would really appreciate your review and any suggestions, especially if I chose the wrong place to implement my checks.

amitgilad3 avatar Aug 10 '24 14:08 amitgilad3