iceberg-python

Support data files compaction

Open sungwy opened this issue 1 year ago • 2 comments

Introduce an API to compact data files. The first version of the API will do the following:

  • take a predicate expression as an input parameter and use it to find the matching data files, which will be rewritten
  • group those data files by partition and rewrite them using the same bin-packing constraints as the writer
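
As a rough illustration of the two steps above, here is a minimal planning sketch in plain Python. It is not PyIceberg's API: `DataFile`, `plan_compaction`, the `matches` callable standing in for the predicate expression, and the `target_size_bytes` parameter are all hypothetical names chosen for this example, and first-fit-decreasing is just one simple bin-packing strategy, not necessarily the one the writer uses.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file entry (not a PyIceberg class)."""
    path: str
    partition: Tuple
    size_bytes: int


def plan_compaction(
    files: Iterable[DataFile],
    matches: Callable[[DataFile], bool],
    target_size_bytes: int,
) -> Dict[Tuple, List[List[DataFile]]]:
    """Select files with `matches`, group them by partition, and pack each
    group into bins whose total size does not exceed `target_size_bytes`.
    Each bin represents the inputs for one rewritten output file."""
    by_partition: Dict[Tuple, List[DataFile]] = defaultdict(list)
    for f in files:
        if matches(f):  # assumed: the predicate is evaluated per file
            by_partition[f.partition].append(f)

    plan: Dict[Tuple, List[List[DataFile]]] = {}
    for partition, group in by_partition.items():
        bins: List[List[DataFile]] = []
        # First-fit decreasing: place the largest files first.
        for f in sorted(group, key=lambda x: x.size_bytes, reverse=True):
            for b in bins:
                if sum(x.size_bytes for x in b) + f.size_bytes <= target_size_bytes:
                    b.append(f)
                    break
            else:
                # No existing bin has room, so start a new one.
                bins.append([f])
        plan[partition] = bins
    return plan


files = [
    DataFile("a", ("2024-01",), 60),
    DataFile("b", ("2024-01",), 50),
    DataFile("c", ("2024-01",), 40),
    DataFile("d", ("2024-02",), 30),
    DataFile("e", ("2024-02",), 10),
]
# Only compact files in the 2024-01 partition.
plan = plan_compaction(files, lambda f: f.partition == ("2024-01",), 100)
```

A real implementation would also have to commit the rewrite as a single snapshot that replaces the old files with the new ones, which this sketch leaves out.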

sungwy avatar Aug 22 '24 18:08 sungwy

Unassigning to work on other near-term priorities

sungwy avatar Sep 24 '24 17:09 sungwy

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Mar 24 '25 00:03 github-actions[bot]

Is there any way to trigger compaction? The literature says that it's optimal to compact delete files back into data files to improve read performance, and AFAICT there's no way to do this in PyIceberg.

Incidentally, is there a way to control whether your catalog uses copy-on-write vs. merge-on-read?

zbs avatar Jun 01 '25 05:06 zbs

@sungwy

My task https://github.com/apache/iceberg-python/issues/1931#issuecomment-3002159502 depends on the DeleteFileIndex, so I am not going to work on it until the DeleteFileIndex task is complete.

In the meantime, I am wondering if I can take this task, if you haven't started working on it. Thanks!

yingjianwu98 avatar Jun 26 '25 21:06 yingjianwu98