Support data files compaction
Introduce an API to compact data files. The first version of the API will do the following (a rough sketch follows the list):
- take a predicate expression as an input parameter to find the data files matching the filter that will be rewritten
- group data files by partition and rewrite them using the same bin-packing constraints as the writer
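For illustration only, here is a minimal sketch of that flow built on PyIceberg's existing scan API. The catalog and table names, the predicate, the target size, and the hand-rolled bin-packing loop are all assumptions made for the example; the final rewrite-and-commit step is omitted because it is exactly the piece this issue proposes to add.

```python
from collections import defaultdict

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

# Illustrative target size; the real value would come from the writer's
# bin-packing configuration.
TARGET_FILE_SIZE_BYTES = 512 * 1024 * 1024

catalog = load_catalog("default")          # hypothetical catalog name
table = catalog.load_table("db.events")    # hypothetical table

# 1. Find the data files matching the predicate.
tasks = table.scan(
    row_filter=GreaterThanOrEqual("event_ts", "2024-01-01")
).plan_files()

# 2. Group the matching files by partition.
files_by_partition = defaultdict(list)
for task in tasks:
    files_by_partition[str(task.file.partition)].append(task.file)

# 3. Bin-pack each partition's files into rewrite groups no larger than
#    the target size. The rewrite/commit of each group is not shown,
#    since there is no public API for it yet.
for partition, files in files_by_partition.items():
    groups, current, current_size = [], [], 0
    for data_file in sorted(files, key=lambda f: f.file_size_in_bytes):
        if current and current_size + data_file.file_size_in_bytes > TARGET_FILE_SIZE_BYTES:
            groups.append(current)
            current, current_size = [], 0
        current.append(data_file)
        current_size += data_file.file_size_in_bytes
    if current:
        groups.append(current)
    print(f"partition={partition}: {len(groups)} rewrite group(s)")
```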
Unassigning to work on other near-term priorities
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Is there any way to trigger compaction? The literature says that it's optimal to compact delete files back into data files to improve read performance, and AFAICT there's no way to do this in PyIceberg.
Incidentally, is there a way to control whether your catalog uses copy-on-write vs. merge-on-read?
@sungwy
My task https://github.com/apache/iceberg-python/issues/1931#issuecomment-3002159502 depends on the DeleteFileIndex, so I am not going to work on it until the DeleteFileIndex task is complete.
In the meantime, I was wondering if I could take this task, if you haven't started working on it? Thanks!