iceberg-python

Support metadata compaction

Open Fokko opened this issue 1 year ago • 12 comments

Feature Request / Improvement

Add support for compaction. This rewrites the existing manifests into a single one, reducing the number of calls to the object store. This should follow the Java configuration keys:

  • commit.manifest-merge.enabled: Controls whether to automatically merge manifests on writes.
  • commit.manifest.min-count-to-merge: Minimum number of manifests to accumulate before merging.
  • commit.manifest.target-size-bytes: Target size when merging manifest files.
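As a rough illustration of how these three keys could interact, here is a minimal sketch in plain Python. It does not use the pyiceberg API: `ManifestInfo`, `plan_merge`, and the greedy packing strategy are hypothetical, and the fallback values (enabled, 100 manifests, 8 MiB) are assumed to match the Java defaults. The real Java logic is more involved, so treat this as a simplification.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ManifestInfo:
    """Hypothetical stand-in for a manifest file: just a path and a size."""

    path: str
    length: int  # size in bytes


def plan_merge(manifests: List[ManifestInfo], properties: Dict[str, str]) -> List[List[ManifestInfo]]:
    """Group manifests into bins; a bin with several entries would be rewritten as one manifest."""
    # Assumed defaults, mirroring the Java table properties.
    enabled = properties.get("commit.manifest-merge.enabled", "true").lower() == "true"
    min_count = int(properties.get("commit.manifest.min-count-to-merge", "100"))
    target_size = int(properties.get("commit.manifest.target-size-bytes", str(8 * 1024 * 1024)))

    # Merging disabled, or too few manifests accumulated: keep every manifest as is.
    if not enabled or len(manifests) < min_count:
        return [[m] for m in manifests]

    # Greedy packing: start a new bin once adding a manifest would exceed the target size.
    bins: List[List[ManifestInfo]] = []
    current: List[ManifestInfo] = []
    current_size = 0
    for m in manifests:
        if current and current_size + m.length > target_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(m)
        current_size += m.length
    if current:
        bins.append(current)
    return bins


# Four 3 MB manifests, with the merge threshold lowered to 2 for the example.
manifests = [ManifestInfo(f"m{i}.avro", 3_000_000) for i in range(4)]
bins = plan_merge(manifests, {"commit.manifest.min-count-to-merge": "2"})
print([len(b) for b in bins])
```

With the assumed 8 MiB target, two 3 MB manifests fit in a bin but a third does not, so the four manifests end up in two bins of two.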

Fokko avatar Jan 16 '24 12:01 Fokko

I am interested in taking this if no one has started working on it.

HonahX avatar Jan 30 '24 17:01 HonahX

Based on an offline discussion with @Fokko, I will first focus on implementing MergeAppend, which supports these keys:

  • commit.manifest-merge.enabled
  • commit.manifest.min-count-to-merge
  • commit.manifest.target-size-bytes

MergeAppend will become the default append method, since commit.manifest-merge.enabled defaults to True. The PR for MergeAppend is https://github.com/apache/iceberg-python/pull/363

BTW, it seems the rewrite_manifest operation depends only on commit.manifest.target-size-bytes. Shall we update the description to reflect this?

HonahX avatar Feb 27 '24 07:02 HonahX

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Aug 26 '24 00:08 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Sep 09 '24 00:09 github-actions[bot]

Hey, I was wondering: if there are no blockers, can I try to implement rewrite manifests?

amitgilad3 avatar Feb 06 '25 16:02 amitgilad3

Sure thing, @amitgilad3. Based on the conversation above, it looks like some of the components are already implemented.

kevinjqliu avatar Feb 06 '25 22:02 kevinjqliu

Hi, I've recently been investigating support for rewrite manifests in iceberg-rust. The design of iceberg-rust basically follows iceberg-python, but since rewrite manifests is not yet supported in iceberg-python, I have to refer to the iceberg-java implementation. In iceberg-java, rewrite manifests is based on SnapshotProducer, and I find that the design of SnapshotProducer differs a little between iceberg-java and iceberg-python. In iceberg-python, SnapshotProducer is a more "fine-grained" abstraction: for example, it provides the summary implementation and the add_data_file interface. In iceberg-java, SnapshotProducer requires the child type to implement the summary. This means we can't directly implement rewrite manifests on top of the iceberg-python SnapshotProducer. Within the iceberg-python design, I can think of two ways to implement rewrite manifests:

  1. Don't base it on SnapshotProducer
  2. Change the SnapshotProducer design to be similar to Java's, and implement rewrite manifests on top of it

I'm interested in which design iceberg-python will choose, as a reference for iceberg-rust.

ZENOTME avatar Mar 11 '25 11:03 ZENOTME

Hi @ZENOTME, thanks for bringing this up. In pyiceberg, _SnapshotProducer defines the general structure of "things that are changed to produce a new snapshot."

The _DeleteFiles, _FastAppendFiles, and _OverwriteFiles follow this pattern. https://grep.app/search?f.repo=apache%2Ficeberg-python&q=%28_SnapshotProducer

I think we can implement metadata compaction by overriding the behaviors of https://github.com/apache/iceberg-python/blob/b86d7d5885c1f9feec86cbffcb818738e41cd6c1/pyiceberg/table/update/snapshot.py#L197-L199
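To make the "override a behavior of the base producer" idea concrete, here is a minimal, self-contained sketch of that pattern. None of it is the real pyiceberg code: `SnapshotProducerSketch`, `process_manifests`, and `RewriteManifestsSketch` are hypothetical names, manifests are modeled as plain strings, and the hook at the linked lines may look different. The "replace" operation name is likewise an assumption.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SnapshotProducerSketch(ABC):
    """Hypothetical base class: owns the commit flow and builds the summary itself."""

    operation: str = "append"

    def __init__(self, current_manifests: List[str]) -> None:
        self.current_manifests = current_manifests

    @abstractmethod
    def existing_manifests(self) -> List[str]:
        """Manifests carried over from the current snapshot."""

    def process_manifests(self, manifests: List[str]) -> List[str]:
        # Default hook: pass the manifests through unchanged.
        return manifests

    def summary(self, manifests: List[str]) -> Dict[str, str]:
        # The base class provides the summary, so subclasses do not have to.
        return {"operation": self.operation, "manifest-count": str(len(manifests))}

    def commit(self) -> Dict[str, object]:
        manifests = self.process_manifests(self.existing_manifests())
        return {"manifests": manifests, "summary": self.summary(manifests)}


class RewriteManifestsSketch(SnapshotProducerSketch):
    """Hypothetical compaction producer: only the manifest-processing hook changes."""

    operation = "replace"  # assumed operation name for a manifest rewrite

    def existing_manifests(self) -> List[str]:
        return list(self.current_manifests)

    def process_manifests(self, manifests: List[str]) -> List[str]:
        # Collapse all manifests into one pretend merged manifest.
        if len(manifests) <= 1:
            return manifests
        return ["merged(" + ",".join(manifests) + ")"]


result = RewriteManifestsSketch(["m1.avro", "m2.avro", "m3.avro"]).commit()
print(result["summary"])
```

The point of the sketch is that the commit flow and summary stay in the base class, and the compaction producer only swaps out how the manifest list is processed.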

kevinjqliu avatar Mar 11 '25 17:03 kevinjqliu

Looks like @amitgilad3 has already started a PR for Rewrite manifests in #1661

kevinjqliu avatar Mar 11 '25 17:03 kevinjqliu

Thanks @kevinjqliu! It's a good reference.

ZENOTME avatar Mar 11 '25 17:03 ZENOTME

Feel free to help review the PR :) I haven't gotten to it yet.

kevinjqliu avatar Mar 11 '25 18:03 kevinjqliu

Are there any updates here?

zschumacher avatar Jun 05 '25 19:06 zschumacher