
[Improvement]: Manifest-driven and partition-aware data expiration

Open xxubai opened this issue 4 months ago • 0 comments

Search before asking

  • [x] I have searched in the issues and found no similar issues.

What would you like to be improved?

The previous implementation performed a global scan on partitioned tables, which often caused out-of-memory (OOM) errors on large Iceberg tables. The main reasons are:

  1. High memory consumption when a table contains a large number of files;
  2. Loading too many column statistics, especially from delete files;
  3. Lack of filtering on the tables that actually need to be processed.

How should we improve?

We propose a manifest-based, partition-aware data expiration approach:

  • Identify candidate manifest files from their partition bounds, and expire the files in them that no longer meet the retention conditions;
  • Iterate through the candidate manifest files sequentially to collect partition-level and file-level information;
  • Perform expiration partition by partition, so that cleanup tasks can be submitted per partition.
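
The three steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the Amoro or Iceberg API: `DataFile`, `Manifest`, `candidate_manifests`, `expired_files_by_partition`, and `expire` are hypothetical names, partitions are assumed to be plain integers (for example, epoch days), and the retention condition is assumed to be "partition value is older than `expire_before`".

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one file entry inside a manifest."""
    path: str
    partition: int  # assumed: partition value as an integer, e.g. epoch day


@dataclass(frozen=True)
class Manifest:
    """Hypothetical stand-in for a manifest with partition bounds."""
    path: str
    min_partition: int  # lower partition bound of the files it tracks
    max_partition: int  # upper partition bound of the files it tracks
    files: Tuple[DataFile, ...]


def candidate_manifests(
    manifests: Iterable[Manifest], expire_before: int
) -> Iterator[Manifest]:
    """Step 1: keep only manifests whose bounds can include expired partitions."""
    for manifest in manifests:
        if manifest.min_partition < expire_before:
            yield manifest


def expired_files_by_partition(
    manifests: Iterable[Manifest], expire_before: int
) -> Dict[int, List[str]]:
    """Step 2: read candidates one at a time and group expired files by partition."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for manifest in candidate_manifests(manifests, expire_before):
        for data_file in manifest.files:
            if data_file.partition < expire_before:
                grouped[data_file.partition].append(data_file.path)
    return dict(grouped)


def expire(
    manifests: Iterable[Manifest], expire_before: int
) -> Iterator[Tuple[int, List[str]]]:
    """Step 3: emit one cleanup task (partition, file paths) per expired partition."""
    grouped = expired_files_by_partition(manifests, expire_before)
    for partition in sorted(grouped):
        yield partition, grouped[partition]


if __name__ == "__main__":
    manifests = [
        Manifest("m1.avro", 100, 101,
                 (DataFile("a.parquet", 100), DataFile("b.parquet", 101))),
        Manifest("m2.avro", 101, 103,
                 (DataFile("c.parquet", 101), DataFile("d.parquet", 103))),
        Manifest("m3.avro", 104, 105, (DataFile("e.parquet", 104),)),
    ]
    for partition, paths in expire(manifests, expire_before=102):
        print(partition, paths)
```

In this sketch, a manifest whose lower bound is already past the retention threshold is never opened, and only file paths and partition values are held in memory rather than per-column statistics. A real implementation would also need to handle delete files and multi-field partition specs, which are left out here.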

Are you willing to submit PR?

  • [x] Yes I am willing to submit a PR!

Subtasks

No response

Code of Conduct

xxubai, Aug 25 '25 14:08