amoro
[Improvement]: Manifest-driven and partition-aware data expiration
Search before asking
- [x] I have searched in the issues and found no similar issues.
What would you like to be improved?
The previous implementation performed a global scan on partitioned tables, which often caused out-of-memory (OOM) errors on large Iceberg tables. The main reasons are:
- High memory consumption when a table contains a large number of files;
- Loading too many column stats, especially from delete files;
- Lack of filtering on the tables that actually need to be processed.
How should we improve?
We propose a manifest-based, partition-aware data expiration approach:
- Identify candidate manifest files based on their partition boundaries and expire files that do not meet retention conditions;
- Iterate through manifest files sequentially to collect partition and file-level information;
- Perform expiration in a partition-by-partition manner, which allows submitting cleanup tasks per partition.
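To make the proposed flow concrete, here is a minimal sketch of the three steps above. It is plain Python over in-memory stand-ins, not the Amoro or Iceberg API: `Manifest`, `DataFileEntry`, the day-number partition values, and the `submit` callback are all hypothetical names chosen for illustration, and the retention rule is assumed to be "partition day is older than a cutoff".

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical stand-in for one file entry inside a manifest."""
    path: str
    partition_day: int  # assumed partition value: days since epoch


@dataclass(frozen=True)
class Manifest:
    """Hypothetical stand-in for a manifest with partition bounds."""
    path: str
    min_day: int  # lower partition bound recorded for the manifest
    max_day: int  # upper partition bound recorded for the manifest
    entries: Tuple[DataFileEntry, ...]


def candidate_manifests(manifests: Iterable[Manifest], cutoff_day: int) -> List[Manifest]:
    """Step 1: keep only manifests whose lower bound is older than the cutoff.

    A manifest whose oldest partition is already within retention cannot
    contain expired files, so its entries are never read.
    """
    return [m for m in manifests if m.min_day < cutoff_day]


def expired_files_by_partition(
    manifests: Iterable[Manifest], cutoff_day: int
) -> Dict[int, List[str]]:
    """Step 2: read candidate manifests one at a time and group expired files."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for manifest in candidate_manifests(manifests, cutoff_day):
        for entry in manifest.entries:
            if entry.partition_day < cutoff_day:
                grouped[entry.partition_day].append(entry.path)
    return dict(grouped)


def expire(
    manifests: Iterable[Manifest],
    cutoff_day: int,
    submit: Callable[[int, List[str]], None],
) -> int:
    """Step 3: submit one cleanup task per partition, oldest partition first."""
    grouped = expired_files_by_partition(manifests, cutoff_day)
    for day in sorted(grouped):
        submit(day, grouped[day])
    return len(grouped)
```

In this sketch only the paths of expired files are held in memory, grouped by partition, and column stats are never touched. A real implementation would still need to handle delete files and manifests that span both expired and retained partitions.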
Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
Subtasks
No response
Code of Conduct
- [x] I agree to follow this project's Code of Conduct