[feature request] Support reading equality delete files
Feature Request / Improvement
Only position deletes are supported right now: https://github.com/apache/iceberg-python/blob/e5a58b34dd830c6ffea11649613b693f70f7cbb4/pyiceberg/table/__init__.py#L1418
Let's also add support for reading equality deletes.
Position delete PR: https://github.com/apache/iceberg/pull/6775
Thanks @kevinjqliu, I can work on this issue
This will be a fantastic addition to PyIceberg! Thank you for raising this issue @kevinjqliu and @Zyiqin-Miranda 🎉
Thanks @kevinjqliu and @sungwy. I'm starting to add equality delete support to the current plan_files function. I'm not sure whether the current _InclusiveMetricsEvaluator can be used directly to determine whether an equality delete file is relevant to a given data file.
Seems like Iceberg Java uses canContainEqDeletesForFile instead.
My understanding is that position deletes can use the bounds of the file_path column to filter out irrelevant files quickly: when lower_bound == upper_bound, the delete file references exactly one data file. Equality deletes don't have this advantage, so an equality delete can be relevant to any data file within the same partition. Thanks in advance for any insights here!
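To make the file_path bounds idea concrete, here is a minimal sketch. `PositionDeleteStats` and `may_reference` are hypothetical names used only for illustration; they are not part of PyIceberg, and the real metrics are stored as serialized bounds per column rather than plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PositionDeleteStats:
    """Hypothetical stand-in for the file_path column bounds of one position delete file."""

    file_path_lower: Optional[str]
    file_path_upper: Optional[str]


def may_reference(stats: PositionDeleteStats, data_file_path: str) -> bool:
    """Return False only when the bounds prove the delete file cannot reference the data file."""
    if stats.file_path_lower is None or stats.file_path_upper is None:
        # Without bounds nothing can be ruled out, so the delete file must be kept.
        return True
    # Inclusive range check; when lower == upper this is an exact path match.
    return stats.file_path_lower <= data_file_path <= stats.file_path_upper


single_target = PositionDeleteStats("s3://bucket/a.parquet", "s3://bucket/a.parquet")
print(may_reference(single_target, "s3://bucket/a.parquet"))
print(may_reference(single_target, "s3://bucket/b.parquet"))
```

An equality delete file has no comparable file_path column, so this kind of pruning does not carry over to it.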
The Equality Delete Files and Scan Planning sections of the Iceberg spec are good docs for this.
My general understanding is that equality deletes are applied to all data files (across all partitions, if partitioned).
Position delete files must be applied to data files from the same commit, when the data and delete file data sequence numbers are equal. This allows deleting rows that were added in the same commit.
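As I read the Scan Planning section, the sequence number check differs between the two delete types: position deletes apply when the data file's data sequence number is less than or equal to the delete file's, while equality deletes require it to be strictly less. A rough sketch of just that check follows. `DeleteKind`, `DeleteFile`, and `applies_by_sequence_number` are hypothetical names, not PyIceberg APIs, and partition matching is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class DeleteKind(Enum):
    POSITION = "position"
    EQUALITY = "equality"


@dataclass(frozen=True)
class DeleteFile:
    """Hypothetical minimal view of a delete file: its kind and data sequence number."""

    kind: DeleteKind
    data_sequence_number: int


def applies_by_sequence_number(data_seq: int, delete: DeleteFile) -> bool:
    """Sequence number check only; partition matching is out of scope for this sketch."""
    if delete.kind is DeleteKind.POSITION:
        # Position deletes may target rows added in the same commit (equal sequence numbers).
        return data_seq <= delete.data_sequence_number
    # Equality deletes only apply to strictly older data files.
    return data_seq < delete.data_sequence_number


same_commit_position = DeleteFile(DeleteKind.POSITION, data_sequence_number=5)
same_commit_equality = DeleteFile(DeleteKind.EQUALITY, data_sequence_number=5)
print(applies_by_sequence_number(5, same_commit_position))
print(applies_by_sequence_number(5, same_commit_equality))
```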
@Zyiqin-Miranda is there any progress on supporting equality deletes in PyIceberg?
@sfc-gh-mrojas https://github.com/apache/iceberg-python/pull/2255