[Core][Spark] Improve DeleteOrphanFiles action to return additional details of deleted orphan files
What changes are proposed in this PR?
This PR adds a new interface, OrphanFileStatus, that indicates whether an orphan file was deleted or not. In case of a failure during file deletion, it provides a reference to the encountered exception.
With this additional information, a user can choose to record the failure and retry the deletion process in a controlled fashion in the future.
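For illustration, a minimal sketch of what such a per-file status could look like; the method and package names here are assumptions, and the exact shape is defined by the PR itself:

```java
// Illustrative sketch only; names and package are assumptions, not the PR's exact API.
package org.apache.iceberg.actions;

public interface OrphanFileStatus {
  /** Location of the orphan file that the action attempted to delete. */
  String location();

  /** Whether the file was actually deleted (always false for a dry run). */
  boolean deleted();

  /** The exception encountered while deleting the file, or null if none. */
  Throwable failureCause();
}
```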
Why are the changes needed?
During the execution of the DeleteOrphanFiles Spark action or the remove_orphan_files SQL procedure, a failure encountered while deleting an orphan file is not bubbled up to the user, nor is there any indication of the failure in the returned result.
The return value of the Spark action is Iterable<String>, and the SQL procedure simply displays a list of orphan files in a table. With this limited information, the only way for the user to know whether an orphan file was deleted or not is to
- grep the logs for the warning
- for each of the locations returned in the iterable, query the cloud storage for its existence (a sketch of this workaround follows the links below)
https://github.com/apache/iceberg/blob/71282b8ca7d0c703e4fd4ad460821eaec52124ce/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseDeleteOrphanFilesSparkAction.java#L225-L230
https://github.com/apache/iceberg/blob/71282b8ca7d0c703e4fd4ad460821eaec52124ce/core/src/main/java/org/apache/iceberg/actions/BaseDeleteOrphanFilesActionResult.java#L31
https://github.com/apache/iceberg/blob/71282b8ca7d0c703e4fd4ad460821eaec52124ce/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/RemoveOrphanFilesProcedure.java#L57-L59
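Checking existence per returned location could look roughly like the following sketch; it uses the existing DeleteOrphanFiles action API and a Hadoop FileSystem probe, with setup details simplified:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.DeleteOrphanFiles;
import org.apache.iceberg.spark.actions.SparkActions;

public class CheckOrphanDeletes {
  // Runs the delete action and probes each returned location to see whether
  // the file is still present, i.e. whether its deletion likely failed.
  public static void run(Table table, long olderThanMillis) throws IOException {
    DeleteOrphanFiles.Result result = SparkActions.get()
        .deleteOrphanFiles(table)
        .olderThan(olderThanMillis)
        .execute();

    Configuration conf = new Configuration();
    for (String location : result.orphanFileLocations()) {
      Path path = new Path(location);
      FileSystem fs = path.getFileSystem(conf);
      if (fs.exists(path)) {
        // The file still exists, so the delete most likely failed.
        System.out.println("Delete appears to have failed for: " + location);
      }
    }
  }
}
```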
One can re-run the expensive delete action to delete the files that failed during the previous run; however, that is not very desirable, since re-listing the directory contents to identify the orphan files results in duplicate work and wastes precious API calls. One of the common causes of delete failures in the public cloud is hitting API quotas, and unnecessary re-runs of the delete action can lead to a temporary denial of access for other workloads accessing the same storage resource.
Output
Before this change
DryRun => true/false
scala> sql("CALL spark_catalog.system.remove_orphan_files(table => 't1', older_than => TIMESTAMP '2022-05-21 00:00:00.000', dry_run => true)").show(false)
+--------------------------------------------------------------------------------------------------------------------------+
|orphan_file_location |
+--------------------------------------------------------------------------------------------------------------------------+
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-705370fd-ea3e-4ae7-8b40-c0362b0ba7df-c000.snappy.parquet|
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-705370fd-ea3e-4ae7-8b40-c0362b0ba7df-c000.snappy.parquet|
+--------------------------------------------------------------------------------------------------------------------------+
After this change
DryRun => true
scala> sql("CALL spark_catalog.system.remove_orphan_files(table => 't1', older_than => TIMESTAMP '2022-05-21 00:00:00.000', dry_run => true)").show(false)
+--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
|orphan_file_location |deleted|error_message|
+--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|false |null |
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|false |null |
+--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
DryRun => false
scala> sql("CALL spark_catalog.system.remove_orphan_files(table => 't1', older_than => TIMESTAMP '2022-05-21 00:00:00.000', dry_run => false)").show(false)
+--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
|orphan_file_location |deleted|error_message|
+--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|true |null |
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|true |null |
+--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
DryRun => false, simulating failure during delete
+--------------------------------------------------------------------------------------------------------------------------+-------+---------------------------------------+
|orphan_file_location |deleted|error_message |
+--------------------------------------------------------------------------------------------------------------------------+-------+---------------------------------------+
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-615125af-bfbb-4279-b22b-559dee4f0b13-c000.snappy.parquet|true |null |
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-705370fd-ea3e-4ae7-8b40-c0362b0ba7df-c000.snappy.parquet|false |simulating failure during file deletion|
+--------------------------------------------------------------------------------------------------------------------------+-------+---------------------------------------+
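With the deleted and error_message columns available, failed deletions can be retried in a targeted way instead of re-running the whole action. A hedged sketch, assuming the action's result exposes per-file statuses shaped like the interface sketch above (OrphanFileStatus and its methods are assumptions, not confirmed names):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.Table;

public class RetryFailedOrphanDeletes {
  // Collects locations whose delete failed and retries them directly through
  // the table's FileIO, avoiding another expensive directory listing.
  public static void retry(Table table, Iterable<OrphanFileStatus> statuses) {
    List<String> failed = new ArrayList<>();
    for (OrphanFileStatus status : statuses) {
      if (!status.deleted()) {
        failed.add(status.location());
      }
    }

    for (String location : failed) {
      table.io().deleteFile(location);
    }
  }
}
```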
Hi @RussellSpitzer @aokolnychyi, a gentle ping for the review.
I can take a look at this after #4652 is in.
Same
Hi @RussellSpitzer and @aokolnychyi, since #4652 is merged, can you please take a look at this PR?
Thanks for the reviews @sririshindra @szehon-ho, really appreciate it.
Due to an internal redistribution of tasks, ownership of some GitHub PRs has changed. I am closing this PR; moving forward, @wypoon or @sririshindra will be leading the changes suggested in this PR. They will pull the public branch associated with this PR, add their commits on top of it, and create a new PR to ease the logistics of the transition.
Thanks.