iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

[Core][Spark] Improve DeleteOrphanFiles action to return additional details of deleted orphan files

Open sumeetgajjar opened this issue 2 years ago • 4 comments

What changes are proposed in this PR?

This PR adds a new Interface OrphanFileStatus that indicates if an orphan file was deleted or not. In cases of failure during file deletion, It provides a reference to the encountered exception.

With this additional information, a user can choose to record this failure and retry the deletion process in a controlled fashion in the future.

Why are the changes are needed?

During the execution of the DeleteOrphanFiles spark action or remove_orphan_files SQL procedure, a failure encountered during deletion of the orphan file is not bubbled up to the user nor is there any indication of the failure in the returned result.

The return value of the Spark action is Iterable<String> and the SQL procedure simply displays a list of orphan files in a table. With this limited information, the only way for the user to know if the orphan file was deleted or not is to

  • grep the logs for the warning
  • for each of the locations returned in the iterable, query the cloud storage for its existence

https://github.com/apache/iceberg/blob/71282b8ca7d0c703e4fd4ad460821eaec52124ce/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseDeleteOrphanFilesSparkAction.java#L225-L230

https://github.com/apache/iceberg/blob/71282b8ca7d0c703e4fd4ad460821eaec52124ce/core/src/main/java/org/apache/iceberg/actions/BaseDeleteOrphanFilesActionResult.java#L31

https://github.com/apache/iceberg/blob/71282b8ca7d0c703e4fd4ad460821eaec52124ce/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/RemoveOrphanFilesProcedure.java#L57-L59

One can re-run the expensive delete action to delete the files that failed during the previous delete action run, however, that is not very desirable since re-listing the dir contents to identify the orphan files would result in duplicate work plus wastage of precious API calls. One of the common causes of delete failure in the public cloud is hitting the API quotas and unnecessary re-runs of the delete action can lead to a temporary denial of access to other workloads accessing the same storage resource.

Output

Before this change

DryRun => true/false

scala> sql("CALL spark_catalog.system.remove_orphan_files(table => 't1', older_than => TIMESTAMP '2022-05-21 00:00:00.000', dry_run => true)").show(false)
+--------------------------------------------------------------------------------------------------------------------------+
|orphan_file_location                                                                                                      |
+--------------------------------------------------------------------------------------------------------------------------+
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-705370fd-ea3e-4ae7-8b40-c0362b0ba7df-c000.snappy.parquet|
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-705370fd-ea3e-4ae7-8b40-c0362b0ba7df-c000.snappy.parquet|
+--------------------------------------------------------------------------------------------------------------------------+

After this change

DryRun => true

scala> sql("CALL spark_catalog.system.remove_orphan_files(table => 't1', older_than => TIMESTAMP '2022-05-21 00:00:00.000', dry_run => true)").show(false)
+--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
|orphan_file_location                                                                                                      |deleted|error_message|
+--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|false   |null         |
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|false   |null         |
+--------------------------------------------------------------------------------------------------------------------------+-------+-------------+

DryRun => false

scala> sql("CALL spark_catalog.system.remove_orphan_files(table => 't1', older_than => TIMESTAMP '2022-05-21 00:00:00.000', dry_run => false)").show(false)
+--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
|orphan_file_location                                                                                                      |deleted|error_message|
+--------------------------------------------------------------------------------------------------------------------------+-------+-------------+
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|true   |null         |
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-8251ec5f-dd9b-4754-8e68-2cdd01e66e56-c000.snappy.parquet|true   |null         |
+--------------------------------------------------------------------------------------------------------------------------+-------+-------------+

DryRun => false, simulating failure during delete

+--------------------------------------------------------------------------------------------------------------------------+-------+---------------------------------------+
|orphan_file_location                                                                                                      |deleted|error_message                          |
+--------------------------------------------------------------------------------------------------------------------------+-------+---------------------------------------+
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00000-615125af-bfbb-4279-b22b-559dee4f0b13-c000.snappy.parquet|true   |null                                   |
|file:/tmp/iceberg_warehouse/default/t1/non_table_files/part-00001-705370fd-ea3e-4ae7-8b40-c0362b0ba7df-c000.snappy.parquet|false  |simulating failure during file deletion|
+--------------------------------------------------------------------------------------------------------------------------+-------+---------------------------------------+

sumeetgajjar avatar May 19 '22 17:05 sumeetgajjar

Hi @RussellSpitzer @aokolnychyi A gentle ping for the review.

sumeetgajjar avatar May 19 '22 17:05 sumeetgajjar

I can take a look at this after #4652 is in.

aokolnychyi avatar Jun 22 '22 20:06 aokolnychyi

I can take a look at this after #4652 is in.

Same

RussellSpitzer avatar Jun 23 '22 16:06 RussellSpitzer

I can take a look at this after #4652 is in.

Hi @RussellSpitzer and @aokolnychyi, since #4652 is merged can you please take a look at this PR?

sumeetgajjar avatar Jul 28 '22 16:07 sumeetgajjar

Thanks for the reviews @sririshindra @szehon-ho, really appreciate it.

There was an internal redistribution of tasks and a change of ownership of some GitHub PRs due to internal changes. I am closing this PR and moving forward @wypoon or @sririshindra would be going leading the changes suggested in this PR. They would pull the public branch associated with this PR, add their commits on top of it and create a new PR to ease the logistics of the transition.

Thanks.

sumeetgajjar avatar Jan 09 '23 22:01 sumeetgajjar