Core: Fix confusing log from RemoveSnapshots
By default, ExpireSnapshotsSparkAction cleans up the expired files, but the log says that it is not cleaning them:

22/08/09 11:03:43 INFO RemoveSnapshots: Cleaning up manifest and data files disabled, leaving them in place

This log confuses users. ExpireSnapshotsSparkAction calls code that is shared by the non-distributed and distributed expire-snapshot jobs, so this PR updates the shared code to print the cleanup logs only when the files are actually cleaned up.
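For context, a minimal sketch of the call pattern behind the confusing message, assuming the Spark action disables core-side cleanup through the public ExpireSnapshots.cleanExpiredFiles(boolean) API and then deletes the files itself (the class below is illustrative, not the actual action source):

```java
import org.apache.iceberg.Table;

// Illustrative sketch: the Spark action turns off cleanup in the shared core
// code, which makes core log "cleanup disabled", and then removes the expired
// files itself in a distributed job.
class SparkExpirePatternSketch {
  void expire(Table table) {
    table
        .expireSnapshots()
        .cleanExpiredFiles(false) // core logs "Cleaning up manifest and data files disabled, ..."
        .commit();
    // ... the action then computes and deletes the expired files, so the
    // files are in fact cleaned up despite the log ...
  }
}
```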
I don't agree with this change. I think it's good to know that this is skipped, rather than not being able to distinguish between missing logs and not trying to remove the files.
We could just move the log line into the "clean expired" branch of the if statement and log "Local removal of files enabled. Cleaning up manifest and data files"?
That way you can differentiate between "cleanup disabled" and "empty"?
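A minimal sketch of this suggestion, assuming a cleanExpiredFiles flag in the shared core code (names and structure are illustrative, not the actual RemoveSnapshots source):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch: log only inside the branch that actually removes
// files, so a missing message can no longer be mistaken for disabled cleanup.
class RemoveSnapshotsLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(RemoveSnapshotsLoggingSketch.class);

  private final boolean cleanExpiredFiles; // set via cleanExpiredFiles(boolean)

  RemoveSnapshotsLoggingSketch(boolean cleanExpiredFiles) {
    this.cleanExpiredFiles = cleanExpiredFiles;
  }

  void commit() {
    // ... update table metadata to drop the expired snapshots ...
    if (cleanExpiredFiles) {
      LOG.info("Local removal of files enabled. Cleaning up manifest and data files");
      // ... find and delete unreachable files; a second log here can flag the
      // "nothing to clean up" (empty) case ...
    }
    // No unconditional "cleanup disabled" message here, since callers like the
    // Spark action disable local cleanup and remove the files themselves.
  }
}
```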
@RussellSpitzer: Thanks for the suggestion. I added a log at the proper place to identify the empty case.
@rdblue: If the logs are not present, we can assume cleanup is disabled, right? If it is not disabled, some cleanup-related logs will be printed. For users who run ExpireSnapshotsAction, this log was confusing, as mentioned in the description. Also, for ExpireSnapshotsAction, internal cleanup is always disabled.
I would very much like to distinguish between "didn't try" and "nothing to clean up" -- even when "nothing to clean up" is because there were no expired files.
For the log that is confusing users because distributed engines (in this case, Spark) then go on to do some cleanup, maybe we can update it to say something like: we're "skipping cleanup during the current iteration; cleanup might occur later if this action is run from a distributed processing engine"? Then possibly add logs to Spark as well?
> I would very much like to distinguish between "didn't try" and "nothing to clean up" -- even when "nothing to clean up" is because there were no expired files.
Maybe we need a flag for delayedFileCleanup, or just generally runningViaDistributedAction -- and then let Spark / engines control the logs?
I don't love it, but I'm throwing it out there, as the current logging is admittedly confusing when the Spark action runs. And the Spark action / procedure is likely the most common way users interact with this action.
EDIT - Given that snapshot expiration can happen at the end of any write, I see the significance of these logs, and the Spark action is arguably not the most common way this occurs, though the most common way that users explicitly run this action is still likely through Spark.
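A hedged sketch of this flag-based alternative; delayedFileCleanup and runningViaDistributedAction are names floated in this thread, not existing Iceberg APIs:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only: a caller such as a distributed engine would set the
// hypothetical delayedFileCleanup flag so the core code can log accurately
// instead of claiming the files are left in place.
class FlaggedCleanupLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(FlaggedCleanupLoggingSketch.class);

  private boolean cleanExpiredFiles = true;
  private boolean delayedFileCleanup = false; // hypothetical, set by Spark / engines

  void logCleanupDecision() {
    if (cleanExpiredFiles) {
      LOG.info("Cleaning up manifest and data files locally");
    } else if (delayedFileCleanup) {
      LOG.info("Skipping local cleanup; a distributed action will remove the files later");
    } else {
      LOG.info("Cleaning up manifest and data files disabled, leaving them in place");
    }
  }
}
```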
> Maybe we need a flag for delayedFileCleanup, or just generally runningViaDistributedAction -- and then let Spark / engines control the logs?
@kbendick: I originally had the same thought, but it makes things more complex because we would be linking core logic to callers.
I think the current changes in the PR are enough to clear up the confusion (they also differentiate "disabled" from "nothing to clean up").
@rdblue, @RussellSpitzer, @Fokko: Can this PR be merged?
ping @rdblue
The test case below could be flaky:
org.apache.iceberg.spark.TestFileRewriteCoordinator > testBinPackRewrite[catalogName = testhive
https://github.com/apache/iceberg/actions/runs/3249433325/jobs/5331831494#step:6:293
Thanks, @ajantha-bhat!