iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Core: Fix confusing log from RemoveSnapshots

Open ajantha-bhat opened this issue 2 years ago • 8 comments

By default, ExpireSnapshotsSparkAction cleans the expired files but the log says not cleaning the files.

22/08/09 11:03:43 INFO RemoveSnapshots: Cleaning up manifest and data files disabled, leaving them in place

So, this log is confusing the users.

ExpireSnapshotsSparkAction calls some code that is common for non-distributed and distributed expire snapshot job. So, update the common code to print the clean up logs only if the files are cleaned up.

ajantha-bhat avatar Aug 09 '22 09:08 ajantha-bhat

I don't agree with this change. I think it's good to know that this is skipped, rather than not being able to distinguish between missing logs and not trying to remove the files.

rdblue avatar Aug 09 '22 16:08 rdblue

We could just move the log line to the "clean expired" branch of the if statement, and log "Local removal of files enabled. Cleaning up manifest and data files"?

That way you can differentiate between "cleanup disabled" and "empty"?

RussellSpitzer avatar Aug 09 '22 16:08 RussellSpitzer

@RussellSpitzer : Thanks for the suggestion. I added a log at the proper place to identify the empty case.

@rdblue : If the logs are not present, we can assume cleanup is disabled? if it is not disabled, some logs related to clean-up will be printed. For the users who run the ExpireSnapshotsAction this log was confusing as mentioned in the description. Also for ExpireSnapshotsAction, internal clean-up is always disabled.

ajantha-bhat avatar Aug 09 '22 16:08 ajantha-bhat

I would like a lot to distinguish between "didn't try" and "nothing to clean up" -- even when "nothing to clean up" is because there were no expired files.

rdblue avatar Aug 09 '22 16:08 rdblue

For the log that is causing confusion for users because the distributed engines (in this case Spark) then go on to do some cleanup, maybe we can update the log to state something similar to we're "skipping clean up during current iteration, cleanup might occur later if this action is run from a distributed processing engine"? Then possibly add logs to Spark as well?

I would like a lot to distinguish between "didn't try" and "nothing to clean up" -- even when "nothing to clean up" is because there were no expired files.

Maybe we need a flag for delayedFileCleanup or just generally runningViaDistributedAction -- and then let Spark / engines control the logs?

I don't love it, but throwing it out there as the current logging is admittedly confusing when the Spark action runs. And the Spark action / procedure is likely the most common way users interact with this action.

EDIT - Given that snapshot expiration can happen at the end of any write, I see the significance of these logs and the Spark action is arguably not the most common way this occurs, though the most common way that users explicitly run this action is still likely through Spark.

kbendick avatar Aug 09 '22 18:08 kbendick

Maybe we need a flag for delayedFileCleanup or just generally runningViaDistributedAction -- and then let Spark / engines control the logs?

@kbendick : I originally had the same thought. But it is making things more complex as we are linking core logic to callers.
I think the current changes in the PR is enough to clear the confusion (It will also differentiates "disabled" vs "nothing to clean up")

ajantha-bhat avatar Aug 12 '22 06:08 ajantha-bhat

@rdblue, @RussellSpitzer, @Fokko : Can this PR be merged?

ajantha-bhat avatar Sep 09 '22 07:09 ajantha-bhat

ping @rdblue

ajantha-bhat avatar Sep 23 '22 16:09 ajantha-bhat

The below test case could be flaky!

org.apache.iceberg.spark.TestFileRewriteCoordinator > testBinPackRewrite[catalogName = testhive https://github.com/apache/iceberg/actions/runs/3249433325/jobs/5331831494#step:6:293

ajantha-bhat avatar Oct 14 '22 12:10 ajantha-bhat

Thanks, @ajantha-bhat!

rdblue avatar Oct 14 '22 20:10 rdblue