iceberg
iceberg copied to clipboard
Spark: Remove backup table after a successful migrate action.
Once the MigrateTable action is called, a backup table is created. The backup table is used to restore the original table in case the MigrateTable action fails. This restoration is done by renaming the backup table to the original table. Therefore if the migrate operation fails there is no backup table lying around in the catalog. However, if the migration goes through successfully, the backup table is not cleaned up leaving the user to find a new unknown table in their catalog.
This PR cleans up the backup table after the migrate table action succeeds.
I tested this manually and I also added additional assertions to the existing unit tests to verify that the backup table is cleaned up properly.
@aokolnychyi , @rdblue , @danielcweeks Could you please take a look at this PR if you get a chance? Thanks.
I did notice it the other day, and also thought it would be a good usability feature (especially as there is snapshot table that keeps the original table). I wonder if need to have a drop_backup flag or something, in case someone depends on original behavior somehow? Let's see if everyone is ok with it.
@RussellSpitzer were you the original author of the migrate action (#1525)? Was the purpose of the backup table simply for restoring it in case the migrate failed? If so, it would make sense to remove the backup if the migrate succeeded, right?
I'm -1 on this. Not removing the backup table is intentional so that the user could choose which table to move forward with. Our docs included a step to "remove the backup table" once everything was running smoothly.
I'm fine adding this as an option, but I don't think it should be the default.
@rdblue can you please point me to the doc with a step to remove the backup table? I do not see it in https://iceberg.apache.org/docs/latest/spark-procedures/#migrate.
@rdblue In that case how would you feel about letting the user decide on the name of the backup table instead of the generic backup suffix appended to the table name? I think we should at least document this behavior so that users are not surprised to see a new table in their catalog that they did not create. I can also do a documentation update if that is what is more appropriate.
Docs are fine, but to me it would make more sense to add a non-default flag to drop_backup (as mentioned above), if we want to do add a feature here. I dont see letting the user decide name for backup is that helpful?
Docs are fine, but to me it would make more sense to add a non-default flag to drop_backup (as mentioned above), if we want to do add a feature here. I dont see letting the user decide name for backup is that helpful?
Sounds good. @szehon-ho I will update the PR to add the drop_table flag and make it optional. Thanks.
I understand the thinking behind keeping the backup table after a successful migrate action and having the user decide what to do with it. If somehow this isn't already documented, then @sririshindra you could open a doc PR to document it. I think giving the user an option to remove the backup table automatically upon a successful migrate action is still useful. We can have a boolean parameter, defaulting to false, for removing the backup table, and pass it in the migrate procedure. I'm neutral on having an option for naming the backup table.
Edit: I hadn't updated my browser and didn't see @szehon-ho's comment before posting; I agree with him.
@szehon-ho Please take another look at this PR when you get a chance. Thanks.
@szehon-ho Gentle ping to review this PR. Thanks.
Thanks for the changes, will merge tomorrow if no further comments.
Merged, thanks @sririshindra for the changes