API: Add RemoveUnusedSpecs in Table
This is a continue work of #3462, all the credits should goes to @RussellSpitzer.
Previously there was no way to remove partition specs from a table once they were added. To fix this we add an api which searches through all reachable manifest files and records their specsIds. Any specIds which do not find are marked for removal which is done through a serializable commit.
@amogh-jahagirdar @RussellSpitzer @aokolnychyi @szehon-ho it would be great if you guys could take a look at this.
This PR is raised based the previous discussion in https://github.com/apache/iceberg/pull/10352#discussion_r1605818799
I don't think I'd go with a general SetPartitionSpecs update, I think I'd have a RemovePartitionSpec, and the TableMetadata builder APIs to remove a given spec (which will have validation that we're not removing the current spec, the spec to remove is a valid partition spec etc).
This is a nice suggestion and better approach. I will change the withSpecs/setSpecs by using removePartitionSpec.
Replying other comments inline.
- Should we included removing unused schemas?
No, and I think we are on the same page. PruneUnusedSchemas should be in a separate and dedicated action. However, it might not be possible to do that in current code as there's no SchemaId bounding to DataFile/DeleteFile. It's impossible to decide which schemas are unused?
In this approach I think we'd have to do a REST spec change to introduce the new update type. If we think this API is worth it for general purpose metadata cleaning, beyond preventing the drop column issue then we'd probably have to go through with the spec change. However, if this is only being introduced for the drop column issue, maybe we want to think through lighter weight options to solve that particular issue?
I don't think we should expose removePartitionSpec to the REST catalog, at least for now. I think RemoveUnusedSpecs is a general metadata cleaning API and worth introducing, however the API is self-contained and doesn't have to be coupled with REST catalog. Like the comment in the TableMetadata, it's not safe for external client to simply call removePartitionSpec without checking the spec is unused, which might not an easy task in the REST catalog. Unless we think it's worthy to add check by reading manifest files in the REST catalog, in that way, we may expose the removePartitionSpec to the REST catalog. That's why the org.apache.iceberg.MetadataUpdate.SetPartitionSpecs#applyTo method throws an exception instead of actually implementing it.
If introducing a new MetadataUpdate implies a REST spec change, I think we can change the TableMetadata.Builder#build to build without adding a RemovePartitionSpec change, how does that sounds to you?
Close and re-open to trigger the CI.
Also gently ping @amogh-jahagirdar @RussellSpitzer to take another look.
@advancedxy Just FYI, I've raised #10846 for the rest spec changes, I'll also start a mailing list discussion on the new update type.
@advancedxy Just FYI, I've raised #10846 for the rest spec changes, I'll also start a mailing list discussion on the new update type.
Thank you for taking the effort, that's great.
@amogh-jahagirdar @RussellSpitzer and @rdblue since #10846 was already merged, let's restart this PR.
Do you think we should add the RemovePartitionSpec support in the REST catalog in this PR as well or we can do that in a follow-up PR?
Gently ping @amogh-jahagirdar @rdblue @RussellSpitzer
Sorry for missing following up on this @advancedxy yes I think we should send the new update type as part of this PR.that would require some parser changes as well for the update type. Let me know if any help is needed, I can try taking a look at this tomorrow if we wanted to get in the parser changes first and then rebase this PR on top of it so this PR is more focused on the operation parts.
yes I think we should send the new update type as part of this PR.that would require some parser changes as well for the update type. Let me know if any help is needed, I can try taking a look at this tomorrow if we wanted to get in the parser changes first
OK, no problem. Thanks for offering the help, I think I can add these relevant codes in this PR as well. We can split the parser changes as a separate PR if needed.
@amogh-jahagirdar Updated, It would be great if you can take another look at this.
Gently ping @amogh-jahagirdar @RussellSpitzer @rdblue
Sorry for the delay @advancedxy , I'll take a look at this first thing tomorrow morning!
@advancedxy I updated the PR to your branch https://github.com/advancedxy/iceberg/pull/1 in case there was still agreement on adding all of this metadata cleanup as part of snapshot expiration. (Sorry accidentally hit the close button, I meant to hit the comment button, just re-opened)
@nastra Thanks for reviewing, all your comments should be addressed.
@nastra @amogh-jahagirdar PTAL again, I think all your comments are addressed.
@amogh-jahagirdar @nastra since all comments are addressed, do you think it's ready to merge? I'm happy to address more comments if needed.
Thanks @advancedxy , I'll leave it open for a bit since @danielcweeks also mentioned he would take a look. But from my side everything LGTM.
@danielcweeks would you mind to take a look at this?
@rdblue @danielcweeks @amogh-jahagirdar would you mind to take a look at this again?
I'll go ahead and merge, the remaining comment(s) are something we can clean up in a follow on! Thank you for your patience and great work @advancedxy , really appreciate it! thanks to reviewers as well, @RussellSpitzer , @nastra @rdblue!
Thanks all for reviewing.
@advancedxy Thank you for your change and @gaborkaszab thank you for your change on https://github.com/apache/iceberg/pull/12089. I wonder are you aware of any work to further integrate this to Spark remove snapshots procedure?
@advancedxy Thank you for your change and @gaborkaszab thank you for your change on #12089. I wonder are you aware of any work to further integrate this to Spark remove snapshots procedure?
Hi @puchengy ,
Thanks for asking! I did bring this up on the dev@ list and I got some support, however there was one objection too. Here is the conversation. Would you mind showing your interest also on that dev@ conversation? I'll resume the conversation then giving people seem to be interested in this enhancement. Thanks!