iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

API: Add RemoveUnusedSpecs in Table

Open advancedxy opened this issue 1 year ago • 6 comments

This is a continue work of #3462, all the credits should goes to @RussellSpitzer.

Previously there was no way to remove partition specs from a table once they were added. To fix this we add an api which searches through all reachable manifest files and records their specsIds. Any specIds which do not find are marked for removal which is done through a serializable commit.

advancedxy avatar Jul 23 '24 13:07 advancedxy

@amogh-jahagirdar @RussellSpitzer @aokolnychyi @szehon-ho it would be great if you guys could take a look at this.

This PR is raised based the previous discussion in https://github.com/apache/iceberg/pull/10352#discussion_r1605818799

advancedxy avatar Jul 23 '24 13:07 advancedxy

I don't think I'd go with a general SetPartitionSpecs update, I think I'd have a RemovePartitionSpec, and the TableMetadata builder APIs to remove a given spec (which will have validation that we're not removing the current spec, the spec to remove is a valid partition spec etc).

This is a nice suggestion and better approach. I will change the withSpecs/setSpecs by using removePartitionSpec.

Replying other comments inline.

  1. Should we included removing unused schemas?

No, and I think we are on the same page. PruneUnusedSchemas should be in a separate and dedicated action. However, it might not be possible to do that in current code as there's no SchemaId bounding to DataFile/DeleteFile. It's impossible to decide which schemas are unused?

In this approach I think we'd have to do a REST spec change to introduce the new update type. If we think this API is worth it for general purpose metadata cleaning, beyond preventing the drop column issue then we'd probably have to go through with the spec change. However, if this is only being introduced for the drop column issue, maybe we want to think through lighter weight options to solve that particular issue?

I don't think we should expose removePartitionSpec to the REST catalog, at least for now. I think RemoveUnusedSpecs is a general metadata cleaning API and worth introducing, however the API is self-contained and doesn't have to be coupled with REST catalog. Like the comment in the TableMetadata, it's not safe for external client to simply call removePartitionSpec without checking the spec is unused, which might not an easy task in the REST catalog. Unless we think it's worthy to add check by reading manifest files in the REST catalog, in that way, we may expose the removePartitionSpec to the REST catalog. That's why the org.apache.iceberg.MetadataUpdate.SetPartitionSpecs#applyTo method throws an exception instead of actually implementing it.

If introducing a new MetadataUpdate implies a REST spec change, I think we can change the TableMetadata.Builder#build to build without adding a RemovePartitionSpec change, how does that sounds to you?

advancedxy avatar Jul 24 '24 03:07 advancedxy

Close and re-open to trigger the CI.

Also gently ping @amogh-jahagirdar @RussellSpitzer to take another look.

advancedxy avatar Jul 26 '24 13:07 advancedxy

@advancedxy Just FYI, I've raised #10846 for the rest spec changes, I'll also start a mailing list discussion on the new update type.

amogh-jahagirdar avatar Aug 01 '24 19:08 amogh-jahagirdar

@advancedxy Just FYI, I've raised #10846 for the rest spec changes, I'll also start a mailing list discussion on the new update type.

Thank you for taking the effort, that's great.

advancedxy avatar Aug 02 '24 02:08 advancedxy

@amogh-jahagirdar @RussellSpitzer and @rdblue since #10846 was already merged, let's restart this PR. Do you think we should add the RemovePartitionSpec support in the REST catalog in this PR as well or we can do that in a follow-up PR?

advancedxy avatar Aug 29 '24 02:08 advancedxy

Gently ping @amogh-jahagirdar @rdblue @RussellSpitzer

advancedxy avatar Sep 18 '24 02:09 advancedxy

Sorry for missing following up on this @advancedxy yes I think we should send the new update type as part of this PR.that would require some parser changes as well for the update type. Let me know if any help is needed, I can try taking a look at this tomorrow if we wanted to get in the parser changes first and then rebase this PR on top of it so this PR is more focused on the operation parts.

amogh-jahagirdar avatar Sep 20 '24 01:09 amogh-jahagirdar

yes I think we should send the new update type as part of this PR.that would require some parser changes as well for the update type. Let me know if any help is needed, I can try taking a look at this tomorrow if we wanted to get in the parser changes first

OK, no problem. Thanks for offering the help, I think I can add these relevant codes in this PR as well. We can split the parser changes as a separate PR if needed.

advancedxy avatar Sep 20 '24 02:09 advancedxy

@amogh-jahagirdar Updated, It would be great if you can take another look at this.

advancedxy avatar Sep 23 '24 08:09 advancedxy

Gently ping @amogh-jahagirdar @RussellSpitzer @rdblue

advancedxy avatar Sep 30 '24 06:09 advancedxy

Sorry for the delay @advancedxy , I'll take a look at this first thing tomorrow morning!

amogh-jahagirdar avatar Oct 01 '24 03:10 amogh-jahagirdar

@advancedxy I updated the PR to your branch https://github.com/advancedxy/iceberg/pull/1 in case there was still agreement on adding all of this metadata cleanup as part of snapshot expiration. (Sorry accidentally hit the close button, I meant to hit the comment button, just re-opened)

amogh-jahagirdar avatar Oct 31 '24 18:10 amogh-jahagirdar

@nastra Thanks for reviewing, all your comments should be addressed.

advancedxy avatar Nov 18 '24 12:11 advancedxy

@nastra @amogh-jahagirdar PTAL again, I think all your comments are addressed.

advancedxy avatar Nov 25 '24 02:11 advancedxy

@amogh-jahagirdar @nastra since all comments are addressed, do you think it's ready to merge? I'm happy to address more comments if needed.

advancedxy avatar Dec 04 '24 15:12 advancedxy

Thanks @advancedxy , I'll leave it open for a bit since @danielcweeks also mentioned he would take a look. But from my side everything LGTM.

amogh-jahagirdar avatar Dec 04 '24 16:12 amogh-jahagirdar

@danielcweeks would you mind to take a look at this?

advancedxy avatar Dec 13 '24 12:12 advancedxy

@rdblue @danielcweeks @amogh-jahagirdar would you mind to take a look at this again?

advancedxy avatar Jan 06 '25 14:01 advancedxy

I'll go ahead and merge, the remaining comment(s) are something we can clean up in a follow on! Thank you for your patience and great work @advancedxy , really appreciate it! thanks to reviewers as well, @RussellSpitzer , @nastra @rdblue!

amogh-jahagirdar avatar Jan 07 '25 18:01 amogh-jahagirdar

Thanks all for reviewing.

advancedxy avatar Jan 09 '25 03:01 advancedxy

@advancedxy Thank you for your change and @gaborkaszab thank you for your change on https://github.com/apache/iceberg/pull/12089. I wonder are you aware of any work to further integrate this to Spark remove snapshots procedure?

puchengy avatar May 10 '25 14:05 puchengy

@advancedxy Thank you for your change and @gaborkaszab thank you for your change on #12089. I wonder are you aware of any work to further integrate this to Spark remove snapshots procedure?

Hi @puchengy ,

Thanks for asking! I did bring this up on the dev@ list and I got some support, however there was one objection too. Here is the conversation. Would you mind showing your interest also on that dev@ conversation? I'll resume the conversation then giving people seem to be interested in this enhancement. Thanks!

gaborkaszab avatar May 12 '25 10:05 gaborkaszab