iceberg Core: Fix the behavior of IncrementalFileCleanup when expire a snapshot

This PR fix the behavior of IncrementalFileCleanup, which will lead in useful data files being mistakenly cleaned up when expire a specified snapshot.

Fix issue: #10982

Aug 21 '24 06:08 hantangwangd

So when I wrote the Spark procedure for this we were already aware that this code path has a lot of potential issues. We end up basically completely rewriting the logic of detecting what is used and what is not used here which has made it a bit difficult to keep up to date. I wonder if now is the time to just re-implement it to work more like the Spark Action and just take the diff between reachable before and after.

That said I'll try to take a look at see if the fixes make sense.

Aug 28 '24 06:08 RussellSpitzer

@amogh-jahagirdar, @hantangwangd, I'm not sure that incremental cleanup is doing anything wrong here. Incremental cleanup deletes data files when the snapshot that removed them from the table is expired. If you want something safer, then the reachable strategy should be used.

I think there's a fair argument that you should not be able to use the API to so easily get in this state, but rather than changing how incremental cleanup works, I would recommend making it so the API doesn't allow expiring a specific snapshot by ID.

Aug 28 '24 23:08 rdblue

@hantangwangd I thought about this a bit more and while I still think updating the IncrementalFileCleanup to address this particular case is possible, I think the question more so is "Should we?".

The main problem we want to address is that the ExpireSnapshots implementation should ensure that a user doesn't end up in a state where they cleanup files that are still referenced. Looking at what's going on in Presto we're actually casting to RemoveSnapshots implementation and the implementation exposes withIncrementalCleanup.

The reason for exposing that in the implementation (and not the interface) was when we were doing branching/tagging we wanted to verify in tests that both cleanup strategies work as expected so it was easy to parameterize around that API. Ideally we wouldn't need to expose such a method and in the implementation intelligently choose the right strategy, but the validation from tests is really important so the method on the implementation class ended up being exposed.

So considering withIncrementalCleanup is exposed to someone who casts to RemoveSnapshots then we can either choose to complicate the Incremental cleanup with more logic to handle this specific case or we can accept that incremental cleanup has these limitations and from an API perspective ensure that callers don't shoot themselves in the foot and removing files which shouldn't be removed by triggering incremental cleanup when there's an arbitrary snapshot that's removed (what @rdblue suggested).

The incremental cleanup logic is already quite complex and I'm now thinking it's not really worth it to add handling this particular case. There's probably more cases and then if we were to keep adding handling for that at some point the implementation will look like the reachable cleanup anyways.

@hantangwangd would you be open to changing the approach to just ensure that if expireSnapshotId is called and incrementalCleanup is set to true we fail before triggering the file cleanup? Also if incremental cleanup isn't specified I think incrementalCleanup should be set to false in the case a specific snapshotID is specified in expireSnapshotId. That achieves the safety guarantees I think we want.

https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/RemoveSnapshots.java#L323

Aug 29 '24 05:08 amogh-jahagirdar

Thank you very much for your guidance @rdblue, it makes sense to avoid users falling into such a state in advance.

Aug 29 '24 06:08 hantangwangd

The incremental cleanup logic is already quite complex and I'm now thinking it's not really worth it to add handling this particular case. There's probably more cases and then if we were to keep adding handling for that at some point the implementation will look like the reachable cleanup anyways.

Yes, it is indeed the case, there's probably more cases. Considering the factors you mentioned, I agree with you that maybe it's better to accept that incremental cleanup has limitations and from an API perspective ensure that callers don't fall into a state where they cleanup files that are still referenced (as @rdblue suggested).

would you be open to changing the approach to just ensure that if expireSnapshotId is called and incrementalCleanup is set to true we fail before triggering the file cleanup? Also if incremental cleanup isn't specified I think incrementalCleanup should be set to false in the case a specific snapshotID is specified in expireSnapshotId. That achieves the safety guarantees I think we want.

Sure, I'd like to make this change, and modify the test cases accordingly. Thanks for all your explanations and suggestions @amogh-jahagirdar .

Aug 29 '24 06:08 hantangwangd

Following the discussions above, I changed the approach to just ensure that users will not fall into a state where they cleanup files that are still referenced:

If expireSnapshotId is called and incrementalCleanup is set explicitly to true, we fail before triggering the file cleanup.
If expireSnapshotId is called and incrementalCleanup isn't specified, we set incrementalCleanup to false so that reachable cleanup will be used.

Please take a look when convenient, thanks a lot! @rdblue @amogh-jahagirdar @RussellSpitzer

Aug 29 '24 12:08 hantangwangd