Core: Fix the behavior of IncrementalFileCleanup when expire a snapshot
This PR fix the behavior of IncrementalFileCleanup, which will lead in useful data files being mistakenly cleaned up when expire a specified snapshot.
Fix issue: #10982
So when I wrote the Spark procedure for this we were already aware that this code path has a lot of potential issues. We end up basically completely rewriting the logic of detecting what is used and what is not used here which has made it a bit difficult to keep up to date. I wonder if now is the time to just re-implement it to work more like the Spark Action and just take the diff between reachable before and after.
That said I'll try to take a look at see if the fixes make sense.
@amogh-jahagirdar, @hantangwangd, I'm not sure that incremental cleanup is doing anything wrong here. Incremental cleanup deletes data files when the snapshot that removed them from the table is expired. If you want something safer, then the reachable strategy should be used.
I think there's a fair argument that you should not be able to use the API to so easily get in this state, but rather than changing how incremental cleanup works, I would recommend making it so the API doesn't allow expiring a specific snapshot by ID.
@hantangwangd I thought about this a bit more and while I still think updating the IncrementalFileCleanup to address this particular case is possible, I think the question more so is "Should we?".
The main problem we want to address is that the ExpireSnapshots implementation should ensure that a user doesn't end up in a state where they cleanup files that are still referenced. Looking at what's going on in Presto we're actually casting to RemoveSnapshots implementation and the implementation exposes withIncrementalCleanup.
The reason for exposing that in the implementation (and not the interface) was when we were doing branching/tagging we wanted to verify in tests that both cleanup strategies work as expected so it was easy to parameterize around that API. Ideally we wouldn't need to expose such a method and in the implementation intelligently choose the right strategy, but the validation from tests is really important so the method on the implementation class ended up being exposed.
So considering withIncrementalCleanup is exposed to someone who casts to RemoveSnapshots then we can either choose to complicate the Incremental cleanup with more logic to handle this specific case or we can accept that incremental cleanup has these limitations and from an API perspective ensure that callers don't shoot themselves in the foot and removing files which shouldn't be removed by triggering incremental cleanup when there's an arbitrary snapshot that's removed (what @rdblue suggested).
The incremental cleanup logic is already quite complex and I'm now thinking it's not really worth it to add handling this particular case. There's probably more cases and then if we were to keep adding handling for that at some point the implementation will look like the reachable cleanup anyways.
@hantangwangd would you be open to changing the approach to just ensure that if expireSnapshotId is called and incrementalCleanup is set to true we fail before triggering the file cleanup? Also if incremental cleanup isn't specified I think incrementalCleanup should be set to false in the case a specific snapshotID is specified in expireSnapshotId. That achieves the safety guarantees I think we want.
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/RemoveSnapshots.java#L323
Thank you very much for your guidance @rdblue, it makes sense to avoid users falling into such a state in advance.
The incremental cleanup logic is already quite complex and I'm now thinking it's not really worth it to add handling this particular case. There's probably more cases and then if we were to keep adding handling for that at some point the implementation will look like the reachable cleanup anyways.
Yes, it is indeed the case, there's probably more cases. Considering the factors you mentioned, I agree with you that maybe it's better to accept that incremental cleanup has limitations and from an API perspective ensure that callers don't fall into a state where they cleanup files that are still referenced (as @rdblue suggested).
would you be open to changing the approach to just ensure that if expireSnapshotId is called and incrementalCleanup is set to true we fail before triggering the file cleanup? Also if incremental cleanup isn't specified I think incrementalCleanup should be set to false in the case a specific snapshotID is specified in expireSnapshotId. That achieves the safety guarantees I think we want.
Sure, I'd like to make this change, and modify the test cases accordingly. Thanks for all your explanations and suggestions @amogh-jahagirdar .
Following the discussions above, I changed the approach to just ensure that users will not fall into a state where they cleanup files that are still referenced:
- If
expireSnapshotIdis called andincrementalCleanupis set explicitly totrue, we fail before triggering the file cleanup. - If
expireSnapshotIdis called andincrementalCleanupisn't specified, we setincrementalCleanuptofalseso that reachable cleanup will be used.
Please take a look when convenient, thanks a lot! @rdblue @amogh-jahagirdar @RussellSpitzer