Hard-delete objects that were never committed
User story
In general, objects should be removed from the underlying storage when there are no valid pointers from lakeFS to them. In particular, objects that were removed from the staging area without ever being committed should be hard-deleted from the underlying storage.
Requirements
In case an object exists only in the staging area of a branch (i.e., it does not belong to any commit), hard-delete it immediately once no lakeFS entry uses it as a physical address. This may happen when:
- The object is deleted in lakeFS.
- The object is overwritten in lakeFS.
- All uncommitted changes on the branch are reverted.
- The branch is deleted.
- The repository is deleted.
Important: the implementation should take into account that lakeFS sometimes uses the same object for multiple entries (a.k.a. deduplication); the sketch below illustrates the resulting safety check.
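To make the condition concrete, here is a minimal Go sketch of the hard-delete check. The `Entry` model and field names are illustrative assumptions, not lakeFS's actual catalog schema: an object is considered safe to hard-delete only if it was never committed and no other live entry shares its physical address.

```go
package main

import "fmt"

// Entry is a simplified, hypothetical view of a lakeFS entry;
// the field names are illustrative, not the real catalog model.
type Entry struct {
	Path            string
	PhysicalAddress string
	Committed       bool
}

// safeToHardDelete reports whether the object backing `removed` can be
// hard-deleted: it was never committed, and no other live entry (staged
// or committed) points at the same physical address (deduplication).
func safeToHardDelete(removed Entry, live []Entry) bool {
	if removed.Committed {
		return false // committed data is the existing GC's job
	}
	for _, e := range live {
		if e.PhysicalAddress == removed.PhysicalAddress {
			return false // deduplicated: another entry still uses this object
		}
	}
	return true
}

func main() {
	live := []Entry{{Path: "b", PhysicalAddress: "data/abc", Committed: false}}
	removed := Entry{Path: "a", PhysicalAddress: "data/abc", Committed: false}
	fmt.Println(safeToHardDelete(removed, live)) // false: address still referenced
}
```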
Proposal
The requirements can be met by the following:
- When removing staged data (by delete or by revert), delete the data from the underlying storage as well (*this is safe only if deduplication applies to committed data alone). See the sketch after this list.
- When removing a branch, delete all of its staged data from the underlying storage.
- TBD - handle objects that were overwritten within staging.
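A minimal sketch of the first bullet's flow, under the stated assumption that deduplication never shares an address between staged entries. `StagingStore`, `BlockAdapter`, and `deleteStaged` are hypothetical names for illustration, not the actual lakeFS API:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical interfaces standing in for lakeFS's staging manager and
// block adapter; names and signatures are illustrative only.
type StagingStore interface {
	// Drop removes the staged entry at path and returns the physical
	// address it pointed to ("" if the path was not staged).
	Drop(branch, path string) (string, error)
}

type BlockAdapter interface {
	Remove(physicalAddress string) error
}

// deleteStaged drops the entry from staging, then hard-deletes its
// backing object from the underlying storage. Only correct under the
// assumption that staged data is never deduplicated.
func deleteStaged(s StagingStore, b BlockAdapter, branch, path string) error {
	addr, err := s.Drop(branch, path)
	if err != nil || addr == "" {
		return err
	}
	return b.Remove(addr)
}

// In-memory fakes for demonstration.
type memStaging map[string]string

func (m memStaging) Drop(branch, path string) (string, error) {
	key := branch + "/" + path
	addr, ok := m[key]
	if !ok {
		return "", nil
	}
	delete(m, key)
	return addr, nil
}

type memBlocks map[string]bool

func (m memBlocks) Remove(addr string) error {
	if !m[addr] {
		return errors.New("object not found: " + addr)
	}
	delete(m, addr)
	return nil
}

func main() {
	staging := memStaging{"main/logs/a.parquet": "data/abc123"}
	blocks := memBlocks{"data/abc123": true}
	if err := deleteStaged(staging, blocks, "main", "logs/a.parquet"); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println("remaining objects:", len(blocks)) // 0: hard-deleted
}
```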
Summary of meeting (@nopcoder & @arielshaqed):
Collect in...

| Online: (enhanced) lakeFS+ | Batch: (enhanced) GC |
|---|---|
| Delete everything **lakeFS knows about**. | Delete everything under a storage namespace that is not referenced. |
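For the batch column, the core computation is a set difference: list everything under the storage namespace, subtract every physical address lakeFS references (committed entries plus current staging), and what remains is garbage. At real scale this would run as a Spark job over huge listings; the Go sketch below, with a hypothetical `uncommittedGarbage` over in-memory slices, only illustrates the logic:

```go
package main

import "fmt"

// uncommittedGarbage returns objects under the storage namespace that
// no lakeFS entry references. The inputs stand in for an object-store
// listing and a catalog dump; both are hypothetical simplifications.
func uncommittedGarbage(namespaceObjects, referenced []string) []string {
	ref := make(map[string]struct{}, len(referenced))
	for _, a := range referenced {
		ref[a] = struct{}{}
	}
	var garbage []string
	for _, obj := range namespaceObjects {
		if _, ok := ref[obj]; !ok {
			garbage = append(garbage, obj)
		}
	}
	return garbage
}

func main() {
	ns := []string{"data/a", "data/b", "data/c"}
	refs := []string{"data/a", "data/c"}
	fmt.Println(uncommittedGarbage(ns, refs)) // [data/b]
}
```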
Decision
Overwritten and deleted objects on staging areas are a lakeFS (versioning) issue that is probably easiest to handle in the GC process. We assume that it is OK to handle object removal in GC only: users who care about compliance need GC anyway; users who care about storage will not have that many files overwritten on staging (and if they do, we can add a "best-effort" patch to lakeFS that, on overwrite or delete, tries to delete the old file if it is in the current staging area of the storage namespace; a sketch follows).
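A sketch of what such a best-effort patch could look like, assuming a hypothetical `remove` callback standing in for the block adapter's delete. Failures are deliberately only logged, since GC remains the authoritative cleanup path:

```go
package main

import (
	"log"
	"strings"
)

// bestEffortDelete tries to remove the previous object on overwrite or
// delete, but only if it lives under this repository's storage
// namespace (i.e. it was written through lakeFS, not imported).
// Errors are logged and swallowed: GC is still the source of truth.
func bestEffortDelete(oldAddr, storageNamespace string, remove func(string) error) {
	if oldAddr == "" || !strings.HasPrefix(oldAddr, storageNamespace) {
		return
	}
	if err := remove(oldAddr); err != nil {
		log.Printf("best-effort delete of %s failed (GC will handle it): %v", oldAddr, err)
	}
}

func main() {
	bestEffortDelete("s3://bucket/repo/data/old", "s3://bucket/repo/", func(a string) error {
		log.Printf("deleting %s", a)
		return nil
	})
}
```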
So this is a joint @treeverse/versioning + @treeverse/ecosystem effort. Versioning should lead (and design): most unknowns are on the versioning side, and it probably needs to be designed for the KV world. Eco can supply knowledge of how to parallelize lakeFS access without overloading it, how to use Spark to compute huge diffs in the physical-address world, and parallel bulk-delete code. The implementation itself might be in the same GC job or in a different job.
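For the parallel bulk-delete piece, a bounded worker pool is one way to avoid overloading the object store. A hedged Go sketch, where `deleteObject` is a hypothetical stand-in for the real storage call:

```go
package main

import (
	"fmt"
	"sync"
)

// bulkDelete fans a list of physical addresses out to a bounded worker
// pool so deletes run in parallel without flooding the object store.
func bulkDelete(addrs []string, workers int, deleteObject func(string) error) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for addr := range jobs {
				if err := deleteObject(addr); err != nil {
					fmt.Printf("delete %s: %v\n", addr, err)
				}
			}
		}()
	}
	for _, a := range addrs {
		jobs <- a
	}
	close(jobs)
	wg.Wait()
}

func main() {
	addrs := []string{"data/a", "data/b", "data/c"}
	bulkDelete(addrs, 2, func(a string) error { fmt.Println("deleted", a); return nil })
}
```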
Why S3 Lifecycle policies won't work
- Non-AWS S3 backends might not even support lifecycle rules (although MinIO does have lifecycle management).
- Separate implementations would be needed for S3, Azure (which might split into Blob Storage and Data Lake!), and GCS.
- Tags and deletes are not transactional, making it very hard for code to guarantee that any dropped file will actually be deleted.
Adding my 2 cents. lakeFS gets criticized by users throughout the funnel for not deleting uncommitted objects, even when they are just evaluating it. I guess they are more sensitive to uncommitted garbage than to committed garbage. We know that most users don't bother to configure lakeFS GC, since it's not that trivial for everyone. Given the way Spark works, I understand users' frustration with 3x the data being stored for Spark jobs. While I understand the limitations of an online-only solution, I do think we need an online way to prevent uncommitted garbage, even if it's not hermetic and requires users to run GC for 100% cleanup.
Any news about when this issue will be fixed? I am very interested in this one.
Hey @isabela-angelo, we started the design for this one. It's not trivial, as a proper cleanup will require the user to run an offline Spark GC job with information about the storage from the lakeFS server. We are trying to reduce the friction of running an external job as much as possible. I estimate it will be ready by Q4.
Nice! Thanks for the reply
@johnnyaug, a requirements question: do we want to clear uncommitted garbage on every GC job run? I mean, should this happen by default or only by user configuration? The first makes more sense to me, but I will be happy to get your input.