
Hard-delete objects that were never committed

johnnyaug opened this issue 3 years ago · 6 comments

User story

In general, objects should be removed from the storage when there are no valid pointers from lakeFS to them. In particular, objects that were removed from the staging area without ever being committed should be hard-deleted from the underlying storage.

Requirements

If an object exists only in the staging area of a branch (i.e., it does not belong to any commit), hard-delete it immediately once no lakeFS entry uses it as a physical address. This may happen when:

  • The object is deleted in lakeFS.
  • The object is overwritten in lakeFS.
  • All uncommitted changes on the branch are reverted.
  • The branch is deleted.
  • The repository is deleted.

Important: the implementation should take into account that lakeFS sometimes uses the same object for multiple entries (AKA deduplication).
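
To make this concrete, here is a minimal sketch of a dedup-aware hard-delete check. All type and method names here (`Catalog`, `BlockAdapter`, `IsCommitted`, `CountStagedEntries`) are hypothetical and do not reflect the actual lakeFS codebase; the point is only that deletion must be gated on both "never committed" and "no other entry shares this physical address":

```go
package uncommittedgc

import "context"

// Entry is a hypothetical staged-entry record; the real lakeFS catalog differs.
type Entry struct {
	PhysicalAddress string
}

// Catalog is a hypothetical slice of the lakeFS catalog API.
type Catalog interface {
	// IsCommitted reports whether any commit references this physical address.
	IsCommitted(ctx context.Context, physicalAddress string) (bool, error)
	// CountStagedEntries counts staged entries (on any branch) that still point
	// at this physical address, after the entry being removed has already been
	// dropped from staging.
	CountStagedEntries(ctx context.Context, physicalAddress string) (int, error)
}

// BlockAdapter is a hypothetical handle to the underlying object store.
type BlockAdapter interface {
	Remove(ctx context.Context, physicalAddress string) error
}

// maybeHardDelete removes the backing object of an entry that was just dropped
// from staging, but only if the object was never committed and no other entry
// still shares the same physical address (deduplication may back several
// entries with a single object).
func maybeHardDelete(ctx context.Context, cat Catalog, store BlockAdapter, e Entry) error {
	committed, err := cat.IsCommitted(ctx, e.PhysicalAddress)
	if err != nil || committed {
		return err // never hard-delete data reachable from a commit
	}
	refs, err := cat.CountStagedEntries(ctx, e.PhysicalAddress)
	if err != nil || refs > 0 {
		return err // still referenced by another staged entry
	}
	return store.Remove(ctx, e.PhysicalAddress)
}
```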

Proposal

These requirements can be met as follows:

  • When removing staged data (by delete or by revert), delete the data from the underlying storage as well (this is possible only if deduplication works only on committed data).
  • When removing a branch, delete all of its staged data from the underlying storage (see the sketch after this list).
  • TBD: handle objects that were overwritten within staging.
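
Building on the hypothetical interfaces from the previous sketch (and in the same package), the branch-deletion bullet could look roughly like this; `ListStagedEntries` and `DeleteBranch` are made-up names, not the real lakeFS API:

```go
// BranchCatalog is a hypothetical extension of the Catalog interface above.
type BranchCatalog interface {
	Catalog
	ListStagedEntries(ctx context.Context, repo, branch string) ([]Entry, error)
	DeleteBranch(ctx context.Context, repo, branch string) error
}

// deleteBranchWithStagedData hard-deletes the backing objects of a branch's
// staged-only entries before dropping the branch itself. Individual deletion
// failures are skipped rather than aborting the branch deletion; any leftover
// objects can still be collected by a later GC run.
func deleteBranchWithStagedData(ctx context.Context, cat BranchCatalog, store BlockAdapter, repo, branch string) error {
	entries, err := cat.ListStagedEntries(ctx, repo, branch)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if err := maybeHardDelete(ctx, cat, store, e); err != nil {
			continue // best effort: leftovers are picked up by GC later
		}
	}
	return cat.DeleteBranch(ctx, repo, branch)
}
```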

johnnyaug · May 10 '21 07:05

[Whiteboard photo: summary of lakeFS online versus batch "fsck" tracking of overwritten files in staging; see the table below for the expanded readable text.]

Summary of meeting (@nopcoder & @arielshaqed):

| Collect in... | Online: (enhanced) lakeFS | Batch: (enhanced) GC |
|---|---|---|
| | Delete everything **lakeFS knows about**. | Delete everything under a storage namespace that is not referenced (sketched below). |
| Pros | • Works for all users. | • Trivial upgrade<br>• Simpler to code |
| Cons | • Requires one-time upgrade (which will be a simple GC-like process!)<br>• Not fsck: cannot delete a file that lakeFS does not know about (for whatever reason).<br>• Slower (online!) put. | • Needs to understand special "directories" used by lakeFS _not_ for staging<br>• Must `ls -R` (or inventory). In future might have a smaller listing if we scan by all staging tokens, e.g. track "live" staging tokens (everything uncommitted + committed in past week).<br>• Requires Spark (or a separate non-Spark implementation)<br>• Slower. But note that if GC is enabled (and it will be whenever people care about forgetting the past and/or total space!), then uncollected lost files are a fixed proportion of all stored files!<br>• Still hits staging (requires new lakeFS API to find or work by staging tokens and also to do this efficiently) |
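
To make the batch column concrete: the core of that job is a set difference between everything under the storage namespace and everything lakeFS still references. A rough sketch follows (names are hypothetical; a real implementation would compute this with Spark over an object-store inventory and lakeFS metadata, not in memory):

```go
// collectUnreferenced returns the object keys found under a repository's
// storage namespace that lakeFS no longer references, i.e. the namespace
// listing minus the physical addresses reachable from commits or from live
// staging tokens. A production version would stream both inputs through Spark
// instead of holding them in memory.
func collectUnreferenced(namespaceListing, referencedAddresses []string) []string {
	referenced := make(map[string]struct{}, len(referencedAddresses))
	for _, addr := range referencedAddresses {
		referenced[addr] = struct{}{}
	}
	var unreferenced []string
	for _, key := range namespaceListing {
		if _, ok := referenced[key]; !ok {
			unreferenced = append(unreferenced, key)
		}
	}
	return unreferenced
}
```

The resulting keys would then be the candidates for bulk deletion by the job.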

Decision

Overwritten and deleted objects in staging areas are a lakeFS (versioning) issue that is probably easiest to handle in the GC process. We assume it is OK to handle object removal in GC only: users who care about compliance need GC anyway; users who care about storage will not have that many files overwritten in staging (and if they do, we can add a "best-effort" patch to lakeFS that tries to delete the old file on overwrite or delete, if it is still in the current staging area of the storage namespace).

So this is a joint @treeverse/versioning + @treeverse/ecosystem effort. Versioning should lead (and design): most unknowns are on the versioning side, and it probably needs to be designed for the KV world. Eco can supply knowledge of how to parallelize lakeFS access without overloading it, how to use Spark to compute huge diffs in the physical-address world, and parallel bulk-delete code. The implementation itself might live in the same GC job or in a different job.

Why S3 Lifecycle policies won't work

  • Non-AWS S3 backends might not even have lifecycle policies (although MinIO has lifecycle management)
  • Separate implementations would be needed for S3, Azure (which might split into blob store and data lake!), and GCS
  • Tags and deletes are not transactional, making it very hard for code to guarantee that any dropped file will be deleted.

arielshaqed · Jul 13 '22 09:07

Adding my 2 cents. lakeFS gets criticized by users throughout the funnel for not deleting uncommitted objects, even by those who are just evaluating it. I guess they are more sensitive to uncommitted garbage than to committed garbage. We know that most users don't bother to configure lakeFS GC, since it's not that trivial for everyone. Given the way Spark works, I understand users' frustration with 3x the data being stored for Spark jobs. While I understand the limitations of an online-only solution, I do think we need an online way to prevent uncommitted garbage, even if it's not hermetic and still requires users to run GC for a 100% cleanup.

itaiad200 · Jul 17 '22 10:07

Any news about when this issue will be fixed? I am very interested in this one

isabela-angelo · Aug 08 '22 15:08

Hey @isabela-angelo, we started the design for this one. It's not trivial, as a proper cleanup will require the user to run an offline Spark GC job fed with storage information from the lakeFS server. We're trying to reduce the friction of running an external job as much as possible. I estimate it will be ready by Q4.

itaiad200 · Aug 08 '22 16:08

Nice! Thanks for the reply

isabela-angelo · Aug 08 '22 17:08

@johnnyaug a requirements question: do we want to clear uncommitted garbage on every GC run? I mean, should this happen by default, or only by user configuration? The former makes more sense to me, but I'll be happy to get your input.

talSofer · Aug 11 '22 12:08