
Support Garbage Collection on local storage

Open praegustator opened this issue 1 year ago • 1 comments

As mentioned in the docs, Garbage Collection is currently supported only on S3 and Azure and will at some point be available for GCP. However, I did not find any plans regarding GC on local storage.

Are there plans for it?

praegustator • Aug 10 '22 15:08

Hi, @praegustator!

You are absolutely correct, there are no current plans to support GC on local storage. We do not believe in local storage for production use-cases (and lakeFS warns you on startup with a configured local backing store).

I would be very happy to hear more about your use-case! Please reach out on our Slack if you're willing.

I'd like to explain why we don't really support local storage in production. lakeFS gives Git-like semantics to object stores. By their nature, local stores are file stores, not object stores; this blog covers some of the differences. Object stores offer very different semantics from file stores: they are efficient for large files (10MiB) and inefficient for small files, they are efficient for "recursive" listings and inefficient for "directory" listings, etc. In return, object stores are cheap for massive data, allow for efficient backup, and work well for distributed systems. Meanwhile, large file stores are typically (much) more expensive and don't work well for distributed systems.

The current implementation of garbage collection is a case in point: it is a Spark application, and just running Spark on a file store will be hard, especially if the file store is large.

To be perfectly clear: you could put e.g. MinIO (or Ceph) in front of a large file store to give it S3 semantics, and use that as a backing store for lakeFS. Tuning MinIO to efficiently support a large object store should be possible; I'd reach out to min.io for details on how to do that.
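As a rough illustration of the MinIO suggestion above, here is a minimal Python sketch that builds the environment variables for pointing lakeFS at an S3-compatible endpoint. The helper name `lakefs_minio_env` is hypothetical, and the variable names follow the usual `LAKEFS_` prefix mapping of the `blockstore.s3.*` configuration keys as I understand them; treat them as assumptions and check them against the lakeFS configuration reference for your version.

```python
from typing import Dict


def lakefs_minio_env(endpoint: str, access_key: str, secret_key: str) -> Dict[str, str]:
    """Build lakeFS environment variables for an S3-compatible backing store.

    Hypothetical helper: the variable names are assumed from the lakeFS
    `blockstore.s3.*` configuration keys and should be verified against
    the documentation before use.
    """
    if not endpoint.startswith(("http://", "https://")):
        raise ValueError("endpoint must include the http:// or https:// scheme")
    return {
        # Use the S3 adapter even though the data ends up on local disks.
        "LAKEFS_BLOCKSTORE_TYPE": "s3",
        # Point the adapter at MinIO instead of AWS.
        "LAKEFS_BLOCKSTORE_S3_ENDPOINT": endpoint,
        # MinIO is normally addressed as endpoint/bucket/key (path style).
        "LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE": "true",
        "LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID": access_key,
        "LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY": secret_key,
    }


if __name__ == "__main__":
    # Placeholder endpoint and credentials, for illustration only.
    env = lakefs_minio_env("http://localhost:9000", "example-key", "example-secret")
    for name, value in env.items():
        print(f"export {name}={value}")
```

With a setup along these lines, lakeFS talks to the file store only through MinIO's S3 API, so anything that expects S3 semantics sees an object store rather than a local directory.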

arielshaqed • Aug 10 '22 15:08

Hi @praegustator, we'll close the issue for now as we don't see a specific use case for GC on a local file system. If you still want to use GC locally, you can put MinIO or Ceph in front of it, as @arielshaqed suggested.

Feel free to reopen it if you still think that it's necessary. Thank you!

Jonathan-Rosenberg • Oct 02 '22 14:10