POC: Garbage collection on Parquet files
Resolves: #4157.
This PR demonstrates how we can use the lakeFS metadata client to create a Parquet table of a repository's ranges. Hopefully, running GC over Parquet files will yield a significant performance boost. It may even make some of our existing optimizations redundant, turning GC into a simple anti-join (see the deleted code in the changes).
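As a rough illustration of what that anti-join could look like, here is a minimal Spark sketch. The paths and the `address` column name are placeholders, not the actual schema or layout produced by the metadata client:

```scala
import org.apache.spark.sql.SparkSession

object GCAntiJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gc-anti-join-poc").getOrCreate()

    // Addresses referenced by the repository's retained ranges, as exported
    // to Parquet by the metadata client (illustrative path and column name).
    val retained = spark.read.parquet("s3://bucket/_lakefs/retained_ranges/")

    // All object addresses currently present in the repository's storage namespace.
    val allObjects = spark.read.parquet("s3://bucket/_lakefs/all_objects/")

    // Anything not referenced by a retained range is a deletion candidate.
    val candidates = allObjects.join(retained, Seq("address"), "left_anti")
    candidates.write.parquet("s3://bucket/_lakefs/gc_candidates/")
  }
}
```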
Notes
- In RepositoryConverter.scala we write only newly added ranges, i.e. ranges not already present in the Parquet directory. This is done by joining on the range_id column (see the first sketch after this list). This join will probably not be efficient. We can solve this by either:
  - (easy) enhancing the metadata client to allow excluding ranges via configuration.
  - (harder, more correct) implementing a LakeFSFileFormat that pushes the range filter down to the InputFormat (see how Parquet does it). To clarify, this is not a huge effort, but it's still bigger than the easy way.
- Schema evolution: if we add fields to this Parquet table, we need to re-copy everything. That is, we need some kind of primitive schema versioning (one possible approach is sketched below).
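For the first note, this is roughly what the incremental write looks like with a plain left-anti join; the object, function, and variable names here are hypothetical, not the actual RepositoryConverter.scala code:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object IncrementalRangeWriteSketch {
  // Keep only ranges whose range_id is not already present in the Parquet
  // output, so each conversion run appends just the newly added ranges.
  def rangesToWrite(spark: SparkSession, allRanges: DataFrame, parquetDir: String): DataFrame = {
    val alreadyWritten = spark.read.parquet(parquetDir).select("range_id").distinct()
    // Left-anti join: simple to express, but it scans the existing output on
    // every run. Pushing the exclusion down (exclude-ranges configuration or
    // a LakeFSFileFormat) would avoid reading ranges we drop anyway.
    allRanges.join(alreadyWritten, Seq("range_id"), "left_anti")
  }
}
```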
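For the schema-evolution note, one primitive versioning scheme would be a marker file next to the Parquet data recording the schema version the writer produced; the name `_schema_version` and this whole mechanism are only a sketch of the idea, not part of this PR:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object SchemaVersionSketch {
  // Version of the schema this writer produces; bump it when fields are added.
  val WriterSchemaVersion = 1

  // Returns true when the existing Parquet directory was written with a
  // different schema version (or has no marker), i.e. a full re-copy is needed.
  def needsFullRecopy(spark: SparkSession, parquetDir: String): Boolean = {
    val marker = new Path(parquetDir, "_schema_version")
    val fs = marker.getFileSystem(spark.sparkContext.hadoopConfiguration)
    if (!fs.exists(marker)) true
    else {
      val in = fs.open(marker)
      try scala.io.Source.fromInputStream(in, "UTF-8").mkString.trim.toInt != WriterSchemaVersion
      finally in.close()
    }
  }
}
```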