
POC: Garbage collection on Parquet files

Open johnnyaug opened this issue 3 years ago • 0 comments

Resolves: #4157.

This PR demonstrates how we can use the lakeFS metadata client to create a Parquet table of a repository's ranges. We hope that running GC over Parquet files will yield a significant performance boost. It may even make some of our optimizations redundant, turning GC into a simple anti-join (see the deleted code in the changes).
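To make the anti-join idea concrete, here is a minimal Spark sketch. It is not the code in this PR: the `address` and `range_id` column names, the two input DataFrames, and the `expiredAddresses` helper are all assumptions made for illustration, and retention rules are not modeled.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GcAntiJoinSketch {
  // Hypothetical helper: addresses that appear in some range of the repository
  // but in none of the ranges still reachable from retained commits.
  // Assumes both inputs carry an "address" column.
  def expiredAddresses(allEntries: DataFrame, activeEntries: DataFrame): DataFrame =
    allEntries
      .select("address")
      .distinct()
      .join(activeEntries.select("address").distinct(), Seq("address"), "left_anti")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("gc-anti-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data standing in for the Parquet table of ranges.
    val allEntries = Seq(
      ("r1", "data/a"),
      ("r1", "data/b"),
      ("r2", "data/c")
    ).toDF("range_id", "address")
    val activeEntries = Seq(("r1", "data/a")).toDF("range_id", "address")

    expiredAddresses(allEntries, activeEntries).show()
    spark.stop()
  }
}
```

With the table in Parquet, Spark reads only the columns the join needs, which is where the hoped-for speedup would come from.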

Notes

  1. In RepositoryConverter.scala we write only ranges that were not already added to the Parquet directory. This is done with a join on the range_id column, which will probably not be efficient. We can solve this by either:
    • (easy) enhance the metadata client to allow excluding ranges through configuration.
    • (harder, more correct) implement a LakeFSFileFormat. It will be used to push down the range filter to the InputFormat (see how Parquet does it). To clarify, this is not a huge effort, but it's still bigger than the easy way.
  2. Schema evolution: if we add fields to this Parquet table, we need to re-copy everything. That is, we need some kind of primitive schema versioning.
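The two notes above can be sketched together as follows. This is an illustration under stated assumptions, not the contents of RepositoryConverter.scala: it reads note 1 as skipping ranges already present in the Parquet directory, and it handles note 2 by putting a hypothetical version tag in the output path. `SchemaVersion`, `versionedPath`, `newRanges`, and `appendNewRanges` are invented names.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object IncrementalRangeWriteSketch {
  // Hypothetical version tag: bump it whenever fields are added to the table,
  // so that a schema change triggers a full re-copy into a fresh directory.
  val SchemaVersion = "v1"

  // Output directory for the current schema version.
  def versionedPath(baseDir: String): String =
    s"${baseDir.stripSuffix("/")}/schema_$SchemaVersion"

  // Keep only rows whose range_id is not yet present in the existing table.
  // This is the join on range_id from note 1, written as a left anti-join.
  def newRanges(repoEntries: DataFrame, existing: DataFrame): DataFrame =
    repoEntries.join(existing.select("range_id").distinct(), Seq("range_id"), "left_anti")

  // Append the rows of the missing ranges to the versioned Parquet directory.
  def appendNewRanges(repoEntries: DataFrame, existing: DataFrame, baseDir: String): Unit =
    newRanges(repoEntries, existing)
      .write
      .mode(SaveMode.Append)
      .parquet(versionedPath(baseDir))
}
```

The anti-join here still reads every range from the repository before discarding the known ones; both options in note 1 aim to avoid that read in the first place.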

johnnyaug, Sep 13 '22 16:09