
[Feature Request] Improve VACUUM performance

Open mrk-its opened this issue 3 years ago • 1 comment

Feature request

Overview

The current VACUUM implementation is sometimes very inefficient / slow, for a few reasons:

  • The first phase of VACUUM lists all files. The listing is done in parallel, but concurrency is limited by the number of top-level partitions in the dataset, so if a dataset has only 2 top-level partitions, only two parallel Spark jobs list all of its files.
  • Collecting the list of all files in the dataset is implemented with LogStore.listFrom, which is called recursively for each directory. For datasets with a huge number of small partitions this leads to a huge number of listFrom calls. That is slow, and on some storages it also increases cost (for example, on S3 with the multi-cluster setup using S3DynamoDBLogStore, every listFrom call also issues a DynamoDB request). A simplified sketch of this per-directory pattern follows below.
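
For illustration only (not the actual Delta code), the per-directory pattern looks roughly like this; the number of list requests grows with the number of directories rather than with the number of files:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// One list request per directory, recursing into each subdirectory.
def listRecursively(fs: FileSystem, dir: Path): Seq[Path] =
  fs.listStatus(dir).toSeq.flatMap { status =>
    if (status.isDirectory) listRecursively(fs, status.getPath)
    else Seq(status.getPath)
  }
```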

Motivation

The current VACUUM implementation is sometimes simply not usable (a vacuum job may take days to complete).

Further details

Possible improvements:

  • for a small number of top-level partitions, we could descend further into subdirectories until enough directory entries are found to parallelize over.
  • directly use org.apache.hadoop.fs.FileSystem.listFiles(path, recursive=true) instead of LogStore.listFrom (see the sketch below).
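
A minimal sketch of the second idea, assuming the standard Hadoop FileSystem API (the driver-side loop is only for illustration; a real implementation would distribute the listing, e.g. one such call per top-level directory):

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}

// A single recursive listing call; object stores such as S3A can serve this
// with paged flat LIST requests instead of one request per subdirectory.
def listAllFiles(fs: FileSystem, root: Path): Seq[LocatedFileStatus] = {
  val it: RemoteIterator[LocatedFileStatus] = fs.listFiles(root, true /* recursive */)
  val files = ArrayBuffer.empty[LocatedFileStatus]
  while (it.hasNext) files += it.next()
  files.toSeq
}
```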

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • [ ] Yes. I can contribute this feature independently.
  • [ ] Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • [ ] No. I cannot contribute this feature at this time.

mrk-its avatar Jul 03 '22 14:07 mrk-its

Hi @mrk-its - are you interested in proposing a more detailed fix? Perhaps a design doc, and we can give you some feedback and guidance?

scottsand-db avatar Aug 25 '22 23:08 scottsand-db

Can we identify the files that need to be physically removed based on the deleted/removed file entries in the delta log .json files? Once those files are identified, the delete operation can then be applied. I believe that with this strategy both the file-identification and deletion steps could safely run in parallel.
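
Roughly what I have in mind, as a very rough sketch (I'm assuming each commit .json carries a top-level "remove" action with a "path" field; the table path below is hypothetical, and a real implementation would also have to honor the retention period):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("vacuum-sketch").getOrCreate()
import spark.implicits._

// Sketch only: the "remove" action in each commit JSON carries the path of a
// file that was logically deleted; those are candidates for physical deletion.
val removed = spark.read
  .json("/path/to/table/_delta_log/*.json") // hypothetical table path
  .where("remove IS NOT NULL")
  .select($"remove.path".as[String])

// The deletes themselves could then run in parallel on the executors.
removed.foreachPartition { (paths: Iterator[String]) =>
  paths.foreach(p => println(s"would delete: $p")) // replace with a real FileSystem delete
}
```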

P.S. I do not know the exact specifics of the current implementation

UsmanYasin avatar Sep 27 '22 15:09 UsmanYasin

Can we identify the files that need to be physically removed based on the deleted/removed file entries in the delta log .json files?

Files written by a failed job will not be tracked anywhere, so we still have to list the files.

zsxwing avatar Sep 27 '22 17:09 zsxwing