delta-rs
Parallel Vacuum command
Description
Currently, the vacuum command deletes files one by one, which is very slow on object stores such as S3, especially with hundreds of thousands of files. I had a case (with Databricks/Spark) with more than 8 million stale files, where vacuum took days even with parallel calls (using spark.databricks.delta.vacuum.parallelDelete.enabled I got around 80 deletes/second).
The delete calls could be parallelized (e.g. 100/1,000/10,000 concurrent deletes) to speed up processing.
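As a rough sketch of the idea (not the actual delta-rs storage API), the `futures` crate's `buffer_unordered` combinator can cap the number of in-flight delete requests; here the `object_store` crate's `ObjectStore::delete` stands in for whatever backend vacuum ends up using, and `delete_concurrently`/`concurrency` are made-up names:

```rust
// Sketch only: bounded-concurrency deletes. `delete_concurrently` and its
// arguments are hypothetical; the real vacuum code may use a different
// storage abstraction.
use futures::{stream, StreamExt, TryStreamExt};
use object_store::{path::Path, ObjectStore};

/// Delete `paths`, keeping at most `concurrency` requests in flight.
async fn delete_concurrently(
    store: &dyn ObjectStore,
    paths: Vec<Path>,
    concurrency: usize, // e.g. 100 / 1_000 / 10_000
) -> object_store::Result<()> {
    stream::iter(paths)
        .map(|path| async move { store.delete(&path).await })
        .buffer_unordered(concurrency) // cap on concurrent delete calls
        .try_for_each(|_| async { Ok(()) })
        .await
}
```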
Use Case
More performant vacuum.
Related Issue(s)
If only there were a Rust distributed batch compute framework we could leverage here ;)
On a serious note, I fully agree with you that we should parallelize the delete calls. We should easily be able to issue a couple thousand concurrent calls from a single Rust process with async, which should bring the vacuum time for 8M items down to under an hour.
It might also be wise, or at least a good start, to make better use of the storage APIs, e.g. the S3 multi-object delete call, which deletes up to 1000 objects in one request: https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html
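For illustration, a hedged sketch of what chunked multi-object deletes could look like with the `rusoto_s3` crate (the S3 client delta-rs used at the time); `batch_delete`, the bucket name, and the key list are placeholders, not the actual implementation:

```rust
// Sketch only: chunked S3 multi-object deletes (max 1000 keys per request).
// `batch_delete` is a hypothetical helper, not delta-rs code.
use rusoto_s3::{
    Delete, DeleteObjectsError, DeleteObjectsRequest, ObjectIdentifier, S3Client, S3,
};

async fn batch_delete(
    client: &S3Client,
    bucket: &str,
    keys: Vec<String>,
) -> Result<(), rusoto_core::RusotoError<DeleteObjectsError>> {
    // DeleteObjects accepts at most 1000 keys per call, so chunk the key list.
    for chunk in keys.chunks(1000) {
        let objects = chunk
            .iter()
            .map(|key| ObjectIdentifier {
                key: key.clone(),
                ..Default::default()
            })
            .collect();
        let request = DeleteObjectsRequest {
            bucket: bucket.to_string(),
            delete: Delete {
                objects,
                quiet: Some(true), // only failed deletes are reported back
            },
            ..Default::default()
        };
        client.delete_objects(request).await?;
    }
    Ok(())
}
```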
Good call @Dandandan, filed https://github.com/delta-io/delta-rs/issues/394 for the batch delete support. With batch deletes, we should be able to get it down from an hour to minutes!