Implement vacuum command

Open xianwill opened this issue 3 years ago • 4 comments

delta-rs should have a "vacuum table" utility analogous to the one provided by the open source Spark Delta Lake implementation. This utility cleans up old files that are no longer referenced by the delta log (e.g. files rewritten by merge statements, the optimize command, etc.).

See the VacuumCommand in the open source implementation for reference.
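
To illustrate what "no longer referenced by the delta log" means, here is a minimal sketch that replays a list of log actions and separates live files from removed ones. The `add` and `remove` action names follow the Delta transaction log format; the function name `replay_actions` and the sample paths are hypothetical, and this is not how delta-rs or the Spark VacuumCommand is actually implemented.

```python
def replay_actions(actions):
    """Replay Delta-style log actions in commit order.

    Returns (active, tombstones): paths still referenced by the table,
    and paths that were removed and are therefore vacuum candidates.
    """
    active, tombstones = set(), set()
    for action in actions:
        if "add" in action:
            path = action["add"]["path"]
            active.add(path)
            tombstones.discard(path)  # a re-added file is live again
        elif "remove" in action:
            path = action["remove"]["path"]
            active.discard(path)
            tombstones.add(path)
    return active, tombstones


# Hypothetical log: a rewrite removes part-0 after part-1 was added.
log = [
    {"add": {"path": "part-0.parquet"}},
    {"add": {"path": "part-1.parquet"}},
    {"remove": {"path": "part-0.parquet"}},
]
active, tombstones = replay_actions(log)
print(sorted(active), sorted(tombstones))
```

A real vacuum would additionally apply a retention period before deleting tombstoned files; that step is left out of this sketch.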

xianwill avatar Mar 01 '21 14:03 xianwill

@fvaleye I think this is actually done right? I'm not clear what work we have left to do

rtyler avatar May 17 '21 02:05 rtyler

Yes, it is already implemented! Hmm, we still need to improve the test suite: https://github.com/delta-io/delta-rs/issues/227

fvaleye avatar May 17 '21 07:05 fvaleye

How about fixing the documentation? Perhaps it should be updated from this image to this one? image

MironAtHome avatar Nov 03 '21 11:11 MironAtHome

@rtyler @fvaleye It looks like there are still two serious issues with the vacuum implementation:

  • vacuum lists all files in the dataset using StorageBackend.list_objs. The problem is that this function returns all files (including those in subdirectories) on the s3 and gcs backends (although I'm not sure about gcs), while on the file and azure backends it lists only first-level files (without recursing into subdirectories).
  • vacuum completely ignores files that are not referenced by the delta log at all (i.e. files that appear in neither the DeltaTableState.files() nor the DeltaTableState.all_tombstones() list).
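
The two points above can be sketched with the local filesystem standing in for a storage backend. The helper names (`list_first_level`, `list_recursive`, `vacuum_candidates`) are hypothetical and do not correspond to the delta-rs StorageBackend API. Treating every non-active file outside `_delta_log/` as a candidate is likewise an assumption made for illustration, not a description of how delta-rs resolved the issue.

```python
import os
import tempfile


def list_first_level(root):
    """List only files directly under root (no recursion)."""
    return sorted(
        e for e in os.listdir(root) if os.path.isfile(os.path.join(root, e))
    )


def list_recursive(root):
    """List every file under root, as '/'-separated relative paths."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


def vacuum_candidates(all_files, active):
    """Files that are not active, including ones the log never mentioned.

    Assumption: the log directory itself is never a candidate.
    """
    return sorted(
        f for f in all_files
        if f not in active and not f.startswith("_delta_log/")
    )


with tempfile.TemporaryDirectory() as root:
    # Hypothetical partitioned table with one orphaned file.
    for rel in ["a.parquet", "year=2022/b.parquet",
                "year=2022/orphan.parquet", "_delta_log/0.json"]:
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    shallow = list_first_level(root)   # misses everything in year=2022/
    deep = list_recursive(root)        # sees the partition directory too
    candidates = vacuum_candidates(
        deep, active={"a.parquet", "year=2022/b.parquet"}
    )
    print(shallow, candidates)
```

With a first-level listing, the orphaned file in the partition directory is never seen, so it can never be vacuumed. The recursive listing combined with a set difference against the active files picks it up even though the log never referenced it.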

mrk-its avatar Jul 03 '22 21:07 mrk-its

Resolved by #669.

wjones127 avatar Sep 28 '22 02:09 wjones127