delta-rs icon indicating copy to clipboard operation
delta-rs copied to clipboard

Implement optimize command

Open xianwill opened this issue 4 years ago • 6 comments

Databricks delta lake provides an optimize command that is incredibly useful for compacting small files (especially for files written by streaming jobs which are inevitably small). delta-rs should provide a similar optimize command.

I think a first pass could ignore the bin-pack and zorder features provided by Databricks and simply combine small files into large files while also setting dataChange: false in the delta log entry. An MVP might look like:

  • Query the delta log for file stats relevant to the current optimize run (based on predicate)
  • Group files that may be compacted based on common partition membership and less than optimal size (1 GB)
  • Combine files by looping through each input file and writing its record batches into the new output file.
  • Commit a new delta log entry with a single add action and a separate remove action for each compacted file. All committed add and remove actions must set the dataChange flag as false to prevent re-processing by stream consumers.

Files no longer relavant to the log may be cleaned up later by vacuum (see https://github.com/delta-io/delta-rs/issues/97)

xianwill avatar Mar 01 '21 18:03 xianwill

While vacuum is a big step forward, it does not obviate need for optimize. These two commands have different purpose, and ( looking further out ) optimize will eventually become part of either batch or background worker, continuously operating with lower priority, as is common to all relational engines.

MironAtHome avatar Nov 03 '21 11:11 MironAtHome

@xianwill - The steps you've outlined sound good to me and think small file compaction via OPTIMIZE would be a great addition to this library. Also agree that the first pass should ignore z order indexing. Do you know if this work is planning on getting prioritized anytime soon?

MrPowers avatar Jun 01 '22 13:06 MrPowers

@MrPowers this is actively being worked on. Did you see #607?

wjones127 avatar Jun 01 '22 16:06 wjones127

@wjones127 - oh, wow, looks like amazing progress is being made, so exciting!

MrPowers avatar Jun 01 '22 17:06 MrPowers

@wjones127 - looks like #607 was merged!! Will it be relatively easy to expose this functionality via the Python bindings?

MrPowers avatar Jun 04 '22 17:06 MrPowers

@MrPowers Yeah shouldn't be too difficult. I created #622 to track that.

wjones127 avatar Jun 04 '22 18:06 wjones127

Resolved by #607.

wjones127 avatar Sep 28 '22 03:09 wjones127