delta-rs
Implement optimize command
Databricks Delta Lake provides an optimize command that is incredibly useful for compacting small files (especially files written by streaming jobs, which are inevitably small). delta-rs should provide a similar optimize command.
I think a first pass could ignore the bin-packing and Z-order features provided by Databricks and simply combine small files into larger files while also setting dataChange: false in the delta log entry. An MVP (see the rough sketch after this list) might look like:
- Query the delta log for file stats relevant to the current optimize run (based on a predicate)
- Group files that may be compacted based on common partition membership and less-than-optimal size (under 1 GB)
- Combine files by looping through each input file and writing its record batches into the new output file.
- Commit a new delta log entry with a single add action and a separate remove action for each compacted file. All committed add and remove actions must set the dataChange flag to false to prevent re-processing by stream consumers.
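To make the grouping and commit steps concrete, here is a minimal, self-contained sketch in Rust. Everything in it (`FileEntry`, `plan_compaction`, `commit_actions`, `TARGET_FILE_SIZE`) is a hypothetical stand-in, not the delta-rs API; a real implementation would read `add` actions from the table state and write real Parquet output.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an `add` entry read from the delta log;
/// the real delta-rs action types carry many more fields (stats, tags, etc.).
#[derive(Debug, Clone)]
struct FileEntry {
    path: String,
    /// Partition column name/value pairs, e.g. [("date", "2022-01-01")].
    partition_values: Vec<(String, String)>,
    size_bytes: u64,
}

/// Simplified log actions for illustration only.
#[derive(Debug)]
struct AddAction { path: String, size: u64, data_change: bool }
#[derive(Debug)]
struct RemoveAction { path: String, data_change: bool }

/// The "optimal" file size from the proposal: 1 GB.
const TARGET_FILE_SIZE: u64 = 1024 * 1024 * 1024;

/// Step 2 of the MVP: bucket undersized files by partition so each
/// bucket can be rewritten as a single larger file.
fn plan_compaction(files: &[FileEntry]) -> Vec<Vec<FileEntry>> {
    let mut by_partition: HashMap<Vec<(String, String)>, Vec<FileEntry>> = HashMap::new();
    for file in files {
        if file.size_bytes < TARGET_FILE_SIZE {
            by_partition
                .entry(file.partition_values.clone())
                .or_default()
                .push(file.clone());
        }
    }
    // A partition only benefits from compaction if it holds at least two small files.
    by_partition
        .into_values()
        .filter(|group| group.len() > 1)
        .collect()
}

/// Step 4 of the MVP: one add for the compacted file plus one remove per
/// input, all with data_change = false so stream consumers skip them.
fn commit_actions(group: &[FileEntry], new_path: String) -> (AddAction, Vec<RemoveAction>) {
    let add = AddAction {
        path: new_path,
        size: group.iter().map(|f| f.size_bytes).sum(),
        data_change: false,
    };
    let removes = group
        .iter()
        .map(|f| RemoveAction { path: f.path.clone(), data_change: false })
        .collect();
    (add, removes)
}

fn main() {
    let files = vec![
        FileEntry {
            path: "date=2022-01-01/part-0.parquet".into(),
            partition_values: vec![("date".into(), "2022-01-01".into())],
            size_bytes: 10 * 1024 * 1024, // 10 MB -> compaction candidate
        },
        FileEntry {
            path: "date=2022-01-01/part-1.parquet".into(),
            partition_values: vec![("date".into(), "2022-01-01".into())],
            size_bytes: 15 * 1024 * 1024, // 15 MB -> compaction candidate
        },
        FileEntry {
            path: "date=2022-01-02/part-2.parquet".into(),
            partition_values: vec![("date".into(), "2022-01-02".into())],
            size_bytes: 2 * 1024 * 1024 * 1024, // 2 GB -> already large, skipped
        },
    ];
    for (i, group) in plan_compaction(&files).iter().enumerate() {
        let (add, removes) = commit_actions(group, format!("part-compacted-{}.parquet", i));
        println!("add: {:?}", add);
        println!("removes: {:?}", removes);
    }
}
```

Step 3 (the actual rewrite of record batches) is omitted here; the point of the sketch is that grouping is purely a function of partition membership and size, and that the commit never sets dataChange.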
Files no longer referenced by the log may be cleaned up later by vacuum (see https://github.com/delta-io/delta-rs/issues/97).
While vacuum is a big step forward, it does not obviate the need for optimize. These two commands have different purposes, and (looking further out) optimize will eventually become part of either a batch or background worker, continuously operating at lower priority, as is common in relational engines.
@xianwill - The steps you've outlined sound good to me, and I think small file compaction via OPTIMIZE would be a great addition to this library. I also agree that the first pass should ignore Z-order indexing. Do you know if this work is planned to be prioritized anytime soon?
@MrPowers this is actively being worked on. Did you see #607?
@wjones127 - oh, wow, looks like amazing progress is being made, so exciting!
@wjones127 - looks like #607 was merged!! Will it be relatively easy to expose this functionality via the Python bindings?
@MrPowers Yeah shouldn't be too difficult. I created #622 to track that.
Resolved by #607.
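For anyone landing here later, a minimal sketch of invoking the compaction from Rust. The names below (`DeltaOps::try_from_uri`, the `optimize` builder, `with_target_size`, and the metrics fields) reflect recent delta-rs releases and are assumptions against current docs; they may differ from the API as #607 originally shipped it.

```rust
// Hedged sketch: builder names follow recent delta-rs releases and may not
// match the API at the time #607 merged. Check the delta-rs docs for your version.
use deltalake::DeltaOps;

#[tokio::main]
async fn main() -> Result<(), deltalake::DeltaTableError> {
    // Open the table, then run bin-packing compaction toward ~1 GB files.
    let ops = DeltaOps::try_from_uri("s3://my-bucket/my-table").await?;
    let (_table, metrics) = ops
        .optimize()
        .with_target_size(1024 * 1024 * 1024)
        .await?;
    println!(
        "files added: {}, files removed: {}",
        metrics.num_files_added, metrics.num_files_removed
    );
    Ok(())
}
```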