[Feature Request] Make OPTIMIZE ZORDER BY skip partitions selectively if the most recent version was already optimized
Feature request
Overview
OPTIMIZE ZODER BY let users influence the order of rows in the persisted parquet files allowing data skipping on read. Currently every time it is executed, all files of the delta table are written again - even if the same OPTIMIZE (with the same ZORDER BY columns) was just executed recently. I'd appreciate it if the OPTIMIZE ZODER BY was "smart" in that regards, that if the most recent version was created by the very same OPTIMIZE, it skips the operation for the table/partition.
Motivation
This feature would decrease unnecessary disc writes, especially if the table/ lot of partitions often don't have "movement" (inserts, deletes, updates) in it. Currently, users can manually specify on which partitions the OPTIMIZE ZORDER BY should be executed on (by specifying the WHERE statement), but this requires knowledge about the table/partitions. It would be very convenient, if delta itself would recognize that another OPTIMIZE on the table / some partition would not have any impact and skip it.
Further details
For implementing this feature, before executing an OPTIMZIE ZODER BY delta would need to look into the version history and identify what was the last operation on the table / each partition. If it was a OPTIMIZE also the ZORDER BY columns would have to be regarded. If it is the same as the new OPTIMZIE ZODER BY it could be skiped for the respective partitions or the entire table.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
- [ ] Yes. I can contribute this feature independently.
- [ ] Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
- [x] No. I cannot contribute this feature at this time.
Thanks for suggesting this, we'll leave this open and see if anyone would like to work on it.
I'm taking a look into this, currently tinkering with the OptimizeExecutor class (not 100% sure its the right class, still learning about the code base). ~May I open a draft PR in the future?~ @amirmor1 is already working on it.
Question -> What if the last operation on that table was not OPTIMIZE but WRITE (or any other operation), but in the history there is somewhere OPTIMIZE operation (with same zOrderBy columns). Can we assume that the new OPTIMIZE operation should be executed only on any "touched" partition after the previous OPTIMIZE operation? Also, what should happen in case a user provided predicate partition filter -> Should delta try to be "smart" again (which opens a whole new set of questions) or should it do whatever the user requested, even though there is no real need to do OPTIMIZE again on that partition?
Question -> What if the last operation on that table was not OPTIMIZE but WRITE (or any other operation), but in the history there is somewhere OPTIMIZE operation (with same zOrderBy columns). Can we assume that the new OPTIMIZE operation should be executed only on any "touched" partition after the previous OPTIMIZE operation?
This makes sense to me. Implementing this may be tricky. Essentially you are saying
- Suppose we have partitions 1,2,3,4
- Suppose at version N-5 an OPTIMIZE command optimized partitions 1, 2
- Suppose at between versions N-4 and N, WRITES were added to partition 2 only
- Then if we run an OPTIMIZE command for version N+1, we should optimize partitions 2, 3, 4. Not partition 1, since there have been no changes to it since the last optimize
@amirmor1 I'd prefer a "smart" logic also if a partition predicate is provided. Currently I don't see a use case where a user would like to have the whole table / some partitions explicitly be rewritten if there is nothing to gain from the rewrite. I think the benefits (faster execution, less storage utilization, less I/O) would be most welcome 😄
Thanks for the comments guys.
Another question -> What should happen in case when configuration change between two OPTIMIZE operation (with or without WRITE operations between the two OPTIMIZE operations). I would expect that even two sequential OPTIMIZE operations (without any WRITE between them), with DELTA_OPTIMIZE_MIN_FILE_SIZE change (for example) in between, should do a full optimize.
As @scottsand-db mentioned its a bit tricky to implement the above, but I think its even trickier with keeping OPTIMIZE configuration state management.
Good point! Haven't thought about that... 😖
Would it be possible to have the "smart" behaviour as default (but not considering configuration changes) and some additional way (option/parameter) to explicitly force the rewrite?
In that scenario it would be the obligation of the users to rebuild the Delta Tables after config changes with the force option.
@keen85 or @amirmor1 - are either of you interested in contributing this feature? or, perhaps, proposing the API and offering a high-level design?
@scottsand-db I would like to work on this feature. However, I think I'll probably gonna need some help / guidance. I can start with some high level design and later with the implementation. Do you have some template for high level design doc?
@amirmor1 awesome! Here are a few examples: Trino/Presto Connector, Delta Standalone Writer, Apache Flink Sink, S3 Multi-cluster writes
Hi, I've created a high level design, describing the issue and the approach I was thinking on tackling it. I admit that I already started playing a bit with the implementation of it, mainly to better understand what I'm facing with. Anyway, any idea/comment will be welcome. I would like to thank @gabip253, @GrigorievNick and @Silverlight42 for taking a look at this doc, giving comments and asking important questions. https://docs.google.com/document/d/1oUZN9F1gCptKGAbXtrCjkOl-STBePxbiiWvlcjpb91U/edit?usp=sharing
@amirmor1 the provided link does not work for me
@keen85 Sorry about that, fixed it.
Thanks for the design doc! This will take some time to review, but I am looking forward to it.
Left some comments. Coming long!
@scottsand-db Thanks! I've updated the document and left some comments.
Hi guys, any update on this? Any questions regarding the design doc? I've provided 4 solutions and I would like to start work on one of them.
Hi @scottsand-db, any update on this?