[Feature Request] Bucketing implementation in Delta Lake
Feature request
Which Delta project/connector is this regarding?
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)
Overview
Implement bucketing in Delta Lake to speed up aggregations and joins.
Motivation
Currently, Delta Lake does not support bucketing. This leads to inefficiency in two kinds of use cases:
- Reduce (aggregation) operations on the bucketing columns. If a Delta table is bucketed by a list of columns, any aggregation on those columns can be sped up, because all rows for a given key already live in the same bucket. Without bucketing information, Spark incurs an expensive shuffle.
- Joins between two tables bucketed in the same way. If two tables are bucketed on the join columns with the same number of buckets, Spark can plan the join as a SortMergeJoin without shuffling either side. Otherwise, an expensive shuffle is needed.
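To make the two cases above concrete, here is a minimal pure-Python sketch of the idea, not Spark or Delta Lake code. Everything in it is an assumption for illustration: `NUM_BUCKETS`, `bucket_id`, `write_bucketed`, `bucketed_sum`, and `bucketed_join` are hypothetical names, CRC32 stands in for the engine's real bucket hash, and the per-bucket join uses a hash lookup for brevity where Spark would use a sort-merge join.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, shared by both tables


def bucket_id(key, num_buckets=NUM_BUCKETS):
    """Deterministic stand-in for the engine's bucket hash (assumption: CRC32)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_bucketed(rows, key_index, num_buckets=NUM_BUCKETS):
    """Simulate writing a table as one sorted 'file' (a list) per bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_id(row[key_index], num_buckets)].append(row)
    for bucket in buckets:
        bucket.sort(key=lambda r: r[key_index])
    return buckets


def bucketed_sum(buckets, key_index, value_index):
    """Aggregate each bucket on its own; a key never spans two buckets."""
    totals = {}
    for bucket in buckets:
        local = defaultdict(int)
        for row in bucket:
            local[row[key_index]] += row[value_index]
        totals.update(local)  # safe: key sets are disjoint across buckets
    return totals


def bucketed_join(left, right, left_key, right_key):
    """Join bucket i of the left table only with bucket i of the right table."""
    assert len(left) == len(right), "both tables need the same bucket count"
    out = []
    for left_bucket, right_bucket in zip(left, right):
        index = defaultdict(list)
        for r in right_bucket:
            index[r[right_key]].append(r)
        for l in left_bucket:
            for r in index.get(l[left_key], []):
                out.append(l + r)
    return out


# Toy data: (user_id, amount) and (user_id, name).
orders = [(1, 10), (2, 5), (1, 7), (3, 2)]
users = [(1, "ann"), (2, "bob"), (4, "dan")]

orders_b = write_bucketed(orders, key_index=0)
users_b = write_bucketed(users, key_index=0)

totals = bucketed_sum(orders_b, 0, 1)
joined = sorted(bucketed_join(orders_b, users_b, 0, 0))
```

In both functions each bucket is processed in isolation, which is the property that lets an engine skip the shuffle: no row ever has to move to another bucket to meet its matching keys.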
Bucketing was proposed in Spark to solve the above problems (see original JIRA and design), and Spark has supported it for several years. Delta Lake, however, does not support bucketing. Delta Lake has developed Z-ordering and liquid clustering, but both features target data skipping, so neither helps avoid unnecessary shuffles in aggregations and joins.
Further details
The design is here.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
- [ ] Yes. I can contribute this feature independently.
- [x] Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
- [ ] No. I cannot contribute this feature at this time.