[Feature Request] Bucketing implementation in Delta Lake

Open wudanzy opened this issue 6 months ago • 2 comments

Feature request

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Overview

Implement bucketing in Delta Lake to speed up aggregations and joins.

Motivation

I found that Delta Lake currently doesn’t support bucketing. This leads to inefficiency in two kinds of use cases:

  • Aggregations on the bucketing columns. If a Delta table is bucketed by a list of columns, any aggregation on those columns can be sped up, because rows with the same key already sit in the same bucket. Without bucketing information, Spark has to perform an expensive shuffle.
  • Joins between two tables bucketed in the same way. If two tables are bucketed on the join columns with the same number of buckets, Spark can plan the join as a SortMergeJoin without shuffling either side. Otherwise, an expensive shuffle is needed.
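
To make the two cases above concrete, here is a minimal pure-Python sketch of why co-located keys remove the need for a shuffle. It is only an illustration under stated assumptions: `NUM_BUCKETS`, `bucket_id`, `write_bucketed`, `aggregate_bucketed`, and `join_bucketed` are hypothetical names, Python's built-in `hash` stands in for Spark's bucket hash, and the per-bucket join is a simple hash join rather than the sort-merge join Spark would plan.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen for illustration only


def bucket_id(key, num_buckets=NUM_BUCKETS):
    # Stand-in for Spark's bucket hash: any deterministic hash works here.
    return hash(key) % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    """Place each (key, value) row into the bucket chosen by its key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return buckets


def aggregate_bucketed(buckets):
    """Sum values per key one bucket at a time, with no cross-bucket movement."""
    totals = {}
    for bucket in buckets:
        local = defaultdict(int)
        for key, value in bucket:
            local[key] += value
        # A key never spans two buckets, so each local result is already final.
        totals.update(local)
    return totals


def join_bucketed(left_buckets, right_buckets):
    """Inner-join two tables bucketed the same way, bucket i against bucket i."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("tables must use the same number of buckets")
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        index = defaultdict(list)
        for key, right_value in right:
            index[key].append(right_value)
        for key, left_value in left:
            for right_value in index.get(key, []):
                joined.append((key, left_value, right_value))
    return joined
```

Both functions only ever look at one bucket (or one pair of matching buckets) at a time, which is the property that lets an engine skip the shuffle when it knows the bucketing layout.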

Bucketing was introduced in Spark to solve the problems above (see original JIRA and design), and Spark has supported it for several years. Delta Lake, however, does not support bucketing. Delta Lake has developed Z-ordering and liquid clustering, but both features target data skipping, so neither helps avoid unnecessary shuffles in aggregations and joins.

Further details

The design is here.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • [ ] Yes. I can contribute this feature independently.
  • [x] Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • [ ] No. I cannot contribute this feature at this time.

wudanzy, Aug 07 '24 03:08