
Delta Lake MERGE/UPDATE/DELETE on Databricks should trigger optimized write and auto compaction

Open jlowe opened this issue 1 year ago • 5 comments

https://docs.databricks.com/en/delta/tune-file-size.html states that Delta Lake MERGE, UPDATE, and DELETE operations will always trigger optimized write and auto compaction behavior as of 10.4 LTS, and this cannot be disabled. The RAPIDS Accelerator forms of these operations should mimic this behavior.
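For context, the session-level settings that normally govern these behaviors on Databricks look roughly like the following (config names are my reading of the linked doc, so treat them as a sketch; per that doc, MERGE/UPDATE/DELETE trigger the behaviors regardless of these values as of 10.4 LTS):

```properties
# Assumed Databricks session configs controlling these behaviors elsewhere;
# for MERGE/UPDATE/DELETE they are effectively forced on and cannot be disabled.
spark.databricks.delta.optimizeWrite.enabled=true
spark.databricks.delta.autoCompact.enabled=true
```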

jlowe avatar Feb 13 '24 20:02 jlowe

Note that this also should remove the repartition by partition key for partitioned tables when writing a MERGE because we're going to turn around and repartition for the optimized write anyway.

jlowe avatar Feb 13 '24 22:02 jlowe

Note that for MERGE the user can specify spark.databricks.delta.merge.repartitionBeforeWrite.enabled=false to avoid repartitioning by the partition key when merging into a small number of partitions, which would otherwise funnel all the write data into just a few tasks. This is not exactly semantically equivalent to optimized write and auto compaction, but it can avoid the terrible write performance in that partitioned write case.
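As a concrete example, the workaround above can be set per session or in spark-defaults.conf (the config name is taken verbatim from the comment; whether it is the right trade-off depends on the table's partition layout):

```properties
# Workaround: skip the pre-write repartition by partition key during MERGE.
# Not equivalent to optimized write / auto compaction, but avoids sending
# all write data to a small number of tasks for low-partition-count merges.
spark.databricks.delta.merge.repartitionBeforeWrite.enabled=false
```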

jlowe avatar Apr 02 '24 22:04 jlowe

Hi @jlowe, Delta OSS has added support for optimized write: https://github.com/delta-io/delta/pull/2145. I think we can always enable optimized write after porting this?

liurenjie1024 avatar Apr 18 '24 06:04 liurenjie1024

This is a Databricks-specific behavior per the doc linked above, not a behavior in OSS Delta Lake, at least for the versions of OSS Delta Lake that we support. There's already a separate issue for tracking the OSS versions of optimized write and auto compact, see #10397 and #10398, respectively, but I do not see it as being relevant for this issue. We already support optimized write and auto compact on Databricks.

jlowe avatar Apr 18 '24 13:04 jlowe

I'll take this.

liurenjie1024 avatar Apr 19 '24 14:04 liurenjie1024