Support Auto Compaction

Open sezruby opened this issue 2 years ago • 5 comments

Description

Support Auto Compaction described in: https://docs.databricks.com/delta/optimizations/auto-optimize.html#how-auto-compaction-works

We can support auto compaction via a new post-commit hook that runs OptimizeCommand with a smaller file size threshold.

  • spark.databricks.delta.autoCompact.enabled (default: false)
  • spark.databricks.delta.autoCompact.maxFileSize (default: 128MB)
  • spark.databricks.delta.autoCompact.minNumFiles (default: 50)

The configs above are the same as Databricks Auto Compaction.
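
For illustration, here is a minimal sketch of enabling these knobs on a Spark session with the config keys proposed in this PR (the Delta extension/catalog settings are standard; the exact value format for maxFileSize, bytes below, is an assumption for the sketch, not confirmed behavior):

    import org.apache.spark.sql.SparkSession

    // Sketch only: enable the proposed auto compaction configs on a Delta-enabled session.
    val spark = SparkSession.builder()
      .appName("auto-compaction-sketch")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .config("spark.databricks.delta.autoCompact.enabled", "true")           // turn on the post-commit hook
      .config("spark.databricks.delta.autoCompact.maxFileSize", "134217728")  // 128MB, assumed to be in bytes
      .config("spark.databricks.delta.autoCompact.minNumFiles", "50")         // only compact when >= 50 small files
      .getOrCreate()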

New config 1 - autoCompact.maxCompactBytes

As auto compaction will be triggered after every table update, I introduced another config to cap the total amount of data optimized in a single auto compaction operation: spark.databricks.delta.autoCompact.maxCompactBytes (default: 50G)

In Databricks, this limit is adjusted based on the available cluster resources. The config is a quick and easy workaround for that.
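
To make the cap concrete, a pass could simply stop adding candidate files once the running total would exceed maxCompactBytes. This is a hedged sketch with an illustrative helper name, not the PR's actual implementation:

    // Hypothetical helper: keep candidate files while the cumulative size stays under the cap.
    def capBytes(sizes: Seq[Long], maxCompactBytes: Long): Seq[Long] = {
      val running = sizes.scanLeft(0L)(_ + _).tail                     // cumulative sizes
      sizes.zip(running).takeWhile(_._2 <= maxCompactBytes).map(_._1)
    }

    // With the default 50GB cap, a backlog of 1GB small files is trimmed to 50 files per pass.
    val picked = capBytes(Seq.fill(200)(1L << 30), 50L * (1L << 30))
    println(picked.size) // 50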

New config 2 - autoCompact.target

The PR adds another new config, autoCompact.target, to control which files are targeted by auto compaction: spark.databricks.delta.autoCompact.target (default: "table")

  • table: target all files in the table
  • commit: target only the files added/updated by the commit that triggers auto compaction
  • partition: target only the partitions containing any of the files added/updated by the commit that triggers auto compaction

Users usually write/update data in only a few partitions and don't expect changes to other partitions. If the table is not already optimized, the default table behavior might unexpectedly cause conflicts with concurrent operations on other partitions, and the files added/updated by the triggering commit might not get optimized at all if there are many small files in other partitions.
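
To make the three modes concrete, here is a hedged sketch of how the candidate set could be narrowed per mode (FileEntry and candidateFiles are illustrative names, not the PR's code or Delta's actual AddFile action):

    // Illustrative model of a table file.
    case class FileEntry(path: String, partitionValues: Map[String, String], size: Long)

    def candidateFiles(
        target: String,                  // "table" | "commit" | "partition"
        allFiles: Seq[FileEntry],        // every file currently in the table
        committedFiles: Seq[FileEntry]   // files added/updated by the triggering commit
    ): Seq[FileEntry] = target match {
      case "table"  => allFiles
      case "commit" => committedFiles
      case "partition" =>
        // Keep only files in partitions touched by the triggering commit.
        val touched = committedFiles.map(_.partitionValues).toSet
        allFiles.filter(f => touched.contains(f.partitionValues))
      case other => throw new IllegalArgumentException(s"Unknown autoCompact.target: $other")
    }

The partition mode keeps compaction local to the partitions the commit touched, which is what avoids the cross-partition conflicts described above.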

Fixes #815

How was this patch tested?

Unit tests

Does this PR introduce any user-facing changes?

Support Auto compaction feature

sezruby avatar May 27 '22 05:05 sezruby

I didn't write a design doc or open a separate issue since the change is straightforward. Please let me know if design documentation is needed.

sezruby avatar May 27 '22 16:05 sezruby

Hi @sezruby - thanks for this PR! It will take some time for us to review and verify it. We will get back to you.

scottsand-db avatar May 27 '22 17:05 scottsand-db

Hi @sezruby - just updating you with the status on our end. We are very busy with planned features for the next release of Delta Lake, as well as with preparation for the upcoming Data and AI Summit in June.

So, it will take us some time to get back to you on this.

scottsand-db avatar May 31 '22 21:05 scottsand-db

@vkorukanti Could you review the PR when you have the time? TIA!

sezruby avatar Jul 26 '22 03:07 sezruby

@vkorukanti Could you review the PR when you have the time? TIA!

@vkorukanti @scottsand-db A gentle reminder. This one is simpler than Optimize Write so I would like to merge this PR first.

sezruby avatar Aug 04 '22 03:08 sezruby

Can you please fix the conflicts?

scottsand-db avatar Sep 15 '22 15:09 scottsand-db

@scottsand-db @zsxwing Could you review the PR?

sezruby avatar Sep 29 '22 06:09 sezruby

We are also having this issue: we can't define disjoint conditions for both merge and optimize if they are run concurrently.

pedrosalgadowork avatar Oct 20 '22 16:10 pedrosalgadowork

We are also having this issue: we can't define disjoint conditions for both merge and optimize if they are run concurrently.

@pedrosalgadowork Which issue do you mean? Is it related to auto compaction?

sezruby avatar Oct 21 '22 02:10 sezruby

@scottsand-db @zsxwing @tdas Could you review the PR?

sezruby avatar Oct 21 '22 02:10 sezruby

@scottsand-db @zsxwing @tdas - can you help review this PR? It's been open for several months now with no recent updates/comments.

rasidhan avatar Nov 20 '22 14:11 rasidhan

It would be great to have this in Delta 2.3. Is the plan to merge it soon?

felipepessoto avatar Dec 22 '22 02:12 felipepessoto

Looks like there are some conflicts with the new DV (deletion vector) changes; I had to update a few things while rebasing on the 2.3 release in my fork.

It would be great to get some more eyes on this and get it merged; this is a highly valuable and missing feature.

Kimahriman avatar Apr 09 '23 13:04 Kimahriman

@dennyglee @scottsand-db @zsxwing @tdas Could you review the PR?

sezruby avatar Jun 23 '23 20:06 sezruby

@dennyglee @scottsand-db @zsxwing @tdas Could you review the PR? I'll resolve the conflicts once you start actively reviewing.

sezruby avatar Jul 06 '23 05:07 sezruby

@dennyglee @scottsand-db @zsxwing @tdas @allisonport-db Could you review the PR?

sezruby avatar Jul 17 '23 19:07 sezruby

Is there any obstacle to reviewing this PR?

resulyrt93 avatar Aug 16 '23 14:08 resulyrt93

@sezruby In class spark/src/main/scala/org/apache/spark/sql/delta/OptimisticTransaction.scala, in the method groupFilesIntoBins:

    val filteredByBinSize = bins.filter { bin =>
      // bin size is equal to or greater than autoCompactMinNumFiles files
      bin.size >= autoCompactMinNumFiles ||
        // or bin size + number of deletion vectors >= autoCompactMinNumFiles files
        bin.count(_.deletionVector != null) + bin.size >= autoCompactMinNumFiles
    }.map(b => (partition, b))

Why are we using the individual bin.size when comparing against autoCompactMinNumFiles?

If the total size of the files is greater than autoCompact.maxFileSize and the total number of files is greater than minNumFiles, then after segregating them into bins by size, each individual bin can still have fewer files than minNumFiles, and hence the files will never be auto-compacted.

Is there any particular reason for doing that? I understand it might cause compaction of some small files, but isn't that better than no compaction at all?
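
For concreteness, here is a small self-contained sketch of that scenario with hypothetical sizes; the greedy binning below is only in the spirit of groupFilesIntoBins, not the actual code:

    // 60 small files of 40MB each; maxFileSize = 128MB; minNumFiles = 50.
    val fileSizes   = Seq.fill(60)(40L * 1024 * 1024)
    val maxFileSize = 128L * 1024 * 1024
    val minNumFiles = 50

    // Greedy binning: start a new bin whenever adding a file would exceed maxFileSize.
    val bins = fileSizes.foldLeft(List(List.empty[Long])) { (acc, size) =>
      if (acc.head.sum + size <= maxFileSize) (size :: acc.head) :: acc.tail
      else List(size) :: acc
    }.reverse

    println(bins.map(_.size).distinct)           // List(3): every bin holds only 3 files
    println(bins.exists(_.size >= minNumFiles))  // false -> the check skips all 60 small files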

takkarharsh avatar Oct 29 '23 03:10 takkarharsh