
[Feature Request][Spark] VACUUM FULL should allow subsequent VACUUM LITE

Open istreeter opened this issue 10 months ago • 6 comments

Feature request

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Overview

If a table does not meet the requirements for VACUUM LITE, then it should be possible to run a one-off VACUUM FULL to unblock subsequent VACUUM LITE.

Motivation

This will benefit any older Delta table whose Delta log has already been pruned. Currently it is impossible to ever run a VACUUM LITE on these tables.

Further details

I have a delta table which is >1 year old, and which has never been vacuumed. The table does not meet the requirements for a VACUUM LITE, i.e. (from the docs):

If VACUUM LITE cannot be completed because the Delta log has been pruned a DELTA_CANNOT_VACUUM_LITE exception is raised.

Currently, I can never ever run VACUUM LITE on this old table, because I cannot get past the DELTA_CANNOT_VACUUM_LITE exception.

I think it's a fairly easy fix: we just need VACUUM FULL to persist the latestCommitVersionOutsideOfRetentionWindow into the _last_vacuum_info file. Currently, latestCommitVersionOutsideOfRetentionWindowOpt is calculated for LITE only, but not for FULL. If we can fix VACUUM FULL to write a non-empty _last_vacuum_info file, then I think subsequent VACUUM LITE will just work without problem.

Please can you confirm whether this is a good idea, and then I'd be happy to contribute a PR. Am I correct that currently it is impossible to ever run VACUUM LITE on a table whose Delta log has been pruned?

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • [ ] Yes. I can contribute this feature independently.
  • [x] Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • [ ] No. I cannot contribute this feature at this time.

istreeter avatar Feb 16 '25 21:02 istreeter

Hello @istreeter, did you find another solution for this problem? I have the same situation here. I was wondering whether matching the logRetention days to fileRetentionDays and running a VACUUM FULL before VACUUM LITE could solve the problem, even though I no longer have all my Delta log files (due to log pruning). But it doesn't seem to work. Have you tried it? About this question:

"Am I correct that currently it is impossible to ever run VACUUM LITE on a table whose Delta log has been pruned?"

Were you able to confirm it? If it's true, I would probably need to recreate all the Delta tables that were already pruned.

messerzen avatar Jul 08 '25 20:07 messerzen

Hi @messerzen

did you find another solution for this problem?

Unfortunately no. Since I learnt about this problem, I have just been doing a VACUUM FULL because that is the only way I can get it to work. I tried changing a few config options, but I agree with your observation that nothing seems to make it possible to run VACUUM LITE.

Were you able to confirm it?

Nobody else has confirmed it to me. But I read through the code in this repo, and the code seems to match what I described.

If it's true, I would probably need to recreate all the Delta tables that were already pruned.

Yeah that option should work, if that option is available to you. It seems like quite an expensive and disruptive fix though.

istreeter avatar Jul 10 '25 22:07 istreeter

Hello @istreeter, thanks for answering!

I found a workaround for this problem. In summary, in my Spark script, I check whether:

  • The file _last_vacuum_info exists and is not empty (size > 0). Even when VACUUM LITE fails, the file is created, but without the latestCommitVersionOutsideOfRetentionWindowOpt info.

If the file does not exist, or exists with size=0, I manually create it with the content {"latestCommitVersionOutsideOfRetentionWindowOpt":40}, where 40 is the version of the oldest commit file, 00000000000040.json for instance.

It's important that the file content is formatted exactly, with:

  • Double quotes around "latestCommitVersionOutsideOfRetentionWindowOpt"
  • No spaces between the : and the value.

After this check, you can safely run the VACUUM LITE command.
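
The check described above could be sketched roughly as follows. This is a minimal illustration for a table on a local filesystem, not an officially supported procedure. The function names are hypothetical, and it assumes that `log_dir` is the directory holding both the commit files and `_last_vacuum_info` (assumed here to be the table's `_delta_log` directory). On cloud storage the same steps would need the storage client of your choice.

```python
import json
import re
from pathlib import Path
from typing import Optional

# Commit files are named <zero-padded version>.json; checkpoints and
# .crc files have other suffixes and are deliberately not matched.
COMMIT_FILE = re.compile(r"^(\d+)\.json$")


def oldest_commit_version(log_dir: Path) -> Optional[int]:
    """Return the lowest commit version still present in log_dir, if any."""
    versions = []
    for path in log_dir.iterdir():
        match = COMMIT_FILE.match(path.name)
        if match:
            versions.append(int(match.group(1)))
    return min(versions) if versions else None


def ensure_last_vacuum_info(log_dir: Path) -> bool:
    """Write _last_vacuum_info if it is missing or empty.

    Returns True if the file was written, False if it already had content.
    Assumption: the file lives directly inside log_dir.
    """
    info_file = log_dir / "_last_vacuum_info"
    if info_file.exists() and info_file.stat().st_size > 0:
        return False  # already populated, leave it alone

    version = oldest_commit_version(log_dir)
    if version is None:
        raise ValueError(f"no commit files found in {log_dir}")

    # Compact separators: double-quoted key, no space after the colon.
    payload = json.dumps(
        {"latestCommitVersionOutsideOfRetentionWindowOpt": version},
        separators=(",", ":"),
    )
    info_file.write_text(payload)
    return True
```

Calling `ensure_last_vacuum_info(Path("/path/to/table/_delta_log"))` before VACUUM LITE would apply the check; the path is a placeholder. Using `json.dumps` with compact separators keeps the output in the exact format described above without hand-building the string.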

messerzen avatar Jul 10 '25 23:07 messerzen

I have the same issue: for tables that have existed for a long time, VACUUM LITE does not run even after performing a VACUUM FULL on the table.

danieljsc avatar Oct 27 '25 11:10 danieljsc

@messerzen I tried your workaround and it didn't work. Even after I populated the _last_vacuum_info file, it was still requesting the FULL vacuum to clean orphaned files, and after running the FULL vacuum, the content of _last_vacuum_info was gone from the file. Any ideas what it could be?

danieljsc avatar Oct 31 '25 02:10 danieljsc

Hey @istreeter, @danieljsc, @messerzen,

I have made the fix in the code and raised a PR, and I have tested it internally. All details are provided in the PR. Please review it; you can build a jar from the PR and test it until the reviewers merge it.

AnudeepKonaboina avatar Nov 28 '25 06:11 AnudeepKonaboina