hudi icon indicating copy to clipboard operation
hudi copied to clipboard

[HUDI-5053] Create clean complete commit when there is none to clean in order to leverage incremental cleaning

Open hussein-awala opened this issue 3 years ago • 9 comments

Change Logs

When the clean planner lists the files in the partitions and it doesn't find any file to delete, the clean operation is skipped without any commit, then in the next clean, if the incremental cleaning mode is enabled, the clean planner doesn't find any information about the checked commits, and it will recheck all the files a second time. This PR creates a clean commit contains the earliestCommitToRetain regardless the deleted files list, in this case the clean planner will check only the partitions that have been changed since the earliestCommitToRetain in the last clean commit.

Impact

A new clean commit will be added to the timeline even if there was not a real clean operation. For the benefits, a big performance improvement (and cost reduction of S3 listing) in cleaning operation for table where old partitions are seldom changed.

Risk level (write none, low medium or high below)

low: The risk level is low because these changes affects only the clean plans without files to delete, and I kept the checks on the empty commit files to avoid Avro empty file exception, and I improved the method which clean this empty files. If for some reason we have an empty Avro file, a brute force will be performed to prepare the clean plan. I will test these changes on our project within the week to make sure everything is fine

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

  • [x] Read through contributor's guide
  • [x] Change Logs and Impact were stated clearly
  • [x] Adequate tests were added if applicable
  • [x] CI passed

hussein-awala avatar Oct 23 '22 15:10 hussein-awala

I tested the PR in our project, it works fine as expected. For each clean we have the 3 states requested, inflight and completed, and the clean planner checks only the partitions that have been modified since earliestCommitToRetain.

Recently, we incremented CLEAN_MAX_COMMITS to 24 as @nsivabalan proposed in order to clean the tables every 24 hours (we have one commit per hour) and avoid listing S3 partitions in the tables with infrequently changed partitions, but the config doesn't work as expected, because after 24 commits, if the list of files to delete is empty, the cleaner will be executed at each next commit until delete something, because for the clean planner, the last clean was when the were some files to delete, and all the next clean operations are not considered because they write nothing to the timeline.

In brief, we need this patch ASAP, can you please add it to 0.13.0?

hussein-awala avatar Oct 25 '22 19:10 hussein-awala

Thank you for raising this PR @hussein-awala .

I will take a look at it by tomorrow.

pratyakshsharma avatar Oct 27 '22 17:10 pratyakshsharma

hey @hussein-awala : thanks for the patch. we can definitely take up this patch. But would prefer to guard it using a new flag. reason is, for those who are running clean after every commit, it could keep producing empty clean commit files in the timeline which could impact the query latency for large scale datasets. So, we can let interested folks enable it if need be. let me know wdyt.

nsivabalan avatar Nov 02 '22 03:11 nsivabalan

Perfect! I was going to suggest that because this is useless when incremental cleaning mode is not activated. I will add a new config for the empty clean commits, and I will duplicate the tests I already fixed (the old version without empty clean commit and the new one with empty clean commit enabled).

for those who are running clean after every commit, it could keep producing empty clean commit files in the timeline which could impact the query latency for large scale datasets.

If they are running clean after every commit with incremental cleaning, it's better to add an empty clean commit to check only the changed partitions since the last commit instead of checked all the changed partitions since the last no empty clean. Based on our tests, I confirm that this improve the query latency and not the opposite. But I will create a new separate config for this patch and not activating it when incremental cleaning is enabled.

hussein-awala avatar Nov 02 '22 22:11 hussein-awala

sg

nsivabalan avatar Nov 02 '22 23:11 nsivabalan

Hey @nsivabalan, can you review the PR please? Feel free to change the name of the new configuration if you have a better suggestion.

hussein-awala avatar Nov 06 '22 22:11 hussein-awala

@codope yes I will rebase it by tomorrow

Is this really necessary? Also, can you please rebase?

yes, if we use the incremental cleaning, we need to store earliestCommitToRetain and use it in the next clean operation to avoid rechecking the partitions we already checked when there is no files to delete. Also as @nsivabalan proposed, I created a new config hoodie.cleaner.allow.empty.commits which is false by default, if it's false, the clean commit will be added only if there is some deleted files. And if it's true, I apply the proposed logic.

hussein-awala avatar Dec 07 '22 13:12 hussein-awala

@codope I rebased it Is there any chance to have this feature in 0.12.2? We've already been using it in production for 1 month without any problem.

hussein-awala avatar Dec 07 '22 19:12 hussein-awala

CI report:

  • 6ab2a114214e2d961abb8f80f535dc02cbae548e Azure: FAILURE
Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

hudi-bot avatar Dec 28 '23 16:12 hudi-bot