hudi icon indicating copy to clipboard operation
hudi copied to clipboard

[HUDI-4878] Fix incremental cleaner use case

Open parisni opened this issue 3 years ago • 13 comments

Change Logs

Describe context and summary for this change. Highlight if any code was copied.

Impact

Currently incremental cleaning is run for both KEEP_LATEST_COMMITS, KEEP_LATEST_BY_HOURS policies. It is not run when KEEP_LATEST_FILE_VERSIONS.

This can lead to not cleaning files. This PR fixes this problem by enabling incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.

Here is the scenario of the problem:

Say we have 3 committed files in partition-A and we add a new commit in partition-B, and we trigger cleaning for the first time (full partition scan):

partition-A/
commit-0.parquet
commit-1.parquet
commit-2.parquet
partition-B/
commit-3.parquet

In the case say we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3, the cleaner will remove the commit-0.parquet to keep 3 commits. For the next cleaning, incremental cleaning will trigger, and won't consider partition-A/ until a new commit change it. In case no later commit changes partition-A then commit-1.parquet will stay forever. However it should be removed by the cleaner.

Now if in case of KEEP_LATEST_FILE_VERSIONS, the cleaner will only keep commit-2.parquet. Then it makes sense that incremental cleaning won't consider partition-A until it is changed. Because there is only one commit.

This is why incremental cleaning should only be enabled with KEEP_LATEST_FILE_VERSIONS

Hope this is clear enough

Risk level: none | low | medium | high

Choose one. If medium or high, explain what verification was done to mitigate the risks.

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

parisni avatar Aug 25 '22 13:08 parisni

@parisni : hey I am bit confused on your example. Can you illustrate which file group each commit is updating. whether each commit is a new file group or updating the same file group. If you can clarify that, I will go over the issue again to see if the fix makes sense. really appreciate you putting up the fix.

nsivabalan avatar Sep 01 '22 18:09 nsivabalan

each commit is a new file group That's it. Hope this help

On September 1, 2022 6:48:09 PM UTC, Sivabalan Narayanan @.> wrote: @. : hey I a, bit confused on your example. Can you illustrate which file group each commit is updating. whether each commit is a new file group or updating the same file group. If you can clarify that, I will go over the issue again to see if the fix makes sense.

really appreciate you putting up the fix.

-- Reply to this email directly or view it on GitHub: https://github.com/apache/hudi/pull/6498#issuecomment-1234654692 You are receiving this because you were mentioned.

Message ID: @.***>

parisni avatar Sep 01 '22 21:09 parisni

got it, thanks. Let me go through the example you have put up and clarify few things.

Say we have 3 committed files in partition-A and we add a new commit in partition-B, and we trigger cleaning for the first time (full partition scan):

partition-A/
commit-0 added file1_V1.parquet
commit-1. added file1_V2.parquet
commit-2 added file1_V3.parquet
partition-B/
commit-3 added file2_V1.parquet

In the case say we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3, the cleaner will remove the files created by commit-0 and keep 3 commits. ie. file1_V1.parquet will be cleaned up. But hudi also keeps track of earliest commit retained in this case which is commit2. This earliest commit retained is the one we will leverage later to do incremental cleaning.

For the next cleaning, incremental cleaning will trigger, and will comb through all commits >= earliest commit retained i.e. commit2. and so file2_v1 will be deleted this time. and will update the earliest commit retained to commit3 now.

this may not be applicable w/ KEEP_LATEST_FILE_VERSIONS. bcoz, we can't pin point a commit and say everything before that commit can be ignore for future cleaning. and thus incase of KEEP_LATEST_FILE_VERSIONS, we can't do incremental cleaning. It is in this policy, we might encounter a corner case where, a file group was updated only in Commit 1 and commit and never updated later. and after a long time, had a new version in say commit 100. we need to clean up the first version (assuming KEEP_LATEST_FILE_VERSIONS count is 2).

Let me know if this makes sense. or if you still feel, I am missing something, can you elaborate w/ an example.

nsivabalan avatar Sep 04 '22 01:09 nsivabalan

For the next cleaning, incremental cleaning will trigger, and will comb through all commits >= earliest commit retained i.e. commit2. and so file2_v1 will be deleted this time. and will update the earliest commit retained to commit3 now.

I assume you meant file1_v2 ?

Let me read again the source that's not what I understood and also tested so far.

On September 4, 2022 1:30:03 AM UTC, Sivabalan Narayanan @.***> wrote:

got it, thanks. Let me go through the example you have put up and clarify few things.

Say we have 3 committed files in partition-A and we add a new commit in partition-B, and we trigger cleaning for the first time (full partition scan):

partition-A/
commit-0 added file1_V1.parquet
commit-1. added file1_V2.parquet
commit-2 added file1_V3.parquet
partition-B/
commit-3 added file2_V1.parquet

In the case say we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3, the cleaner will remove the files created by commit-0 and keep 3 commits. ie. file1_V1.parquet will be cleaned up. But hudi also keeps track of earliest commit retained in this case which is commit2. This earliest commit retained is the one we will leverage later to do incremental cleaning.

For the next cleaning, incremental cleaning will trigger, and will comb through all commits >= earliest commit retained i.e. commit2. and so file2_v1 will be deleted this time. and will update the earliest commit retained to commit3 now.

this may not be applicable w/ KEEP_LATEST_FILE_VERSIONS. bcoz, we can't pin point a commit and say everything before that commit can be ignore for future cleaning. and thus incase of KEEP_LATEST_FILE_VERSIONS, we can't do incremental cleaning. It is in this policy, we might encounter a corner case where, a file group was updated only in Commit 1 and commit and never updated later. and after a long time, had a new version in say commit 100. we need to clean up the first version (assuming KEEP_LATEST_FILE_VERSIONS count is 2).

Let me know if this makes sense. or if you still feel, I am missing something, can you elaborate w/ an example.

-- Reply to this email directly or view it on GitHub: https://github.com/apache/hudi/pull/6498#issuecomment-1236228935 You are receiving this because you were mentioned.

Message ID: @.***>

parisni avatar Sep 05 '22 08:09 parisni

well, after read again in particular that method https://github.com/apache/hudi/blob/ca8a57a21d163e573e3a617fd6173fa4b913666c/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L177 it does what exactly what the logger says :

    LOG.info("Incremental Cleaning mode is enabled. Looking up
partition-paths that have since changed "
        + "since last cleaned at " +
cleanMetadata.getEarliestCommitToRetain()
        + ". New Instant to retain : " + newInstantToRetain);

But you were right : newInstantToRetain is commit2. In my mind it was the cleaning Commit which woud have been commit4.

this may not be applicable w/ KEEP_LATEST_FILE_VERSIONS. bcoz, we can't pin point a commit and say everything before that commit can be ignore for future cleaning. and thus incase of KEEP_LATEST_FILE_VERSIONS, we can't do incremental cleaning

if we pin the cleaning commit (commit4) then we can apply incremental cleaning together with KEEP_LATEST_FILE_VERSIONS.

On Mon, 2022-09-05 at 08:32 +0000, Nicolas Paris wrote:

For the next cleaning, incremental cleaning will trigger, and will comb through all commits >= earliest commit retained i.e. commit2. and so file2_v1 will be deleted this time. and will update the earliest commit retained to commit3 now.

I assume you meant file1_v2 ?

Let me read again the source that's not what I understood and also tested so far.

On September 4, 2022 1:30:03 AM UTC, Sivabalan Narayanan @.***> wrote:

got it, thanks. Let me go through the example you have put up and clarify few things.

Say we have 3 committed files in partition-A and we add a new commit in partition-B, and we trigger cleaning for the first time (full partition scan):

partition-A/
commit-0 added file1_V1.parquet
commit-1. added file1_V2.parquet
commit-2 added file1_V3.parquet
partition-B/
commit-3 added file2_V1.parquet

In the case say we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3, the cleaner will remove the files created by commit-0 and keep 3 commits. ie. file1_V1.parquet will be cleaned up. But hudi also keeps track of earliest commit retained in this case which is commit2. This earliest commit retained is the one we will leverage later to do incremental cleaning.

For the next cleaning, incremental cleaning will trigger, and will comb through all commits >= earliest commit retained i.e. commit2. and so file2_v1 will be deleted this time. and will update the earliest commit retained to commit3 now.

this may not be applicable w/ KEEP_LATEST_FILE_VERSIONS. bcoz, we can't pin point a commit and say everything before that commit can be ignore for future cleaning. and thus incase of KEEP_LATEST_FILE_VERSIONS, we can't do incremental cleaning. It is in this policy, we might encounter a corner case where, a file group was updated only in Commit 1 and commit and never updated later. and after a long time, had a new version in say commit 100. we need to clean up the first version (assuming KEEP_LATEST_FILE_VERSIONS count is 2).

Let me know if this makes sense. or if you still feel, I am missing something, can you elaborate w/ an example.

-- Reply to this email directly or view it on GitHub: https://github.com/apache/hudi/pull/6498#issuecomment-1236228935 You are receiving this because you were mentioned.

Message ID: @.***>

parisni avatar Sep 05 '22 18:09 parisni

@nsivabalan added a (git) commit, according to last comment

parisni avatar Sep 05 '22 19:09 parisni

here is my take: don't think we can go w/ this fix as is. but we can try something else.

we need to maintain another variable called, lastCompletedCommit whenever a cleaner kicks in. so, it will be added as part of clean commit metadata. infact we landed a patch 2 days back that adds this. https://github.com/apache/hudi/pull/5478

let me walk through a scenario. cleaner policy: based on latest file versions. count: 2

Commit1 : FG1_V1, FG2_V2

Commit2: FG1_V2, FG3_V1

Commit3: FG1_V3, FG3_V2, FG4_V1

Commit4: FG1_V4, FG2_V2, FG4_V2, FG5_V1

Commit5: FG2_V3, FG5_V2

for first two commits, nothing will get cleaned up. and so "lastCompletedCommit" = null.

just after C3, lets say we trigger cleaning. C3_C FG_1_V1 will be cleaned up. and will mark lastCompletedCommit = C3.

just after C4, we trigger cleaner again. we can just look at file groups that got newer versions after C3 until now. in this case, just in C4. So, while planning we can consider only file groups FG1, FG2, FG4 and FG5. reasoning is, those file groups which was never touched after last clean may not reach the max file version threshold. bcoz, they did not receive any updates only.

So, with C4_Clean, we will consider only FG1, FG2, FG4 and FG5 and eventually will clean up FG1_V2.

And update lastCompletedCommit = C4.

just after C5, we trigger cleaner again. will consider only file groups updated after C4 (lastCompletedCommit from last clean). which is FG2, FG5. and will result in cleaning up FG2_V1. since FG1, FG3 nor FG4 was touched, we don't need to even consider them while preparing the clean plan.

the current patch in its current state might miss out something. we can't rely on earlistRetainedCommit. For eg, just after C2, earlistCommitToRetain is being set to C2. but ideally we should consider lastCompletedCommit from the last time cleaner got triggered and do incremental polling for newer commits from there.

nsivabalan avatar Sep 15 '22 21:09 nsivabalan

let me know if I am making sense. we can have a sync call too since this is dragging a bit. wanted to land in the next few days.

nsivabalan avatar Sep 15 '22 21:09 nsivabalan

@parisni : hey hi. we have a code freeze coming up in a weeks time for 0.12.1. Just wanted to keep you informed.

nsivabalan avatar Sep 20 '22 03:09 nsivabalan

@codope: Can you review this patch. I have overhauled the initial fix put up. But could result in good perf improv for cleaning. I am yet to write tests. but do take a look at my logic and let me know if it looks ok. or is there any case that I could be missing.

nsivabalan avatar Sep 21 '22 23:09 nsivabalan

the current patch in its current state might miss out something. we can't rely on earlistRetainedCommit. For eg, just after C2, earlistCommitToRetain is being set to C2. but ideally we should consider lastCompletedCommit from the last time cleaner got triggered and do incremental polling for newer commits from there.

Agreed thanks

parisni avatar Sep 22 '22 12:09 parisni

@parisni : can you also review the patch. does this look ok. may be you can try it in your staging pipeline and let us know if things are working as expected (i.e. incremental cleaning kicks in for cleaning based on file versions)

nsivabalan avatar Sep 22 '22 22:09 nsivabalan

CI report:

  • 3f29d7aa49c61327c8cb9157cecaf797cc7af405 Azure: FAILURE
Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

hudi-bot avatar Sep 27 '22 15:09 hudi-bot

I was working on the same idea before finding this PR. There is a case not taken into account in this PR.

Let's assume we have these configs:

  • the cleaning policy is KEEP_LATEST_FILE_VERSIONS
  • the incremental cleaning is activated
  • and the cleaner.fileversions.retained is X

if in the next clean we use a cleaner.fileversions.retained >= X, everything is fine, we just need to check the changed partitions since the last clean and clean them. But if in the next clean we change the cleaner.fileversions.retained to any value < X, we will not be able to use the incremental cleaning, and we will need to run the method getPartitionPathsForFullCleaning to recheck all the partitions.

So to decide if we can use the incremental cleaning or we need to run a brute force check, we need to save the cleaner.fileversions.retained value used in the last clean in the clean commit avro file.

hussein-awala avatar Nov 02 '22 23:11 hussein-awala