
[SUPPORT] Avoid rescan file when clean

Open Ytimetravel opened this issue 1 year ago • 7 comments

Describe the problem you faced

Dear community, when using a COW table, I found that the clean operation may trigger an OOM error in the driver. This happens because the COW table rarely updates data, so there are usually no files that need to be cleaned. However, the clean operation is still invoked every time data is written, and once the number of files it lists reaches a certain threshold, it may cause an OOM.

For example:

  • instant1 instant2 instant3 instant4 (scans 1, 2, 3, 4 but cleans nothing)
  • instant1 instant2 instant3 instant4 instant5 (scans 1, 2, 3, 4, 5 but cleans nothing)

Would it be possible to add an empty clean instant as a marker, so that later cleans do not rescan everything each time? Or are there any better ideas? Looking forward to your reply~
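To make the pattern concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not Hudi's actual clean planner: the function name `instants_scanned` and the flag `record_empty_clean` are hypothetical, and it assumes that a clean plan scans every instant written since the last recorded clean.

```python
def instants_scanned(num_commits, record_empty_clean):
    """Toy model: how many instants each clean planning pass has to scan.

    Assumption (not Hudi code): planning scans every instant written since
    the last recorded clean, and no commit ever produces files to clean.
    """
    scanned = []
    last_marker = 0  # last instant covered by a recorded clean (real or empty)
    for commit in range(1, num_commits + 1):
        # Planning runs after every commit and looks at everything since the marker.
        scanned.append(commit - last_marker)
        if record_empty_clean:
            # Hypothetical: persist an empty clean instant so the next pass starts here.
            last_marker = commit
    return scanned


without_marker = instants_scanned(5, record_empty_clean=False)
with_marker = instants_scanned(5, record_empty_clean=True)
print(without_marker)  # grows with every commit
print(with_marker)     # stays constant
```

Without a marker the scan grows linearly with the number of commits, which matches the growing file listing described above; with a marker each pass only covers the newest instant.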

Environment Description

  • Hudi version: 0.14.0

Ytimetravel avatar Jul 18 '24 12:07 Ytimetravel

@Ytimetravel One way is to set https://hudi.apache.org/docs/configurations/#hoodiecleanmaxcommits to a higher value, so the cleaner doesn't get called after every commit.
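As a minimal sketch, the setting could be passed as part of the Hudi write options like this. The table name and the value `10` are placeholders chosen for illustration, not recommendations; only the config keys come from the Hudi configuration reference.

```python
# Hypothetical write options for a COW table; the table name and the
# numeric value are illustrative assumptions.
hudi_options = {
    "hoodie.table.name": "my_cow_table",  # placeholder name
    "hoodie.clean.automatic": "true",     # keep automatic cleaning enabled
    "hoodie.clean.max.commits": "10",     # attempt a clean only every 10 commits
}

for key, value in sorted(hudi_options.items()):
    print(f"{key}={value}")
```

With Spark, a dictionary like this would typically be supplied through `.options(**hudi_options)` on the DataFrame writer. This only lowers how often clean planning runs; each planning pass still does the same scan.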

ad1happy2go avatar Jul 18 '24 13:07 ad1happy2go

@ad1happy2go Thank you for your reply, but I don't think this solves the problem. Once the number of commits satisfies the clean condition, the same situation will still arise on subsequent writes.

Ytimetravel avatar Jul 19 '24 02:07 Ytimetravel

@Ytimetravel Because our COW table is rarely updated, it will not actually clean anything.

Ytimetravel avatar Jul 19 '24 02:07 Ytimetravel

@Ytimetravel You are correct. We have recently been considering improving the clean table service for the append-only use case: the clean table service would inspect the timeline instant actions to decide smartly whether to trigger clean planning.

Actually, Uber recently filed a PR for the MARKER idea: https://github.com/apache/hudi/pull/11605. We thought it was kind of hacky, so it was opposed.

danny0405 avatar Jul 22 '24 08:07 danny0405

@danny0405 Thanks a lot for the reply, I get it.

Ytimetravel avatar Jul 23 '24 09:07 Ytimetravel

@Ytimetravel Closing out this issue. Please reopen in case of any concerns. Thanks.

ad1happy2go avatar Aug 22 '24 09:08 ad1happy2go

Reopening this because it is high priority; let's close it once it is tackled.

danny0405 avatar Aug 22 '24 10:08 danny0405