hudi
[SUPPORT] Avoid rescanning files when cleaning
Describe the problem you faced
Dear community, when using a COW table, I found that the clean operation may trigger an OOM error in the driver. This is because the COW table rarely updates data, so there are usually no files that need to be cleaned. However, the clean operation is still called every time data is written, and once the number of listed files reaches a certain threshold, it may cause an OOM.
For example:
instant1 instant2 instant3 instant4 (scans 1, 2, 3, 4 but cleans nothing)
instant1 instant2 instant3 instant4 instant5 (scans 1, 2, 3, 4, 5 but cleans nothing)
Is it possible to add an empty clean instant to mark it, to avoid rescanning every time, or are there any better ideas? Looking forward to your reply~
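To make the question concrete, here is a toy model of the two behaviours. It is not Hudi's cleaner code: instants are plain integers, and the "empty clean marker" is only the hypothetical checkpoint proposed above.

```python
# Toy model only: NOT Hudi's cleaner. Instants are represented as integers.

def scanned_without_marker(num_commits):
    """Every clean attempt rescans all instants from the first one."""
    scans = []
    for latest in range(1, num_commits + 1):
        scans.append(list(range(1, latest + 1)))
    return scans


def scanned_with_marker(num_commits):
    """Every clean attempt leaves a (hypothetical) empty clean marker,
    so the next attempt only scans instants written after it."""
    scans = []
    last_checked = 0  # no marker yet
    for latest in range(1, num_commits + 1):
        scans.append(list(range(last_checked + 1, latest + 1)))
        last_checked = latest  # the empty clean instant acts as a checkpoint
    return scans


if __name__ == "__main__":
    total_without = sum(len(s) for s in scanned_without_marker(5))
    total_with = sum(len(s) for s in scanned_with_marker(5))
    print(total_without, total_with)
```

In this model the total scan work grows quadratically with the number of commits when nothing is ever cleaned, and linearly once each attempt is checkpointed.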
Environment Description
- Hudi version: 0.14.0
@Ytimetravel One way is to set https://hudi.apache.org/docs/configurations/#hoodiecleanmaxcommits to a higher value, so the cleaner doesn't get called after every commit.
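A minimal sketch of what that could look like as writer options, assuming a PySpark writer. The value `10` is an arbitrary illustration, not a recommendation, and `df` and `base_path` in the commented usage line are placeholders.

```python
# Writer options as a plain dict; only hoodie.clean.max.commits is the
# setting suggested above. The value "10" is an assumed example.
hudi_options = {
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    # Attempt to schedule a clean only after this many commits
    # since the last clean, instead of after every commit.
    "hoodie.clean.max.commits": "10",
}

# Assumed usage with an existing DataFrame `df` and table path `base_path`:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

This only reduces how often the clean is attempted; each attempt still performs the same scan.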
@ad1happy2go Thank you for your reply, but I think that this does not solve the problem. If the number of commits satisfies the clean condition, a similar case will still arise with subsequent writes.
@Ytimetravel Because our COW table is rarely updated, the clean will not actually remove anything.
@Ytimetravel You are correct. We have recently been considering improving the clean table service for the append-only use case: the clean table service would inspect the timeline instant actions to make a smart decision about whether to trigger clean planning.
Actually, Uber recently opened a PR for the MARKER idea: https://github.com/apache/hudi/pull/11605. We thought it was kind of hacky, so it was opposed.
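As a rough illustration of the "inspect the timeline" idea, the sketch below decides whether clean planning is worth triggering. Everything here is hypothetical: the `Instant` class, its `updated_files` field, and `should_plan_clean` are not Hudi APIs, and the rule is deliberately simplified (it ignores the retention policy).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instant:
    time: str           # instant timestamp, e.g. "001"
    action: str         # e.g. "commit", "replacecommit", "clean"
    updated_files: int  # hypothetical: existing file groups that got a new version


def should_plan_clean(timeline: List[Instant]) -> bool:
    """Return True only if something written since the last clean
    could have produced an older file version to remove."""
    last_clean_idx = -1
    for i, inst in enumerate(timeline):
        if inst.action == "clean":
            last_clean_idx = i
    since_last_clean = timeline[last_clean_idx + 1:]
    return any(
        inst.action == "replacecommit" or inst.updated_files > 0
        for inst in since_last_clean
    )


# Append-only writes: nothing was updated, so planning can be skipped.
append_only = [Instant("001", "commit", 0), Instant("002", "commit", 0)]
# One commit rewrote existing file groups, so planning is worthwhile.
with_update = append_only + [Instant("003", "commit", 2)]
```

With a check like this, an append-only table would skip the file listing entirely instead of scanning and finding nothing to clean.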
@danny0405 Thanks a lot for the reply, I get it.
@Ytimetravel Closing out this issue. Please reopen in case of any concerns. Thanks.
Reopening this because it is of high priority; let's close it when it is tackled.