[HUDI-7239] add maxPendingClusteringPlanNums for clustering
Change Logs
When async clustering starts, a clustering plan is generated whenever the delta-commit condition is satisfied. However, even when the last clustering plan has not completed, the next clustering plan is still scheduled. We could support a parameter to control the maximum number of pending clustering plans.
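A minimal sketch of the proposed guard, assuming a single integer limit (the name `maxPendingClusteringPlans` is illustrative here, not a committed Hudi config key): before scheduling a new plan, compare the number of plans that are scheduled but not yet completed against the limit.

```java
/**
 * Illustrative guard for the proposed config. Not Hudi's actual
 * implementation; the config name and the way the pending count is
 * obtained are assumptions for the sketch.
 */
public class ClusteringScheduleGuard {

  private final int maxPendingClusteringPlans; // hypothetical config value

  public ClusteringScheduleGuard(int maxPendingClusteringPlans) {
    this.maxPendingClusteringPlans = maxPendingClusteringPlans;
  }

  /**
   * @param pendingPlanCount clustering plans already scheduled but not yet
   *                         completed, e.g. counted from the table's
   *                         pending replacecommit instants
   * @return true if a new clustering plan may be scheduled this round
   */
  public boolean canScheduleNewPlan(int pendingPlanCount) {
    return pendingPlanCount < maxPendingClusteringPlans;
  }
}
```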
Impact
Describe any public API or user-facing feature change or any performance impact.
Risk level (write none, low, medium or high below)
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
- The config description must be updated if new configs are added or the default values of the configs are changed
- Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
CI report:
- e191cc088f2231734256194b8e9c1bd69b17d475 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure`: re-run the last Azure build
Can you elaborate a little more on what we gain by introducing the param here?
OK. When I tested clustering, I expected that after a clustering plan is generated, we process that plan's subtasks first. However, I found that when the last clustering plan's subtasks are not finished, the next clustering plan's subtasks also start to run. So I hope a parameter can control the number of pending clustering plans: when the last clustering plan's subtasks are not all finished, we had better not schedule the next one. This matters when a Flink job has a huge amount of data to cluster.
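To make the request concrete, here is how such a guard could be applied in a scheduling round. This is an illustrative sketch, not Hudi's actual scheduler; both helper methods are hypothetical placeholders.

```java
/**
 * Illustrative driver showing the guard in a scheduling round. The two
 * helper methods stand in for "count pending clustering instants on the
 * timeline" and "generate the next clustering plan".
 */
public class AsyncClusteringRound {

  // Allow at most 2 pending plans (value chosen for illustration only).
  private final ClusteringScheduleGuard guard = new ClusteringScheduleGuard(2);

  void maybeSchedule() {
    int pending = pendingClusteringPlanCount();
    if (guard.canScheduleNewPlan(pending)) {
      scheduleClusteringPlan();
    }
    // Otherwise skip this round: the previous plans' subtasks are still running.
  }

  int pendingClusteringPlanCount() { return 0; } // hypothetical stub
  void scheduleClusteringPlan() {}               // hypothetical stub
}
```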
> we had better not schedule the next one.

When the data volume is huge, wouldn't it be better to have more, smaller plans rather than fewer, much bigger ones?
+1.
Since the current behavior is expected, let me know if it is OK to close this PR.
Thanks, Balaji.V
Closing this PR as the current behavior is expected. @LXin96 feel free to reopen the PR if you have better reasoning.