hudi [HUDI-7239] add maxPendingClusteringPlanNums for clustering

Change Logs

When async clustering is running, a clustering plan is currently generated whenever the delta-commit condition is satisfied. However, the next clustering plan is scheduled even when the previous clustering plan has not completed. Maybe we can support a parameter to control the maximum number of pending clustering plans.
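
A minimal sketch of the proposed guard, for illustration only. It assumes the scheduler can count the clustering plans that are still pending, and it skips scheduling once that count reaches the configured maximum. The class, method, and field names below are hypothetical and are not Hudi APIs; the real change would hook into the existing clustering scheduling path.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names are real Hudi classes or configs.
final class PendingClusteringGuard {
    // Assumed limit, standing in for the proposed maxPendingClusteringPlanNums.
    private final int maxPendingClusteringPlans;
    // Instant times of clustering plans that are scheduled but not yet completed.
    private final List<String> pendingPlanInstants = new ArrayList<>();

    PendingClusteringGuard(int maxPendingClusteringPlans) {
        if (maxPendingClusteringPlans < 1) {
            throw new IllegalArgumentException("max pending plans must be >= 1");
        }
        this.maxPendingClusteringPlans = maxPendingClusteringPlans;
    }

    // Returns true and records the plan if the limit allows it, false otherwise.
    boolean trySchedule(String instantTime) {
        if (pendingPlanInstants.size() >= maxPendingClusteringPlans) {
            return false; // too many unfinished plans, skip this round
        }
        pendingPlanInstants.add(instantTime);
        return true;
    }

    // Marks a plan as completed so that a new one may be scheduled.
    void complete(String instantTime) {
        pendingPlanInstants.remove(instantTime);
    }

    int pendingCount() {
        return pendingPlanInstants.size();
    }
}
```

With a limit of 1, a second plan would be rejected until the first one completes, which is the behavior described above.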

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

[ ] Read through contributor's guide
[ ] Change Logs and Impact were stated clearly
[ ] Adequate tests were added if applicable
[ ] CI passed

Dec 18 '23 14:12 LXin96

CI report:

e191cc088f2231734256194b8e9c1bd69b17d475 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

Dec 18 '23 19:12 hudi-bot

Can you elaborate a little more on what we gain by introducing the param here?

Dec 19 '23 03:12 danny0405

OK. When I tested clustering, I expected that once a clustering plan is generated, its subtasks would be processed first. However, I found that when the previous clustering plan's subtasks are not finished, the next clustering plan's subtasks also start to run. So I would like a param to control the clustering plan count: when the previous clustering plan's subtasks are not all finished, we had better not schedule the next one. This matters when the Flink job has a huge amount of data to cluster.

Dec 19 '23 09:12 LXin96

we should better not to schedule the next.

When the data volume is huge, wouldn't it be better to have more, smaller plans than fewer, much bigger ones?

Dec 20 '23 12:12 danny0405

we should better not to schedule the next.

When the data volume is huge, wouldn't it be better to have more, smaller plans than fewer, much bigger ones?

+1.

Dec 20 '23 21:12 bvaradar

Since the current behavior is expected, let me know if it is OK to close this PR.

Thanks, Balaji.V

Jan 23 '24 05:01 bvaradar

Closing this PR as the current behavior is expected. @LXin96 feel free to reopen the PR if you have better reasoning.

Mar 09 '24 16:03 yihua
