[HUDI-7239] add maxPendingClusteringPlanNums for clustering

Open LXin96 opened this issue 1 year ago • 5 comments

Change Logs

When async clustering starts, it currently generates a clustering plan as soon as the delta-commit condition is satisfied. However, the next clustering plan is scheduled even when the previous plan has not completed. Maybe we can support a parameter to control the maximum number of pending clustering plans.
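The proposed check can be sketched roughly as follows. This is a minimal illustration only, not Hudi's actual scheduling code: the class `PendingClusteringGate`, the method `shouldSchedule`, and the way the pending-plan count is passed in are all hypothetical, and the limit is assumed to mean "do not schedule a new plan while this many plans are still pending".

```java
// Hypothetical sketch of the proposed guard; none of these names exist in Hudi.
final class PendingClusteringGate {
    // Assumed semantics of the proposed maxPendingClusteringPlanNums setting:
    // the number of pending (not yet completed) plans at which scheduling stops.
    private final int maxPendingClusteringPlans;

    PendingClusteringGate(int maxPendingClusteringPlans) {
        if (maxPendingClusteringPlans < 1) {
            throw new IllegalArgumentException("maxPendingClusteringPlans must be >= 1");
        }
        this.maxPendingClusteringPlans = maxPendingClusteringPlans;
    }

    // Returns true only if the delta-commit condition is met AND the number of
    // pending clustering plans is still below the configured limit.
    boolean shouldSchedule(int deltaCommitsSinceLastPlan,
                           int deltaCommitThreshold,
                           int pendingClusteringPlans) {
        boolean enoughDeltaCommits = deltaCommitsSinceLastPlan >= deltaCommitThreshold;
        boolean belowPendingLimit = pendingClusteringPlans < maxPendingClusteringPlans;
        return enoughDeltaCommits && belowPendingLimit;
    }
}
```

With a limit of 1, a new plan would be scheduled only after the previous one has finished, which is the behavior requested in this PR. Without such a guard, only the delta-commit condition applies.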

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

LXin96 avatar Dec 18 '23 14:12 LXin96

CI report:

  • e191cc088f2231734256194b8e9c1bd69b17d475 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar Dec 18 '23 19:12 hudi-bot

Can you elaborate a little more on what we gain by introducing the param here?

danny0405 avatar Dec 19 '23 03:12 danny0405

OK. When I tested clustering, I expected that once a clustering plan is generated, the subtasks of that first plan would be processed first. However, I found that when the previous clustering plan's subtasks have not finished, the next clustering plan's subtasks also start to run. So I would like a param to control the number of clustering plans: when the previous plan's subtasks have not all finished, it would be better not to schedule the next one. This matters when the Flink job has a huge amount of data to cluster.

LXin96 avatar Dec 19 '23 09:12 LXin96

we should better not to schedule the next.

When the data volume is huge, wouldn't it be better to have more, smaller plans rather than a few much bigger ones?

danny0405 avatar Dec 20 '23 12:12 danny0405

we should better not to schedule the next.

When the data volume is huge, wouldn't it be better to have more, smaller plans rather than a few much bigger ones?

+1.

bvaradar avatar Dec 20 '23 21:12 bvaradar

Since the current behavior is expected, let me know if it is OK to close this PR.

Thanks, Balaji.V

bvaradar avatar Jan 23 '24 05:01 bvaradar

Closing this PR as the current behavior is expected. @LXin96 feel free to reopen the PR if you have better reasoning.

yihua avatar Mar 09 '24 16:03 yihua