[HUDI-7958] Create partition stats index for all columns when no cols specified

Open codope opened this issue 1 year ago • 3 comments

Change Logs

Just like the column stats index, we can create the partition stats index for all columns if the user has not configured any columns.

Impact

Users don't necessarily have to configure columns to aggregate stats at partition level.

Risk level (write none, low, medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default values of configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

codope avatar Jul 05 '24 15:07 codope

I really feel we should cut down on the number of columns we generate stats for out of the box. I have seen OSS users give col stats a try and, because it takes a long time to populate col stats when their schema is wide, give up on it. They don't know why it's slow, just that the experience is not good, so they disable col stats and move on.

nsivabalan avatar Jul 05 '24 15:07 nsivabalan

This PR makes the col_stats and partition_stats indexes consistent in which columns they index. Before this PR, if no value was specified in hoodie.metadata.index.column.stats.column.list, column stats were generated for all columns, while partition stats were not generated at all.
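To make the intended behavior concrete, here is a minimal sketch of the column selection rule described above. It is not Hudi's actual implementation: the helper `columns_to_index` and the plain-dict config are hypothetical, and silently skipping configured names that are not in the schema is an assumption made for illustration. Only the config key comes from this PR.

```python
from typing import Dict, List

# Config key quoted in this PR; everything else in this sketch is hypothetical.
COLUMN_LIST_KEY = "hoodie.metadata.index.column.stats.column.list"


def columns_to_index(table_config: Dict[str, str], schema_columns: List[str]) -> List[str]:
    """Return the columns to build stats for (illustrative helper, not a Hudi API)."""
    raw = table_config.get(COLUMN_LIST_KEY, "")
    # Parse the comma-separated list, dropping blanks and surrounding whitespace.
    configured = [name.strip() for name in raw.split(",") if name.strip()]
    if not configured:
        # Nothing configured: fall back to every column in the schema.
        return list(schema_columns)
    # Assumption: keep schema order and skip configured names missing from the schema.
    wanted = set(configured)
    return [name for name in schema_columns if name in wanted]


schema = ["id", "ts", "price"]
print(columns_to_index({}, schema))                              # ['id', 'ts', 'price']
print(columns_to_index({COLUMN_LIST_KEY: "price, id"}, schema))  # ['id', 'price']
```

With this PR, the same rule would apply to both indexes, so an empty column list yields stats for all columns in either case.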

We can cut down the number of columns for which column stats are generated by default, but that should be tackled in a separate PR.

yihua avatar Jul 26 '24 01:07 yihua

CI report:

  • 3854e3a27aa07675804e1e89b9bb5dd82c2c6485 Azure: SUCCESS

hudi-bot avatar Sep 05 '24 13:09 hudi-bot