[HUDI-7958] Create partition stats index for all columns when no cols specified
Change Logs
Just like the column stats index, we can create the partition stats index for all columns if the user has not configured any columns.
Impact
Users don't necessarily have to configure columns to aggregate stats at partition level.
Risk level (write none, low medium or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
- The config description must be updated if new configs are added or the default value of the configs are changed
- Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
I really feel we should cut down on the number of columns we generate stats for out of the box. I have seen OSS users give col stats a try, and because populating col stats takes a long time when their schema is wide, they give up on it. They don't know why it's slow, just that the experience is not good, so they disable col stats and move on.
This PR makes the behavior of the col_stats and partition_stats indexes consistent in terms of which columns are indexed. Before this PR, if no value was specified in hoodie.metadata.index.column.stats.column.list, column stats were generated for all columns, while partition stats were not generated at all.
We can cut down the number of columns for which column stats are generated by default. That should be tackled in a separate PR.
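For readers skimming the thread, the fallback rule discussed here can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `IndexColumnResolver` and `resolve` are hypothetical names, the config value is modeled as a plain comma-separated string, and the real implementation may apply further filtering (for example by column type) that is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Hudi: it only illustrates the fallback rule.
final class IndexColumnResolver {
    private IndexColumnResolver() {}

    /**
     * Returns the columns to index.
     *
     * @param configuredList models the raw value of
     *                       hoodie.metadata.index.column.stats.column.list
     *                       (comma-separated; may be null or blank)
     * @param tableColumns   all column names in the table schema
     */
    static List<String> resolve(String configuredList, List<String> tableColumns) {
        if (configuredList == null || configuredList.trim().isEmpty()) {
            // Nothing configured: fall back to every column in the schema.
            return new ArrayList<>(tableColumns);
        }
        // Otherwise use exactly the configured columns, ignoring stray whitespace
        // and empty entries such as a trailing comma.
        return Arrays.stream(configuredList.split(","))
                .map(String::trim)
                .filter(name -> !name.isEmpty())
                .collect(Collectors.toList());
    }
}
```

Under this sketch, both the column stats and partition stats indexes would call the same resolver, which is what keeps their column selection consistent.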
CI report:
- 3854e3a27aa07675804e1e89b9bb5dd82c2c6485 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- @hudi-bot run azure: re-run the last Azure build