[HUDI-7144] Build storage partition stats index and use it for data skipping

Open codope opened this issue 1 year ago • 3 comments

Change Logs

Build a storage partition stats index and use it for data skipping. The main changes are as follows:

  • The index is stored as another partition in the metadata table.
  • Each index entry is a key-value pair, where the key is hash(columnName).concat(hash(partitionName)) and the value is the stats.
  • New configs are in HoodieMetadataConfig; the writer changes are in HoodieBackedTableMetadataWriter, with some util methods in HoodieTableMetadataUtil.
  • On the read path, the main changes are in HoodieFileIndex. Partition pruning happens as usual first; then, depending on the data filters, data can be skipped further if the partition stats index is available.
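
To make the key layout and the read-path skipping concrete, here is a minimal Python sketch of the idea, not the code in this PR. The SHA-256 hash, the `ColumnStats` shape (integer min/max only), and the helper names `partition_stats_key`, `build_index`, and `prune_partitions` are all assumptions made for illustration; the actual hash function, stats payload, and the logic in HoodieFileIndex may differ.

```python
import hashlib
from dataclasses import dataclass


def _hash(name: str) -> str:
    # Assumption: any stable hash works for this illustration.
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:16]


def partition_stats_key(column_name: str, partition_name: str) -> str:
    # Mirrors hash(columnName).concat(hash(partitionName)) from the description.
    return _hash(column_name) + _hash(partition_name)


@dataclass(frozen=True)
class ColumnStats:
    # Hypothetical stats payload: only min/max of one integer column.
    min_value: int
    max_value: int


def build_index(file_ranges_by_partition, column_name):
    # Aggregate per-file (min, max) ranges into one entry per storage partition.
    index = {}
    for partition, file_ranges in file_ranges_by_partition.items():
        lows = [low for low, _ in file_ranges]
        highs = [high for _, high in file_ranges]
        key = partition_stats_key(column_name, partition)
        index[key] = ColumnStats(min(lows), max(highs))
    return index


def prune_partitions(index, column_name, partitions, lower, upper):
    # Keep a partition if its [min, max] range may overlap [lower, upper].
    # Partitions without an index entry are kept, since they cannot be ruled out.
    kept = []
    for partition in partitions:
        stats = index.get(partition_stats_key(column_name, partition))
        if stats is None or (stats.max_value >= lower and stats.min_value <= upper):
            kept.append(partition)
    return kept


# Usage: two partitions, filter "price BETWEEN 120 AND 130".
files = {"2023/12/17": [(1, 10), (5, 20)], "2023/12/18": [(100, 150)]}
index = build_index(files, "price")
candidates = prune_partitions(index, "price", list(files), 120, 130)
```

In this sketch only the second partition survives, because the aggregated range of the first partition (1 to 20) cannot contain values between 120 and 130. Keeping partitions that have no index entry is the conservative choice: skipping is only safe when the stats prove that no row can match.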

Impact

Stats are aggregated by storage partition, which makes data skipping more efficient. Meta sync need not sync the partition metadata. Queries will use the index while planning in the driver.

Risk level (write none, low, medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default values of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

codope • Dec 18 '23 11:12

I made one skim of the changes. Can you reply to my previous review comments, esp. on tests?

vinothchandar • Jan 16 '24 19:01

> I made one skim of the changes. Can you reply to my previous review comments, esp. on tests?

@vinothchandar Thanks for the review. I have addressed all your comments. The tests were passing two commits ago, and I am looking into the failures, but the PR is ready for review again.

codope • Jan 18 '24 01:01

CI report:

  • f63dbe172cf8dec2603c266396fb7d31d5cb7f60 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot • Apr 30 '24 16:04

Created an issue to track it: https://issues.apache.org/jira/browse/HUDI-7829

KnightChess • Jun 05 '24 08:06