
[HUDI-8196] Support pruning based on partition stats index in Hudi Flink

Open cshuo opened this issue 1 year ago • 1 comments

Change Logs

This PR introduces a new partition pruner for Flink source based on the Partition Stats Index.

Before this PR, the Flink source (batch or streaming) uses the pushed-down partition filters to build a partition pruner that filters out irrelevant partitions. The Column Stats Index is then used to build a data pruner for file-level data skipping. HUDI-7144 introduced partition-level column stats, so we can use those stats to prune partitions in the same way files are pruned.
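
To illustrate the idea, here is a minimal sketch of min/max based partition pruning for a single `col > threshold` filter. It is not the code in this PR: `PartitionStatsPruningSketch`, `ColumnRange` and `pruneGreaterThan` are hypothetical names, and the stats are assumed to be already loaded into a plain map keyed by partition path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Hudi.
final class PartitionStatsPruningSketch {

  /** Min/max of one column, aggregated over all files of a partition. */
  static final class ColumnRange {
    final long min;
    final long max;

    ColumnRange(long min, long max) {
      this.min = min;
      this.max = max;
    }
  }

  /**
   * Keeps the partitions that may contain rows with column > threshold.
   * A partition is dropped only when its max value proves no row can match.
   */
  static List<String> pruneGreaterThan(Map<String, ColumnRange> statsByPartition, long threshold) {
    List<String> kept = new ArrayList<>();
    for (Map.Entry<String, ColumnRange> entry : statsByPartition.entrySet()) {
      ColumnRange range = entry.getValue();
      // Missing stats: stay conservative and keep the partition.
      if (range == null || range.max > threshold) {
        kept.add(entry.getKey());
      }
    }
    return kept;
  }
}
```

The important property is that pruning is conservative: a partition is skipped only when its stats rule out every row, so a partition without stats is always read.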

Main changes:

  • Add utilities to fetch Partition Stats Index data for Flink source.
  • Add a new partition pruner ColumnStatsPartitionPruner.
  • Add a new config read.partition.data.skipping.enabled to enable pruning based on partition stats, false by default.
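
As a rough usage sketch, the new option could be passed like any other table option of the Flink source. The helper class below is hypothetical and the table path is a placeholder. Only `read.partition.data.skipping.enabled` comes from this PR; whether other options (for example on the metadata table side) also have to be set is an assumption left open here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, only meant to show where the new option goes.
final class PartitionSkippingOptionsSketch {

  /** Table options for a Hudi source with partition stats pruning switched on. */
  static Map<String, String> sourceOptions(String tablePath) {
    Map<String, String> options = new LinkedHashMap<>();
    options.put("connector", "hudi");
    options.put("path", tablePath);
    // Option added by this PR; it is false by default.
    options.put("read.partition.data.skipping.enabled", "true");
    return options;
  }

  /** Renders the options as the WITH (...) clause of a Flink SQL CREATE TABLE. */
  static String toWithClause(Map<String, String> options) {
    StringJoiner joiner = new StringJoiner(",\n  ", "WITH (\n  ", "\n)");
    for (Map.Entry<String, String> entry : options.entrySet()) {
      joiner.add("'" + entry.getKey() + "' = '" + entry.getValue() + "'");
    }
    return joiner.toString();
  }
}
```

The rendered clause can be appended to a `CREATE TABLE` statement, or the same key can be set on a per-query basis if the deployment already uses dynamic table options.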

Impact

Enhance the data skipping ability for Flink source by introducing a new partition pruner based on Partition Stats Index.

Risk level (write none, low, medium or high below)

low

Documentation Update

none

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

cshuo, Oct 21 '24 02:10

@danny0405 please take another look, thanks.

cshuo, Oct 23 '24 01:10

CI report:

  • efde4491a94d517528c04c94449be0aee37d262e Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot, Oct 25 '24 03:10