[HUDI-7404] Bloom execution improvements
Change Logs
- Avoids collecting intermediate results to a centralized place (the driver, for Spark) until they are truly required
- Exits early when there are no matches after the bloom filter check
- Lazily evaluates a count that is only used in one non-default code path of the Spark writer flow, avoiding the extra count cost (a sketch of the early-exit and lazy-count patterns follows this list)
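Below is a minimal, illustrative sketch of the two patterns named above: an early exit when the bloom filter yields no candidate keys, and a lazily evaluated count that is only computed on the path that needs it. Names such as `BloomFilterLike` and `lookUpKeysInFile` are hypothetical stand-ins, not Hudi classes.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Illustrative sketch only; not the actual Hudi implementation.
public class BloomLookupSketch {

  // Early exit: if the bloom filter rules out every key, skip the expensive
  // per-file key lookup entirely.
  static List<String> candidateKeysAfterBloomCheck(List<String> keysToCheck,
                                                   BloomFilterLike bloomFilter) {
    List<String> candidates = keysToCheck.stream()
        .filter(bloomFilter::mightContain)
        .collect(Collectors.toList());
    if (candidates.isEmpty()) {
      return Collections.emptyList();
    }
    return lookUpKeysInFile(candidates);
  }

  // Lazy count: the Supplier is only invoked on the non-default path, so the
  // default path never pays for the extra count job.
  static void writerFlow(Supplier<Long> lazyRecordCount, boolean nonDefaultPathEnabled) {
    if (nonDefaultPathEnabled) {
      long count = lazyRecordCount.get();
      System.out.println("record count = " + count);
    }
  }

  // Placeholders so the sketch compiles on its own.
  interface BloomFilterLike {
    boolean mightContain(String key);
  }

  static List<String> lookUpKeysInFile(List<String> candidates) {
    return candidates;
  }
}
```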
Impact
Improves the efficiency of the bloom filter index for Spark users
Risk level (write none, low, medium, or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
- The config description must be updated if new configs are added or the default values of existing configs are changed
- Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Open questions: I see that the code currently loads the ranges for all files in all of the affected partitions into a single object for range filtering. Should we try to leverage Spark to limit the evaluation to a single partition, or to some cluster of files within that partition?
Does it make sense to pull the evaluation of the bloom filter check into this step as well? Right now we read the footers twice, but if we can create a cluster of files for each key to evaluate against, we could read the range and the bloom filter at the same time and run the evaluation then for only the files the key may be part of (a rough sketch follows).
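To make the second question concrete, here is a hedged sketch of that idea under assumed, hypothetical types (`KeyRange`, `FileFooter`, `BloomFilterLike` are not Hudi APIs): read a file's key range and bloom filter from the footer once, then evaluate both checks for the keys mapped to that file in a single pass.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; the types below stand in for whatever a single footer read would return.
public class CombinedFooterCheckSketch {

  interface BloomFilterLike {
    boolean mightContain(String key);
  }

  // Min/max key range recorded in a file's footer.
  record KeyRange(String minKey, String maxKey) {
    boolean mayContain(String key) {
      return minKey.compareTo(key) <= 0 && key.compareTo(maxKey) <= 0;
    }
  }

  // Everything needed for the pruning decision, read from the footer once.
  record FileFooter(KeyRange range, BloomFilterLike bloom) {}

  // Single pass over the keys mapped to one file: cheap range prune first,
  // then the bloom filter check, without a second footer read.
  static List<String> candidateKeysForFile(FileFooter footer, List<String> keysForFile) {
    List<String> candidates = new ArrayList<>();
    for (String key : keysForFile) {
      if (footer.range().mayContain(key) && footer.bloom().mightContain(key)) {
        candidates.add(key);
      }
    }
    return candidates;
  }
}
```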
CI report:
- 86a6e24f202a76c316086b59fc69308c57631b4e UNKNOWN
- 68bf61a85db16d50aa0663be7652874baf30489c Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
hey @the-other-tim-brown : looks like there are some test failures. can you follow up?
All PRs have been failing recently due to a bad merge on master. I'll pick up the fix in my branch today.