[HUDI-7404] Bloom execution improvements
Change Logs
- Avoids collecting intermediate results to a centralized place (the driver, for Spark) until they are truly required
- Exits early when there are no matches after the bloom filter check
- Lazily evaluates a count that is only used in one non-default code path of the Spark writer flow, avoiding the extra count cost (a sketch of the early-exit and lazy-count patterns follows this list)
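Below is a minimal, illustrative sketch of the two patterns named above: an early exit when the bloom filter yields no candidate keys, and a lazily evaluated count that is only computed on the path that needs it. Names such as `BloomFilterLike` and `lookUpKeysInFile` are hypothetical stand-ins, not Hudi classes.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Illustrative sketch only; not the actual Hudi implementation.
public class BloomLookupSketch {

  // Early exit: if the bloom filter rules out every key, skip the expensive
  // per-file key lookup entirely.
  static List<String> candidateKeysAfterBloomCheck(List<String> keysToCheck,
                                                   BloomFilterLike bloomFilter) {
    List<String> candidates = keysToCheck.stream()
        .filter(bloomFilter::mightContain)
        .collect(Collectors.toList());
    if (candidates.isEmpty()) {
      return Collections.emptyList();
    }
    return lookUpKeysInFile(candidates);
  }

  // Lazy count: the Supplier is only invoked on the non-default path, so the
  // default path never pays for the extra count job.
  static void writerFlow(Supplier<Long> lazyRecordCount, boolean nonDefaultPathEnabled) {
    if (nonDefaultPathEnabled) {
      long count = lazyRecordCount.get();
      System.out.println("record count = " + count);
    }
  }

  // Placeholders so the sketch compiles on its own.
  interface BloomFilterLike {
    boolean mightContain(String key);
  }

  static List<String> lookUpKeysInFile(List<String> candidates) {
    return candidates;
  }
}
```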
Impact
Improves the efficiency of the bloom filter index for Spark users
Risk level (write none, low, medium, or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
- The config description must be updated if new configs are added or the default values of existing configs are changed
- Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Open questions: I see that the code currently loads the ranges for all files in all of the affected partitions into a single object for range filtering. Should we try to leverage Spark to limit the evaluation to a single partition, or to some cluster of files within that partition?
Does it make sense to pull the evaluation of the bloom filter check into this step as well? Right now we read the footers twice, but if we can create a cluster of files for each key to evaluate against, we could read the range and the bloom filter at the same time and run the evaluation then for only the files the key may be part of (a rough sketch follows).
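To make the second question concrete, here is a hedged sketch of that idea under assumed, hypothetical types (`KeyRange`, `FileFooter`, `BloomFilterLike` are not Hudi APIs): read a file's key range and bloom filter from the footer once, then evaluate both checks for the keys mapped to that file in a single pass.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; the types below stand in for whatever a single footer read would return.
public class CombinedFooterCheckSketch {

  interface BloomFilterLike {
    boolean mightContain(String key);
  }

  // Min/max key range recorded in a file's footer.
  record KeyRange(String minKey, String maxKey) {
    boolean mayContain(String key) {
      return minKey.compareTo(key) <= 0 && key.compareTo(maxKey) <= 0;
    }
  }

  // Everything needed for the pruning decision, read from the footer once.
  record FileFooter(KeyRange range, BloomFilterLike bloom) {}

  // Single pass over the keys mapped to one file: cheap range prune first,
  // then the bloom filter check, without a second footer read.
  static List<String> candidateKeysForFile(FileFooter footer, List<String> keysForFile) {
    List<String> candidates = new ArrayList<>();
    for (String key : keysForFile) {
      if (footer.range().mayContain(key) && footer.bloom().mightContain(key)) {
        candidates.add(key);
      }
    }
    return candidates;
  }
}
```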
CI report:
- 86a6e24f202a76c316086b59fc69308c57631b4e UNKNOWN
- 68bf61a85db16d50aa0663be7652874baf30489c Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
hey @the-other-tim-brown : looks like there are some test failures. can you follow up?
All PRs have been failing recently due to a bad merge on master. I'll pick up the fix in my branch today.