hudi
[HUDI-4760] Fixing repeated trigger of data file creations w/ clustering
Change Logs
During clustering, data file creation is triggered twice. The write statuses are not cached, and a validation step calls isEmpty on the JavaRDD<WriteStatus>, which re-triggers the write action.
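For illustration, the re-trigger can be reproduced without Spark. The sketch below is not the code changed in this PR: it uses a plain `Supplier` as a stand-in for a lazily evaluated `JavaRDD<WriteStatus>`, and every name in it (`LazyWriteDemo`, `lazyWrite`, `cached`, `validateThenCommit`) is hypothetical. It only shows why a validation call followed by a second read runs the write twice unless the result is cached first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyWriteDemo {

    // Stand-in for a lazy RDD of write statuses (hypothetical, not Hudi code):
    // every call to get() re-runs the "write" and creates a new data file.
    static Supplier<List<String>> lazyWrite(AtomicInteger filesCreated) {
        return () -> {
            int attempt = filesCreated.incrementAndGet();
            List<String> statuses = new ArrayList<>();
            statuses.add("file-1_writeToken-" + attempt);
            return statuses;
        };
    }

    // Stand-in for caching: computes the result once and serves it afterwards.
    static <T> Supplier<T> cached(Supplier<T> source) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = source.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    // Validation (isEmpty) followed by the commit step, which reads the statuses again.
    static int validateThenCommit(Supplier<List<String>> writeStatuses, AtomicInteger filesCreated) {
        if (writeStatuses.get().isEmpty()) {
            throw new IllegalStateException("no write statuses produced");
        }
        List<String> committed = writeStatuses.get(); // second evaluation
        if (committed.isEmpty()) {
            throw new IllegalStateException("nothing to commit");
        }
        return filesCreated.get();
    }

    static int filesCreatedWithoutCache() {
        AtomicInteger counter = new AtomicInteger();
        return validateThenCommit(lazyWrite(counter), counter);
    }

    static int filesCreatedWithCache() {
        AtomicInteger counter = new AtomicInteger();
        return validateThenCommit(cached(lazyWrite(counter)), counter);
    }

    public static void main(String[] args) {
        System.out.println("without cache: " + filesCreatedWithoutCache()); // prints 2
        System.out.println("with cache: " + filesCreatedWithCache());       // prints 1
    }
}
```

In Spark terms, the analogue of `cached` would be persisting the RDD before the first action runs on it.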
Impact
Could improve clustering performance, since data files are no longer written twice.
Risk level: medium
Without the fix, the clustering write could be triggered twice, but only one set of files is included in the final commit metadata. The duplicate copy is deleted during the marker reconciliation step.
Test/Verification: Manually verified that, without the fix, markers are created twice (the two files differ only in their write token) and the later reconciliation step deletes one of them. With the fix, no such duplicates appear: only one file is created for clustering, and nothing is deleted during reconciliation.
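The manual check above could be approximated with a small helper that groups file names while ignoring the write token. This is only a sketch: it assumes names of the form `<fileId>_<writeToken>_<instantTime>.<ext>`, and the class name, method names, and sample file names are hypothetical, not taken from this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DuplicateFileCheck {

    // Assumption: names look like <fileId>_<writeToken>_<instantTime>.<ext>.
    // Returns the name with the write token removed, so that two files
    // differing only in write token map to the same key.
    static String keyWithoutWriteToken(String fileName) {
        String[] parts = fileName.split("_");
        if (parts.length != 3) {
            throw new IllegalArgumentException("unexpected file name: " + fileName);
        }
        return parts[0] + "_" + parts[2];
    }

    // Returns only the keys that have more than one file, i.e. the duplicates.
    static Map<String, List<String>> findDuplicates(List<String> fileNames) {
        Map<String, List<String>> byKey = new TreeMap<>();
        for (String name : fileNames) {
            byKey.computeIfAbsent(keyWithoutWriteToken(name), k -> new ArrayList<>()).add(name);
        }
        byKey.values().removeIf(group -> group.size() < 2);
        return byKey;
    }

    public static void main(String[] args) {
        // Hypothetical listing: the first two names differ only in write token.
        List<String> listing = Arrays.asList(
                "f1_0-1-1_001.parquet",
                "f1_0-2-2_001.parquet",
                "f2_0-1-1_001.parquet");
        System.out.println(findDuplicates(listing).keySet()); // prints [f1_001.parquet]
    }
}
```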
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
@nsivabalan can we also add a test that after running clustering we don't have unexpected files in the table?
Addressed all comments.
CI is green

CI report:
- f9810e78fbc61c0dce902a74e426609c74788708 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- @hudi-bot run azure: re-run the last Azure build