hudi [HUDI-4760] Fixing repeated trigger of data file creations w/ clustering

Open nsivabalan opened this issue 3 years ago • 4 comments

Change Logs

During clustering, data file creation is triggered twice. The write statuses are not cached, and a validation step calls isEmpty on the JavaRDD<WriteStatus>, which re-triggers the write action.
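The double trigger described above can be illustrated without Spark. The sketch below is not Hudi or Spark code: it uses a plain java.util.function.Supplier as a stand-in for a lazy, uncached dataset, and the names LazyWriteSketch, lazyWrite, cached and runClustering are hypothetical. It also simplifies one point: here every "action" re-runs the whole write, whereas a real Spark isEmpty may evaluate only part of the RDD.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class LazyWriteSketch {
    // Counts how many times the "create data file" side effect runs.
    static final AtomicInteger filesCreated = new AtomicInteger();

    // Stand-in for an uncached lazy dataset: each get() re-runs the write.
    static Supplier<List<String>> lazyWrite(int numFileGroups) {
        return () -> IntStream.range(0, numFileGroups)
            .mapToObj(i -> {
                filesCreated.incrementAndGet(); // simulated file creation
                return "status-" + i;
            })
            .collect(Collectors.toList());
    }

    // Stand-in for caching: compute the result once and reuse it afterwards.
    static <T> Supplier<T> cached(Supplier<T> source) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = source.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    // Runs a validation "action" followed by a commit "action" and
    // returns how many simulated files were created in total.
    static int runClustering(boolean useCache) {
        filesCreated.set(0);
        Supplier<List<String>> statuses = lazyWrite(3);
        if (useCache) {
            statuses = cached(statuses);
        }
        boolean empty = statuses.get().isEmpty();       // validation
        int committed = empty ? 0 : statuses.get().size(); // commit
        if (committed < 0) {
            throw new IllegalStateException("unreachable");
        }
        return filesCreated.get();
    }

    public static void main(String[] args) {
        System.out.println("uncached files created: " + runClustering(false));
        System.out.println("cached files created:   " + runClustering(true));
    }
}
```

Without the cache, the simulated write runs once per action, so the three file groups are written twice. With the cache, the second action reuses the first result.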

Impact

Could improve clustering performance, since data files are no longer created twice.

Risk level: medium

Without the fix, the clustering write could be triggered twice, but only one set of files is included in the final commit metadata. The duplicate copy is deleted during the marker reconciliation step.

Test/Verification: Manually verified that without the fix, markers are created twice (the two files differ only in write token) and the later reconciliation step deletes one of them. With the fix, I don't see such duplicates. Only one file is created for clustering, and nothing gets deleted during reconciliation.
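The manual check above could be approximated by a small helper that flags files differing only in write token. This is a sketch, not Hudi code: it assumes file names of the form fileId_writeToken_instantTime.extension, and the class DuplicateFileCheck, the method findDuplicates and the sample file names in the comments are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DuplicateFileCheck {
    // Groups file names by (fileId, instantTime + extension), ignoring the
    // write token, and returns only the groups with more than one file.
    // Assumption: names look like fileId_writeToken_instantTime.extension.
    static Map<String, List<String>> findDuplicates(List<String> fileNames) {
        Map<String, List<String>> byKey = new TreeMap<>();
        for (String name : fileNames) {
            String[] parts = name.split("_");
            if (parts.length != 3) {
                continue; // skip names that do not match the assumed format
            }
            String key = parts[0] + "_" + parts[2];
            byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        // Keep only keys that map to two or more files.
        byKey.values().removeIf(group -> group.size() < 2);
        return byKey;
    }

    public static void main(String[] args) {
        List<String> files = List.of(
            "f1-0_1-0-1_20220901040900.parquet",
            "f1-0_2-0-2_20220901040900.parquet", // differs only in write token
            "f2-0_1-0-1_20220901040900.parquet");
        System.out.println(findDuplicates(files));
    }
}
```

An empty result after clustering would match the behavior observed with the fix, where only one file per file group is created.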

Contributor's checklist

[ ] Read through contributor's guide
[ ] Change Logs and Impact were stated clearly
[ ] Adequate tests were added if applicable
[ ] CI passed

Sep 01 '22 04:09 nsivabalan

@nsivabalan can we also add a test that after running clustering we don't have unexpected files in the table?

Sep 01 '22 16:09 alexeykudinkin

Addressed all comments.

Sep 03 '22 21:09 nsivabalan

CI is green.

Sep 19 '22 14:09 nsivabalan

CI report:

f9810e78fbc61c0dce902a74e426609c74788708 Azure: SUCCESS

Sep 24 '22 03:09 hudi-bot
