hudi
[HUDI-7235] Fix checkpoint bug for S3/GCS Incremental Source
Change Logs
Fix a bug in the checkpointing logic of the S3/GCS incremental source for the empty-dataset case. The bug arose as follows.
The 1st delta commit processed 3 files; its checkpoint is the adjusted end checkpoint in the log below.
23/12/06 16:55:26 INFO S3EventsHoodieIncrSource : Querying S3 with:00000000000000, queryInfo:Query information for Incremental Source queryType: snapshot, previousInstant: 00000000000000, startInstant: 00000000000000, endInstant: 20231206150423946, orderColumn: _hoodie_commit_time, keyColumn: s3.object.key, limitColumn: s3.object.size, orderByColumns: [_hoodie_commit_time, s3.object.key]
23/12/06 16:55:42 INFO S3EventsHoodieIncrSource : Adjusting end checkpoint:20231206150423946 based on sourceLimit :300000000
23/12/06 16:55:46 INFO S3EventsHoodieIncrSource : Adjusted end checkpoint :20231206150423946#ee-facts/0012_part_00.parquet
23/12/06 16:55:49 INFO S3EventsHoodieIncrSource : Total number of files to process :3
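The adjusted checkpoint in the log above appears to have the form `<commit time>#<object key>`. A minimal Python sketch of splitting such a value, assuming `#` is the separator and that a checkpoint without `#` carries no key. `parse_checkpoint` is a hypothetical helper for illustration only, not a Hudi API:

```python
from typing import Optional, Tuple


def parse_checkpoint(checkpoint: str) -> Tuple[str, Optional[str]]:
    """Split '<commit>#<key>' into (commit, key); key is None when absent.

    Assumption: '#' separates the commit time from the object key.
    """
    commit, sep, key = checkpoint.partition("#")
    return commit, (key if sep else None)


# Checkpoint values taken from the logs in this description.
print(parse_checkpoint("20231206150423946#ee-facts/0012_part_00.parquet"))
print(parse_checkpoint("20231206150423946"))
```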
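The 2nd delta commit was empty, and the checkpoint it returned was 20231206150423946. This is not a valid progression from the previous checkpoint, 20231206150423946#ee-facts/0012_part_00.parquet: a checkpoint should either stay equal or increase monotonically (in lexicographic order), but this one dropped the object-key suffix and therefore moved backwards.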
23/12/06 16:59:52 INFO S3EventsHoodieIncrSource : Querying S3 with:20231206150423946#ee-facts/0012_part_00.parquet, queryInfo:Query information for Incremental Source queryType: incremental, previousInstant: 00000000000000, startInstant: 20231206150423946, endInstant: 20231206150423946, orderColumn: _hoodie_commit_time, keyColumn: s3.object.key, limitColumn: s3.object.size, orderByColumns: [_hoodie_commit_time, s3.object.key]
23/12/06 16:59:53 INFO S3EventsHoodieIncrSource : Adjusting end checkpoint:20231206150423946 based on sourceLimit :300000000
23/12/06 16:59:55 INFO S3EventsHoodieIncrSource : Empty source, returning endpoint:20231206150423946
Because the 2nd commit's checkpoint was faulty, the 3rd commit read the same set of files again and wrote duplicate data.
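The regression can be reproduced with a plain string comparison. The sketch below is only an illustration of the lexicographic-ordering rule described above, not the change made in this PR; `next_checkpoint` is a hypothetical helper that assumes an empty batch should never move the checkpoint backwards:

```python
def next_checkpoint(previous: str, candidate: str) -> str:
    """Return candidate unless it sorts before previous lexicographically.

    Hypothetical guard for illustration; not Hudi's implementation.
    """
    return candidate if candidate >= previous else previous


# Values taken from the logs in this description.
previous = "20231206150423946#ee-facts/0012_part_00.parquet"
empty_batch_end = "20231206150423946"

# A strict prefix sorts before the longer string, so the checkpoint
# returned for the empty batch is smaller than the previous one.
assert empty_batch_end < previous

# With the guard, the empty batch keeps the previous checkpoint.
assert next_checkpoint(previous, empty_batch_end) == previous
```

With a guard of this kind, the next run would resume after `ee-facts/0012_part_00.parquet` rather than from the start of the commit.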
Impact
Describe any public API or user-facing feature change or any performance impact.
Risk level (write none, low, medium or high below)
Medium
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
None, this is a bug fix for an existing feature.
Contributor's checklist
- [x] Read through contributor's guide
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [ ] CI passed
@vinishjail97 : Can you address these comments and land it?
Updated the diff after addressing review comments.
CI report:
- de49a9da9db751d6fd6e0eaa1a750f8726a55018 UNKNOWN
- 1b754dffcc5dc2f82c62de06ed9d037ac201d194 Azure: SUCCESS