[HUDI-7235] Fix checkpoint bug for S3/GCS Incremental Source

Open vinishjail97 opened this issue 8 months ago • 1 comments

Change Logs

Fix a bug in the checkpointing logic of the S3/GCS incremental source in the empty-dataset case. The bug arose as follows.

The 1st delta commit processed 3 files; its adjusted end checkpoint was 20231206150423946#ee-facts/0012_part_00.parquet.

23/12/06 16:55:26 INFO S3EventsHoodieIncrSource  : Querying S3 with:00000000000000, queryInfo:Query information for Incremental Source queryType: snapshot, previousInstant: 00000000000000, startInstant: 00000000000000, endInstant: 20231206150423946, orderColumn: _hoodie_commit_time, keyColumn: s3.object.key, limitColumn: s3.object.size, orderByColumns: [_hoodie_commit_time, s3.object.key]

23/12/06 16:55:42 INFO S3EventsHoodieIncrSource  : Adjusting end checkpoint:20231206150423946 based on sourceLimit :300000000

23/12/06 16:55:46 INFO S3EventsHoodieIncrSource  : Adjusted end checkpoint :20231206150423946#ee-facts/0012_part_00.parquet

23/12/06 16:55:49 INFO S3EventsHoodieIncrSource  : Total number of files to process :3

The 2nd delta commit was empty, and the checkpoint it returned was 20231206150423946. This is not a valid checkpoint progression: a checkpoint should either stay equal or increase monotonically (in lexicographical order), and 20231206150423946 sorts before the previous checkpoint 20231206150423946#ee-facts/0012_part_00.parquet.

23/12/06 16:59:52 INFO S3EventsHoodieIncrSource  : Querying S3 with:20231206150423946#ee-facts/0012_part_00.parquet, queryInfo:Query information for Incremental Source queryType: incremental, previousInstant: 00000000000000, startInstant: 20231206150423946, endInstant: 20231206150423946, orderColumn: _hoodie_commit_time, keyColumn: s3.object.key, limitColumn: s3.object.size, orderByColumns: [_hoodie_commit_time, s3.object.key]

23/12/06 16:59:53 INFO S3EventsHoodieIncrSource  : Adjusting end checkpoint:20231206150423946 based on sourceLimit :300000000

23/12/06 16:59:55 INFO S3EventsHoodieIncrSource  : Empty source, returning endpoint:20231206150423946
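
The ordering rule for checkpoints can be checked with a plain string comparison. The snippet below is a minimal sketch in Python, not code from Hudi; the function name `is_valid_progression` is hypothetical, and the two checkpoint strings are taken from the logs above.

```python
def is_valid_progression(previous: str, current: str) -> bool:
    """Return True if `current` is equal to, or lexicographically after, `previous`."""
    return current >= previous


# Checkpoint recorded by the 1st delta commit (from the log above).
first_commit = "20231206150423946#ee-facts/0012_part_00.parquet"
# Checkpoint returned by the empty 2nd delta commit (from the log above).
second_commit = "20231206150423946"

# An unchanged checkpoint is an acceptable progression.
print(is_valid_progression(first_commit, first_commit))
# The bare instant is a strict prefix of the "instant#key" form, so it sorts
# before it and the progression check fails.
print(is_valid_progression(first_commit, second_commit))
```

Because the bare instant is a prefix of the `instant#key` form, dropping the `#key` suffix always moves the checkpoint backwards in this ordering.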

Because the previous commit's checkpoint was faulty, the 3rd commit read the same set of files again and wrote duplicate data.
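
To illustrate why a checkpoint without its key suffix can cause a re-read, here is a toy model in Python. It is a sketch under stated assumptions, not Hudi's actual query logic: it assumes files are ordered by (commit time, object key) and that resuming keeps only files strictly after the checkpoint. The helper names and the first two object keys are hypothetical; only `ee-facts/0012_part_00.parquet` appears in the logs.

```python
from typing import List, Tuple

FileEntry = Tuple[str, str]  # (commit_time, object_key)


def parse_checkpoint(checkpoint: str) -> Tuple[str, str]:
    """Split 'instant#key' into its parts; a bare instant yields an empty key."""
    instant, _, key = checkpoint.partition("#")
    return instant, key


def files_after(checkpoint: str, files: List[FileEntry]) -> List[FileEntry]:
    """Assumed resume filter: keep files strictly after (instant, key)."""
    position = parse_checkpoint(checkpoint)
    return [f for f in sorted(files) if f > position]


# Three files written at the same commit time. The first two keys are made up
# for illustration; the last one is the key seen in the logs.
FILES: List[FileEntry] = [
    ("20231206150423946", "ee-facts/0010_part_00.parquet"),
    ("20231206150423946", "ee-facts/0011_part_00.parquet"),
    ("20231206150423946", "ee-facts/0012_part_00.parquet"),
]

GOOD = "20231206150423946#ee-facts/0012_part_00.parquet"
FAULTY = "20231206150423946"

# Resuming from the full checkpoint skips every file already processed.
print(files_after(GOOD, FILES))
# Resuming from the bare instant loses the key position, so all three files
# at that commit time are selected again.
print(files_after(FAULTY, FILES))
```

Under this model, the key suffix is what records how far into a commit the source has read, so a checkpoint that drops it replays the whole commit.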

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low medium or high below)

Medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

None, this is a bug fix for an existing feature.

Contributor's checklist

  • [x] Read through contributor's guide
  • [x] Change Logs and Impact were stated clearly
  • [x] Adequate tests were added if applicable
  • [ ] CI passed

vinishjail97 avatar Dec 15 '23 16:12 vinishjail97

@vinishjail97 : Can you address these comments and land it.

bvaradar avatar Apr 01 '24 18:04 bvaradar

@vinishjail97 : Can you address these comments and land it.

Updated the diff after addressing review comments.

bvaradar avatar Apr 03 '24 06:04 bvaradar

CI report:

  • de49a9da9db751d6fd6e0eaa1a750f8726a55018 UNKNOWN
  • 1b754dffcc5dc2f82c62de06ed9d037ac201d194 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar Apr 23 '24 06:04 hudi-bot