
[SUPPORT] Duplicate records getting inserted at Hudi S3 sink

Open ArpitAdhikari opened this issue 1 year ago • 4 comments

Describe the problem you faced

We are using Hudi in our CDC pipeline, which consumes from MSK topics, and we are getting duplicate entries in our tables. To summarize the scenario: we are running Hudi v0.12.2 on emr-6.10.1, our data is captured from MySQL events that can be of insert or upsert type, and the Spark application itself looks fine. The tables in which we see duplicates are mostly MOR tables with the BLOOM index.
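
For illustration, the write path is roughly along the lines of the sketch below; the table path, record key, precombine, and partition fields are placeholders, and the actual Hudi configs are in the attachment further down.

```python
# Minimal sketch of a CDC upsert into a MOR table with a BLOOM index.
# Table path, record key, precombine, and partition fields are placeholders;
# the real configs are listed in Attachment.txt.
hudi_options = {
    "hoodie.table.name": "table_a",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.index.type": "BLOOM",
}

# cdc_batch_df: DataFrame of change events consumed from the MSK topic (placeholder).
(cdc_batch_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://bucket/path/table_a"))
```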

Expected behavior

Any update coming into the pipeline should update the already existing record.

Environment Description

The attachment here contains:

  1. Query being used to detect the duplicates
  2. Metafields
  3. Sample output of the query showing duplicates
  4. Hudi configs

Attachment.txt

  • Hudi version : 0.12.2

  • Spark version : Spark 3.3.1

  • Hive version : Hive 3.1.3

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : Yes

Additional context

  • EMR Version: emr-6.10.1

We have also tried removing the duplicates manually with a Python script and reran the job, but after running for a few days it started inserting duplicate records again.
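
The cleanup was roughly along the lines of the sketch below (placeholder names; not the exact script): keep only the latest version of each record key and rewrite the table.

```python
# Rough sketch of the manual de-duplication pass (placeholder names; not the
# exact script). Keeps the row with the latest _hoodie_commit_time for each
# record key and rewrites the table. Assumes an existing SparkSession `spark`
# and the `hudi_options` dict from the sketch above.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.read.format("hudi").load("s3://bucket/path/table_a")

latest_first = Window.partitionBy("_hoodie_record_key").orderBy(
    F.col("_hoodie_commit_time").desc())

deduped = (df.withColumn("_rn", F.row_number().over(latest_first))
             .filter(F.col("_rn") == 1)
             .drop("_rn")
             # Drop Hudi meta columns; they are regenerated on write.
             .drop(*[c for c in df.columns if c.startswith("_hoodie_")]))

(deduped.write.format("hudi")
    .options(**hudi_options)
    .option("hoodie.datasource.write.operation", "insert_overwrite_table")
    .mode("append")
    .save("s3://bucket/path/table_a"))
```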

Screenshot of the latest timeline:

[Screenshot 2024-08-27 at 3 33 43 PM]

ArpitAdhikari avatar Aug 27 '24 10:08 ArpitAdhikari

@ArpitAdhikari Can you check whether 20240807151109878 and 20240820153145763 exist in your timeline? Could you please provide your zipped .hoodie directory if possible?

ad1happy2go avatar Aug 27 '24 10:08 ad1happy2go

Since this was a production application, I had to remove the duplicates and rerun the job. I am trying to reproduce the scenario; once done, I will update here.

ArpitAdhikari avatar Aug 27 '24 11:08 ArpitAdhikari

Sure. @ArpitAdhikari Also look for any rollbacks or Spark retries. Can you use Hudi 0.12.3, as it has some more bug fixes?

ad1happy2go avatar Aug 27 '24 11:08 ad1happy2go

Hey @ad1happy2go, I have the same issue occurring for another table. I am attaching the result for the query below:

SELECT * FROM table_a WHERE _hoodie_record_key = '11111';

Output: queryOutput.csv
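
The broader duplicate check is roughly along these lines (a sketch; the table path is a placeholder, and the actual query is in the attachment above):

```python
# Sketch of a duplicate check: list record keys that appear more than once
# in the table snapshot. Assumes an existing SparkSession `spark`; the table
# path is a placeholder.
df = spark.read.format("hudi").load("s3://bucket/path/table_a")
df.createOrReplaceTempView("table_a")

spark.sql("""
    SELECT _hoodie_record_key, COUNT(*) AS cnt
    FROM table_a
    GROUP BY _hoodie_record_key
    HAVING COUNT(*) > 1
""").show(truncate=False)
```

With a non-global BLOOM index the same key can legitimately exist in different partitions, so grouping by _hoodie_partition_path as well helps separate expected cross-partition duplicates from true ones.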

Timeline Zipped: hoodie_timeline.zip

Also, I checked the timeline, and both of those instants are present in it.

ArpitAdhikari avatar Aug 28 '24 07:08 ArpitAdhikari

Hey @ad1happy2go, can you confirm whether this issue has been fixed in Hudi v0.14.1? We are running on emr-7.1.0, Hive 3.1.3, Spark 3.5.0.

ArpitAdhikari avatar Aug 30 '24 08:08 ArpitAdhikari

@ArpitAdhikari There was no open issue related to this with BLOOM. There were a couple of issues: one related to GLOBAL_BLOOM and another related to hoodie.bloom.index.use.metadata being set to true. I reviewed the timeline and I see you are using async cleaning. Are you setting a lock provider for it? I can see that the duplicates were created by subsequent compaction commits.

ad1happy2go avatar Sep 05 '24 10:09 ad1happy2go

Can you try with 0.14.1 or 0.15.0 to confirm whether you still see the issue?

ad1happy2go avatar Sep 05 '24 10:09 ad1happy2go

@ad1happy2go We recently moved to Hudi 0.14.1 to avoid this issue, but just today we saw the duplicates again in a few of the tables. To answer your question, yes, we use async cleaning with the below properties:

"hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL"
"hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.InProcessLockProvider"
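
For completeness, a sketch of how these sit alongside the async-cleaning settings; the failed-writes policy and async-clean flag below are assumptions for illustration, not copied from the actual job configs:

```python
# Concurrency-related writer options discussed in this thread.
# The failed-writes policy and async-clean flag are assumptions for
# illustration, not copied from the actual job configs.
concurrency_options = {
    "hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.clean.async": "true",
}
```

Worth noting that InProcessLockProvider only serializes writers within a single JVM; concurrent writers from separate applications would need a distributed lock provider (e.g. ZooKeeper- or DynamoDB-based).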

ArpitAdhikari avatar Sep 05 '24 11:09 ArpitAdhikari

@ArpitAdhikari Thanks, I just reviewed your configs too and it all looks good to me, so this issue might still be there. I tried reproducing it but was not able to.

Can we get on a call to look into this further?

ad1happy2go avatar Sep 05 '24 11:09 ad1happy2go

@ArpitAdhikari Sure, let's coordinate on Slack.

ad1happy2go avatar Sep 05 '24 12:09 ad1happy2go

Hi @ArpitAdhikari

I hope this issue is now resolved. If you are still encountering problems, please let me know so we can schedule a quick call to troubleshoot. I'll keep this ticket open for one more week and close it if there's no further response. Feel free to reopen it if the issue becomes reproducible again in the future.

rangareddy avatar Oct 30 '25 08:10 rangareddy

Closing this issue because the user doesn't have any follow-up questions.

rangareddy avatar Nov 05 '25 13:11 rangareddy

@rangareddy - We are facing the same issue. Would you be able to help us?

somanath-goudar avatar Nov 11 '25 11:11 somanath-goudar