[SUPPORT] Duplicate records getting inserted at Hudi S3 sink
**Describe the problem you faced**
We are using Hudi in our CDC pipeline, which consumes from MSK topics, and we are getting duplicate entries in our tables. To summarize the scenario: we are on Hudi v0.12.2 on emr-6.10.1, the data is captured from MySQL change events (which can be inserts or upserts), and the Spark application itself looks healthy. The tables we are seeing duplicates in are mostly MOR tables with the BLOOM index.
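For context, the writer is configured roughly along the lines of the sketch below. This is a minimal sketch rather than our exact job: the table name, source path, record key, precombine, and partition fields are placeholders, and only the options relevant to this report (MOR table type, BLOOM index, upsert operation) reflect our setup.

```python
# Minimal PySpark sketch of the upsert path (placeholder table/field names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-hudi-writer").getOrCreate()

hudi_options = {
    "hoodie.table.name": "table_a",                             # placeholder
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",            # placeholder
    "hoodie.datasource.write.precombine.field": "updated_at",   # placeholder
    "hoodie.datasource.write.partitionpath.field": "dt",        # placeholder
    "hoodie.index.type": "BLOOM",
}

# cdc_df represents one micro-batch of MySQL change events consumed from MSK.
cdc_df = spark.read.json("s3://bucket/example-batch/")          # placeholder source

(cdc_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://bucket/path/to/table_a"))                       # placeholder path
```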
**Expected behavior**
Any update coming into the pipeline should update the already existing record.
The attachment here contains:
- Query being used to detect the duplicates
- Metafields
- Sample output of the query showing duplicates
- Hudi configs (Attachment.txt)

**Environment Description**
- Hudi version : 0.12.2
- Spark version : Spark 3.3.1
- Hive version : Hive 3.1.3
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : Yes
**Additional context**
- EMR Version: emr-6.10.1
We have also tried removing the duplicates manually with a Python script and reran the job, but after running for a few days it started inserting duplicate records again.
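For illustration only, a cleanup of that kind might look roughly like the sketch below: read the table, keep the latest row per record key by the precombine field, and rewrite the table. The path, key, and column names are placeholders, and this is not necessarily what our script did.

```python
# Hedged sketch: rewrite the table keeping only the latest row per record key.
# All names (path, table name, key column, precombine column) are placeholders.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hudi-dedup").getOrCreate()

base_path = "s3://bucket/path/to/table_a"   # placeholder
df = spark.read.format("hudi").load(base_path)

# Keep the newest copy of each record key, using the precombine field as the tiebreaker.
w = Window.partitionBy("_hoodie_record_key").orderBy(F.col("updated_at").desc())
deduped = (df.withColumn("_rn", F.row_number().over(w))
             .filter("_rn = 1")
             .drop("_rn")
             .drop("_hoodie_commit_time", "_hoodie_commit_seqno",
                   "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name"))

# Overwrite the table contents with the deduplicated snapshot.
(deduped.write.format("hudi")
    .option("hoodie.table.name", "table_a")                            # placeholder
    .option("hoodie.datasource.write.operation", "insert_overwrite_table")
    .option("hoodie.datasource.write.recordkey.field", "id")           # placeholder
    .option("hoodie.datasource.write.precombine.field", "updated_at")  # placeholder
    .option("hoodie.datasource.write.partitionpath.field", "dt")       # placeholder
    .mode("append")
    .save(base_path))
```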
Screenshot of the latest timeline:
@ArpitAdhikari Can you check whether 20240807151109878 and 20240820153145763 exist in your timeline? Can you also provide your zipped .hoodie directory if possible?
Since this was a production application, I had to remove the duplicates and rerun the job. I am trying to reproduce the scenario; once done, I will update here.
Sure. @ArpitAdhikari Also look for any rollbacks or Spark retries. Can you try Hudi 0.12.3, as it has some more bug fixes?
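For anyone checking this on S3, one quick way to look for those instants and any rollback files is to list the .hoodie prefix. A minimal sketch, assuming boto3 and placeholder bucket/prefix values:

```python
# Hedged sketch: list Hudi timeline files on S3 and flag rollbacks / specific instants.
# Bucket and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"                      # placeholder
prefix = "path/to/table_a/.hoodie/"       # placeholder
instants_of_interest = {"20240807151109878", "20240820153145763"}

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        name = obj["Key"].rsplit("/", 1)[-1]
        if not name:
            continue
        if ".rollback" in name:
            print("rollback instant:", name)
        if any(ts in name for ts in instants_of_interest):
            print("instant of interest:", name)
```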
Hey @ad1happy2go, I have gotten the same issue occurring for another table. I am attaching the query result for the query below:
SELECT * FROM table_a where _hoodie_record_key = '11111'; Output: queryOutput.csv
Timeline Zipped: hoodie_timeline.zip
Also, I checked the timeline and both of those instants are present in the timeline.
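For reference, a more general version of the query above that surfaces every duplicated key at once, along with the commits and files holding each copy (a sketch; table_a stands in for the actual table registered in the catalog):

```python
# Hedged sketch: list record keys that appear more than once, with the commits
# and file groups that hold each copy. Assumes table_a is registered in the catalog.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-dup-check").enableHiveSupport().getOrCreate()

dups = spark.sql("""
    SELECT t._hoodie_record_key, t._hoodie_commit_time,
           t._hoodie_partition_path, t._hoodie_file_name
    FROM table_a t
    JOIN (
        SELECT _hoodie_record_key
        FROM table_a
        GROUP BY _hoodie_record_key
        HAVING COUNT(*) > 1
    ) d ON t._hoodie_record_key = d._hoodie_record_key
    ORDER BY t._hoodie_record_key, t._hoodie_commit_time
""")
dups.show(truncate=False)
```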
Hey @ad1happy2go, can you confirm whether this issue has been fixed in Hudi v0.14.1? We are running on emr-7.1.0, Hive 3.1.3, Spark 3.5.0.
@ArpitAdhikari There was no open issue related to this with BLOOM. There were a couple of issues, one related to GLOBAL_BLOOM and another related to hoodie.bloom.index.use.metadata being set to true. I reviewed the timeline and I see you are using async cleaning. Are you setting a lock provider for it? I can see that the duplicates got created by subsequent compaction commits.
Can you try with 0.14.1 or 0.15.0 to confirm whether you still see the issue?
@ad1happy2go We recently moved to Hudi 0.14.1 to avoid this issue, but just today we saw the duplicates again in a few of the tables. To answer your question, yes, we use async cleaning with the below properties:
"hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL"
"hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.InProcessLockProvider"
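For completeness, these sit alongside the rest of our writer options roughly as in the sketch below; the failed-writes policy and async-clean lines are assumptions about commonly paired settings, not properties confirmed earlier in this thread.

```python
# Hedged sketch of the lock/concurrency options in PySpark writer-config form.
# The last two entries are assumptions of commonly paired settings, not
# properties confirmed in this thread.
concurrency_options = {
    "hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
    "hoodie.cleaner.policy.failed.writes": "LAZY",   # assumption: often paired with OCC
    "hoodie.clean.async": "true",                    # async cleaning, as described above
}

# These would be merged into the writer options, e.g.:
# df.write.format("hudi").options(**hudi_options).options(**concurrency_options)...
```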
@ArpitAdhikari Thanks, I just reviewed your configs too. It all looks good to me, so this issue might still be there. I tried reproducing it but was not able to do so.
Can we get on a call to look into this further?
@ArpitAdhikari Sure, let's coordinate on Slack.
Hi @ArpitAdhikari
I hope this issue is now resolved. If you are still encountering problems, please let me know so we can schedule a quick call to troubleshoot. I'll keep this ticket open for one more week and close it if there's no further response. Feel free to reopen it if the issue becomes reproducible again in the future.
Closing this issue because the user doesn't have any follow-up questions.
@rangareddy - We are facing the same issue. Would you be able to help us?