[SUPPORT] Duplicate records getting inserted at Hudi S3 sink
**Describe the problem you faced**
We are using Hudi in our CDC pipeline, which consumes from MSK topics, and we are getting duplicate entries in our tables. To summarize the scenario: we are on Hudi v0.12.2 on emr-6.10.1, the data is captured from MySQL change events (which can be inserts or upserts), and the Spark application itself looks healthy. The tables we are seeing duplicates in are mostly MOR tables with the BLOOM index.
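For context, the writer is configured roughly along the lines of the sketch below. This is a minimal sketch rather than our exact job: the table name, source path, record key, precombine, and partition fields are placeholders, and only the options relevant to this report (MOR table type, BLOOM index, upsert operation) reflect our setup.

```python
# Minimal PySpark sketch of the upsert path (placeholder table/field names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-hudi-writer").getOrCreate()

hudi_options = {
    "hoodie.table.name": "table_a",                             # placeholder
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",            # placeholder
    "hoodie.datasource.write.precombine.field": "updated_at",   # placeholder
    "hoodie.datasource.write.partitionpath.field": "dt",        # placeholder
    "hoodie.index.type": "BLOOM",
}

# cdc_df represents one micro-batch of MySQL change events consumed from MSK.
cdc_df = spark.read.json("s3://bucket/example-batch/")          # placeholder source

(cdc_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://bucket/path/to/table_a"))                       # placeholder path
```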
**Expected behavior**
Any update coming into the pipeline should update the already existing record.
The attachment here contains:
- Query being used to detect the duplicates
- Metafields
- Sample output of the query showing duplicates
- Hudi configs (Attachment.txt)

**Environment Description**
- Hudi version : 0.12.2
- Spark version : Spark 3.3.1
- Hive version : Hive 3.1.3
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : Yes
**Additional context**
- EMR Version: emr-6.10.1
We have also tried removing the duplicates manually with a Python script and reran the job, but after running for a few days it started inserting duplicate records again.
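For illustration only, a cleanup of that kind might look roughly like the sketch below: read the table, keep the latest row per record key by the precombine field, and rewrite the table. The path, key, and column names are placeholders, and this is not necessarily what our script did.

```python
# Hedged sketch: rewrite the table keeping only the latest row per record key.
# All names (path, table name, key column, precombine column) are placeholders.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hudi-dedup").getOrCreate()

base_path = "s3://bucket/path/to/table_a"   # placeholder
df = spark.read.format("hudi").load(base_path)

# Keep the newest copy of each record key, using the precombine field as the tiebreaker.
w = Window.partitionBy("_hoodie_record_key").orderBy(F.col("updated_at").desc())
deduped = (df.withColumn("_rn", F.row_number().over(w))
             .filter("_rn = 1")
             .drop("_rn")
             .drop("_hoodie_commit_time", "_hoodie_commit_seqno",
                   "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name"))

# Overwrite the table contents with the deduplicated snapshot.
(deduped.write.format("hudi")
    .option("hoodie.table.name", "table_a")                            # placeholder
    .option("hoodie.datasource.write.operation", "insert_overwrite_table")
    .option("hoodie.datasource.write.recordkey.field", "id")           # placeholder
    .option("hoodie.datasource.write.precombine.field", "updated_at")  # placeholder
    .option("hoodie.datasource.write.partitionpath.field", "dt")       # placeholder
    .mode("append")
    .save(base_path))
```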
Screenshot of the latest timeline:
@ArpitAdhikari Can you check whether 20240807151109878 and 20240820153145763 exist in your timeline? Can you also provide your zipped .hoodie directory if possible?
Since this was a production application, I had to remove the duplicates and rerun the job. I am trying to reproduce the scenario; once done, I will update here.
Sure. @ArpitAdhikari Also look for any rollbacks or Spark retries. Can you try Hudi 0.12.3, as it has some more bug fixes?
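For anyone checking this on S3, one quick way to look for those instants and any rollback files is to list the .hoodie prefix. A minimal sketch, assuming boto3 and placeholder bucket/prefix values:

```python
# Hedged sketch: list Hudi timeline files on S3 and flag rollbacks / specific instants.
# Bucket and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"                      # placeholder
prefix = "path/to/table_a/.hoodie/"       # placeholder
instants_of_interest = {"20240807151109878", "20240820153145763"}

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        name = obj["Key"].rsplit("/", 1)[-1]
        if not name:
            continue
        if ".rollback" in name:
            print("rollback instant:", name)
        if any(ts in name for ts in instants_of_interest):
            print("instant of interest:", name)
```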
Hey @ad1happy2go, I have gotten the same issue occurring for another table. I am attaching the query result for the query below:
SELECT * FROM table_a where _hoodie_record_key = '11111'; Output: queryOutput.csv
Timeline Zipped: hoodie_timeline.zip
Also, I checked the timeline and both of those instants are present in the timeline.
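For reference, a more general version of the query above that surfaces every duplicated key at once, along with the commits and files holding each copy (a sketch; table_a stands in for the actual table registered in the catalog):

```python
# Hedged sketch: list record keys that appear more than once, with the commits
# and file groups that hold each copy. Assumes table_a is registered in the catalog.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-dup-check").enableHiveSupport().getOrCreate()

dups = spark.sql("""
    SELECT t._hoodie_record_key, t._hoodie_commit_time,
           t._hoodie_partition_path, t._hoodie_file_name
    FROM table_a t
    JOIN (
        SELECT _hoodie_record_key
        FROM table_a
        GROUP BY _hoodie_record_key
        HAVING COUNT(*) > 1
    ) d ON t._hoodie_record_key = d._hoodie_record_key
    ORDER BY t._hoodie_record_key, t._hoodie_commit_time
""")
dups.show(truncate=False)
```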
Hey @ad1happy2go, can you confirm whether this issue has been fixed in Hudi v0.14.1? We are running on emr-7.1.0, Hive 3.1.3, Spark 3.5.0.
@ArpitAdhikari There was no open issue related to this with BLOOM. There were a couple of issues, one related to GLOBAL_BLOOM and another related to hoodie.bloom.index.use.metadata being set to true. I reviewed the timeline and I see you are using async cleaning. Are you setting a lock provider for it? I can see that the duplicates got created by subsequent compaction commits.
Can you try with 0.14.1 or 0.15.0 to confirm whether you still see the issue?
@ad1happy2go We recently moved to Hudi 0.14.1 to avoid this issue, but just today we saw the duplicates again in a few of the tables. To answer your question, yes, we use async cleaning with the below properties:
"hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL"
"hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.InProcessLockProvider"
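For completeness, these sit alongside the rest of our writer options roughly as in the sketch below; the failed-writes policy and async-clean lines are assumptions about commonly paired settings, not properties confirmed earlier in this thread.

```python
# Hedged sketch of the lock/concurrency options in PySpark writer-config form.
# The last two entries are assumptions of commonly paired settings, not
# properties confirmed in this thread.
concurrency_options = {
    "hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
    "hoodie.cleaner.policy.failed.writes": "LAZY",   # assumption: often paired with OCC
    "hoodie.clean.async": "true",                    # async cleaning, as described above
}

# These would be merged into the writer options, e.g.:
# df.write.format("hudi").options(**hudi_options).options(**concurrency_options)...
```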
@ArpitAdhikari Thanks, I just reviewed your configs too. It all looks good to me, so this issue might still be there. I tried reproducing it but was not able to do so.
Can we get on a call to look into this further?
@ArpitAdhikari Sure, let's coordinate on Slack.
Hi @ArpitAdhikari
I hope this issue is now resolved. If you are still encountering problems, please let me know so we can schedule a quick call to troubleshoot. I'll keep this ticket open for one more week and close it if there's no further response. Feel free to reopen it if the issue becomes reproducible again in the future.
Closing this issue because the user doesn't have any follow-up questions.
@rangareddy - We are facing the same issue. Would you be able to help us?