
Duplicates seen with merge operation

Open Kiran-G1 opened this issue 2 years ago • 2 comments

deltaTable.alias("original")
.merge(batch_of_records.alias("updates"), "original.id = updates.id")
.whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

The ID column used in this example is unique.
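
For readers trying to reproduce this, one possibility worth ruling out (not confirmed as the cause in this issue) is that the source batch itself carries repeated ids. A merge compares each source row against the target as it stood before the merge, so two source rows with the same new id are both "not matched" and both inserted. The sketch below is a plain-Python toy model of that behavior, not the Delta Lake API; `simulate_upsert` and `duplicate_keys` are hypothetical helper names.

```python
from collections import Counter


def simulate_upsert(target, updates, key="id"):
    """Toy model of whenMatchedUpdateAll + whenNotMatchedInsertAll.

    Every update row is checked against the keys present in the target
    BEFORE the merge, never against rows inserted by the same merge.
    """
    existing = {row[key] for row in target}
    # Matched rows: overwrite the target row. If several update rows share
    # a matched key, this toy keeps the last one; it does not try to mirror
    # how Delta Lake handles that case.
    replacements = {u[key]: u for u in updates if u[key] in existing}
    merged = [replacements.get(row[key], row) for row in target]
    # Not-matched rows: each one is inserted, even if another update row
    # in the same batch has the same key.
    merged += [u for u in updates if u[key] not in existing]
    return merged


def duplicate_keys(rows, key="id"):
    """Return the sorted keys that occur more than once in rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


target = [{"id": 1, "v": "a"}]                             # unique ids
updates = [{"id": 2, "v": "b"}, {"id": 2, "v": "c"}]       # id 2 repeated
merged = simulate_upsert(target, updates)
print(duplicate_keys(merged))
```

Running the same kind of duplicate check on the real `batch_of_records` before the merge would show whether this explanation applies.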

Kiran-G1 avatar Aug 11 '22 16:08 Kiran-G1

Hi @Kiran-G1 can you share a more complete example with sample data? A full reproduction will help us confirm and track this down. Here is a good example, https://github.com/delta-io/delta/issues/1279

nkarpov avatar Aug 11 '22 18:08 nkarpov

There is a lot of very relevant discussion on duplicates in merge here - https://github.com/delta-io/delta/issues/527

tdas avatar Aug 16 '22 01:08 tdas

I partially replicated this issue with a different use case.

The use case is to run a merge whenever a new record arrives, setting the old record's datetime to the current time and the latest record's datetime to 2050 in the Delta table.

But all of the records were treated as new records...

image

I noticed that all of those records were inserted into the Delta table at the same time (see the load time).

My hunch is that, because Spark writes to the Delta table in parallel and all of these records were inserted at the same time, a race condition kept the merge condition from matching.
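
One way to test this hunch is to reduce the incoming batch to a single row per key before merging and see whether the duplicates disappear. The following is a minimal plain-Python sketch of that reduction, not Spark or Delta code; the helper `latest_per_key` and the field names `id`, `load_time`, and `value` are assumptions made for illustration.

```python
def latest_per_key(rows, key="id", order_by="load_time"):
    """Keep only the most recent row for each key.

    On a tie in order_by, the row seen first is kept.
    """
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical batch: id 7 arrives twice within the same load.
batch = [
    {"id": 7, "load_time": "2022-09-21T05:00:00", "value": "old"},
    {"id": 7, "load_time": "2022-09-21T05:00:05", "value": "new"},
    {"id": 8, "load_time": "2022-09-21T05:00:01", "value": "only"},
]
deduped = latest_per_key(batch)
print(deduped)
```

If the merge still inserts duplicates when fed a batch with one row per id, the cause lies elsewhere and the batch contents can be excluded.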

@tdas @nkarpov

Kiran-G1 avatar Sep 21 '22 05:09 Kiran-G1

Can we focus this discussion on the original issue #527? This is the same problem. Can you add more information on this replication there (code, source/target details etc)?

allisonport-db avatar Sep 29 '22 18:09 allisonport-db