Duplicates seen with merge operation
(deltaTable.alias("original")
    .merge(batch_of_records.alias("updates"), "original.id = updates.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
The ID column used in this example is unique.
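A quick way to check for duplicate ids after the merge (a minimal sketch; deltaTable is assumed to be the same DeltaTable instance as above):

from pyspark.sql import functions as F

# Count rows per id; any count > 1 indicates a duplicate after the merge.
(deltaTable.toDF()
    .groupBy("id")
    .agg(F.count("*").alias("cnt"))
    .filter("cnt > 1")
    .show())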
Hi @Kiran-G1, can you share a more complete example with sample data? A full reproduction will help us confirm and track this down. Here is a good example: https://github.com/delta-io/delta/issues/1279
There is a lot of very relevant discussion on duplicates in merge here - https://github.com/delta-io/delta/issues/527
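One thing worth ruling out is duplicates within the source batch itself: if the incoming batch contains the same id more than once, each unmatched copy gets inserted, which shows up as duplicates in the target. A minimal sketch, reusing batch_of_records and deltaTable from above:

# Drop repeated ids from the incoming batch before merging
# (use a window function instead if you need to keep a specific row per id).
deduped_batch = batch_of_records.dropDuplicates(["id"])

(deltaTable.alias("original")
    .merge(deduped_batch.alias("updates"), "original.id = updates.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())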
I partially replicated this issue with a different use case.
The use case: when a new record arrives, a merge should update the old record's datetime to the current time and set the latest record's datetime to 2050 in the Delta table (a rough sketch of the merge is below).
But all of the records were treated as new records...
I noticed that all of those records were inserted into the Delta table at the same time (see the load time).
My hunch is that, since the Delta table write happens in parallel in Spark and all of these records were inserted at the same time, this race condition meant the merge condition was never satisfied.
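Rough sketch of the merge described above (all names are illustrative placeholders rather than the exact job code; spark, new_batch, and the events table are assumptions):

from pyspark.sql import functions as F
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "events")  # placeholder table name

# New/latest records carry the far-future end datetime.
updates = new_batch.withColumn("record_dt", F.lit("2050-12-31 00:00:00").cast("timestamp"))

(target.alias("old")
    .merge(updates.alias("new"), "old.id = new.id")
    .whenMatchedUpdate(set={"record_dt": F.current_timestamp()})  # close out the old record
    .whenNotMatchedInsertAll()                                    # insert the latest record with the 2050 datetime
    .execute())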
@tdas @nkarpov
Can we focus this discussion on the original issue #527? This is the same problem. Can you add more information about this replication there (code, source/target details, etc.)?