snowplow-rdb-loader

RDB Loader: ensure we get earliest event when deduplicating

Open dilyand opened this issue 6 years ago • 0 comments

The current logic for natural deduplication does not guarantee that we always preserve the earliest event from a batch of duplicates: https://github.com/snowplow/snowplow-rdb-loader/blob/master/shredder/src/main/scala/com.snowplowanalytics.snowplow.storage/spark/ShredJob.scala#L415-L416

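This can lead to confusing outcomes. Natural duplicates can have different collector_tstamp values. If we keep, for example, the last one in the series, the surviving copy can carry a later collector_tstamp than a different event that was collected successfully, without duplication, before that last copy arrived. The stored timestamps then misrepresent the order in which the events were first collected.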
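To make the desired behaviour concrete, here is a minimal sketch in plain Scala collections. It is not the ShredJob code and not a proposed patch. The `Event` case class and `keepEarliest` are hypothetical names, and the sketch assumes natural duplicates are grouped by `event_id` plus `event_fingerprint`, with the earliest `collector_tstamp` deciding which copy survives.

```scala
import java.time.Instant

// Hypothetical, simplified event: the real enriched event has many more fields.
final case class Event(
  eventId: String,
  eventFingerprint: String,
  collectorTstamp: Instant
)

object EarliestDedupe {
  // Group natural duplicates (assumed here: same event_id and same
  // event_fingerprint) and keep the copy with the smallest collector_tstamp.
  // Picking the minimum explicitly makes the result independent of input order.
  def keepEarliest(events: Seq[Event]): Seq[Event] =
    events
      .groupBy(e => (e.eventId, e.eventFingerprint))
      .values
      .map(_.minBy(_.collectorTstamp.toEpochMilli))
      .toSeq
      .sortBy(_.collectorTstamp.toEpochMilli)
}
```

The point is that the survivor is chosen with an explicit minimum rather than by taking whichever element of the group happens to come first or last. In a Spark job the same idea would need a reduce-style operation that compares timestamps, since the order of elements within a group is not guaranteed.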

dilyand, Apr 01 '19 13:04