snowplow-rdb-loader

RDB Shredder: make event_fingerprint mandatory

Open chuwy opened this issue 8 years ago • 6 comments

Migrated from https://github.com/snowplow/snowplow/issues/3445#issuecomment-333064293

Right now, when event_fingerprint is missing, we generate a random UUID, which turns all natural duplicates into synthetic ones. We should throw an exception and abort shredding instead.
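To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not the shredder's actual code: the function names, the `MissingFingerprintError` class and the dict-shaped events are all assumptions made for illustration.

```python
import uuid


class MissingFingerprintError(Exception):
    """Hypothetical error raised when an event has no event_fingerprint."""


def fingerprint_with_fallback(event: dict) -> str:
    # Behaviour described above: a missing fingerprint is replaced with a
    # random UUID, so two copies of the same event no longer compare as equal.
    return event.get("event_fingerprint") or str(uuid.uuid4())


def fingerprint_strict(event: dict) -> str:
    # Proposed behaviour: refuse to continue instead of inventing a fingerprint.
    fingerprint = event.get("event_fingerprint")
    if not fingerprint:
        raise MissingFingerprintError(
            f"event {event.get('event_id')} has no event_fingerprint"
        )
    return fingerprint


# Two copies of the same event, both without a fingerprint (a natural duplicate).
duplicate_a = {"event_id": "e1", "event_fingerprint": None}
duplicate_b = {"event_id": "e1", "event_fingerprint": None}

# With the fallback, the two copies end up with different fingerprints,
# which makes them look like synthetic duplicates.
fallback_fingerprints_differ = (
    fingerprint_with_fallback(duplicate_a) != fingerprint_with_fallback(duplicate_b)
)
```

With the strict variant, the same input raises `MissingFingerprintError` and the job would stop rather than silently reclassify the duplicates.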

chuwy avatar Sep 29 '17 08:09 chuwy

@alexanderdean one caveat though, not sure if critical.

If a user has an enriched dataset that now needs to be loaded into a relational database, it won't be possible without re-enriching the raw logs.

chuwy avatar Sep 29 '17 08:09 chuwy

They can keep dedupe disabled until they have caught up with where they enabled event fingerprinting?

alexanderdean avatar Sep 29 '17 08:09 alexanderdean

No, they cannot, because there's also in-batch deduplication, which cannot be disabled.

chuwy avatar Sep 29 '17 08:09 chuwy

Hmm. We need to think about this some more.

alexanderdean avatar Sep 29 '17 08:09 alexanderdean

We got stung by this today. I would recommend adding a --force flag if you want to keep it optional, but making it mandatory by default, so that the job will not run without it.

This informs the user of the risk, and they can ignore it if they wish.
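A rough sketch of how such a flag could behave, written in Python with `argparse`. Everything here is hypothetical: the program name, the helper names and the error message are not taken from the real shredder, and the `--force` flag is only the proposal from this comment.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI: the program name is made up and the flag is only
    # the one proposed in this thread.
    parser = argparse.ArgumentParser(prog="shredder-sketch")
    parser.add_argument(
        "--force",
        action="store_true",
        help="run even if some events have no event_fingerprint",
    )
    return parser


def check_fingerprints(events: list, force: bool) -> list:
    # Collect the ids of events that have no usable fingerprint.
    missing = [e["event_id"] for e in events if not e.get("event_fingerprint")]
    if missing and not force:
        # Default: refuse to run and tell the user how to override.
        raise SystemExit(
            f"{len(missing)} event(s) without event_fingerprint; "
            "re-run with --force to continue anyway"
        )
    # With --force the caller gets the list back and can log a warning.
    return missing
```

A caller would parse the arguments with `build_parser().parse_args()` and pass `args.force` to `check_fingerprints` before starting the job.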

jbeemster avatar Nov 21 '18 12:11 jbeemster

event_fingerprint is required for both flavours of deduplication: natural and synthetic, because it is the only thing that allows us to differentiate between natural and synthetic dupes.

If there is no event_fingerprint in the data (e.g. because the enrichment is not turned on), then we should not do any deduplication.
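The distinction can be sketched in Python as follows, assuming events are plain dicts with `event_id` and `event_fingerprint` keys. The function and the label names are illustrative only, and the rule is simplified: any group whose fingerprints are not all identical is labelled synthetic.

```python
from collections import defaultdict


def classify_duplicates(events: list) -> dict:
    """Label each event_id as 'unique', 'natural', 'synthetic' or 'unknown'."""
    # Group the fingerprints of all events that share an event_id.
    by_id = defaultdict(list)
    for event in events:
        by_id[event["event_id"]].append(event.get("event_fingerprint"))

    labels = {}
    for event_id, fingerprints in by_id.items():
        if len(fingerprints) == 1:
            labels[event_id] = "unique"
        elif any(fp is None for fp in fingerprints):
            # Without a fingerprint the two kinds cannot be told apart,
            # so no deduplication decision is made.
            labels[event_id] = "unknown"
        elif len(set(fingerprints)) == 1:
            # Same event_id and same fingerprint: the same event sent twice.
            labels[event_id] = "natural"
        else:
            # Same event_id but different payloads: an event_id collision.
            labels[event_id] = "synthetic"
    return labels
```

Events labelled `unknown` are the ones this comment argues should be passed through untouched.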

dilyand avatar Apr 03 '19 13:04 dilyand