snowplow-rdb-loader
RDB Shredder: make event_fingerprint mandatory
Migrated from https://github.com/snowplow/snowplow/issues/3445#issuecomment-333064293
Right now we're generating a random UUID when event_fingerprint is missing, which turns all natural duplicates into synthetic ones. We should throw an exception and abort shredding instead.
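To illustrate the difference between the two behaviours, here is a minimal Python sketch. It is not the RDB Shredder code: the function names, the `MissingFingerprintError` class and the dict-based event shape are all assumptions made for the example.

```python
import uuid


class MissingFingerprintError(Exception):
    """Hypothetical error for an enriched event without event_fingerprint."""


def fingerprint_with_fallback(event: dict) -> str:
    # Behaviour described in this issue: invent a random UUID when the
    # field is absent, so two identical events never share a fingerprint.
    return event.get("event_fingerprint") or str(uuid.uuid4())


def fingerprint_or_abort(event: dict) -> str:
    # Proposed behaviour: refuse to continue instead of guessing.
    fingerprint = event.get("event_fingerprint")
    if not fingerprint:
        raise MissingFingerprintError(
            f"event {event.get('event_id')} has no event_fingerprint"
        )
    return fingerprint


# Two copies of the same event, i.e. a natural duplicate pair.
first = {"event_id": "e1", "event_fingerprint": None}
second = dict(first)
```

With the fallback, `first` and `second` get different fingerprints and so look like synthetic duplicates; with `fingerprint_or_abort`, the job stops at the first event that lacks the field.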
@alexanderdean one caveat though, not sure if critical.
If a user has an enriched dataset without fingerprints that now needs to be loaded into a relational database, that won't be possible without re-enriching the raw logs.
They can keep dedupe disabled until they have caught up with where they enabled event fingerprinting?
No, they cannot, because there's also in-batch deduplication, which cannot be disabled.
Hmm. We need to think about this some more.
We got stung by this today. What I would recommend is making event_fingerprint mandatory by default, so that the job will not run without it, and adding a --force flag for anyone who wants to keep it optional.
This informs the user of the risk, and they can still ignore it if they wish.
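A rough sketch of how such a flag could behave, written in Python with `argparse` purely for illustration. The `--force` flag is only the proposal from this comment, not an existing option, and `build_parser`, `check_fingerprints` and the program name are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI; --force is the flag proposed in this thread.
    parser = argparse.ArgumentParser(prog="shred-sketch")
    parser.add_argument(
        "--force",
        action="store_true",
        help="run even if some events have no event_fingerprint",
    )
    return parser


def check_fingerprints(events: list, force: bool) -> int:
    """Return the number of events without a fingerprint.

    Exits with an error message unless force is set, so the default
    is to refuse to run.
    """
    missing = [e for e in events if not e.get("event_fingerprint")]
    if missing and not force:
        raise SystemExit(
            f"{len(missing)} event(s) have no event_fingerprint; "
            "re-run with --force to continue anyway"
        )
    return len(missing)
```

For example, `check_fingerprints(events, build_parser().parse_args().force)` would abort by default and only proceed when the user passes `--force` explicitly.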
event_fingerprint is required for both flavours of deduplication: natural and synthetic, because it is the only thing that allows us to differentiate between natural and synthetic dupes.
If there is no event_fingerprint in the data (e.g. because the enrichment is not turned on), then we should not be doing any deduplication.
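A minimal Python sketch of that rule, assuming events are plain dicts with `event_id` and `event_fingerprint` keys. The `deduplicate` function is hypothetical and only classifies duplicates; it does not reproduce what the shredder does with synthetic ones afterwards.

```python
from collections import defaultdict


def deduplicate(events: list) -> tuple:
    """Return (kept_events, event_ids_with_synthetic_duplicates)."""
    # No fingerprint anywhere in the batch means natural and synthetic
    # duplicates cannot be told apart, so skip deduplication entirely.
    if any(not e.get("event_fingerprint") for e in events):
        return list(events), []

    seen = set()
    fingerprints_by_id = defaultdict(set)
    kept = []
    for event in events:
        key = (event["event_id"], event["event_fingerprint"])
        fingerprints_by_id[event["event_id"]].add(event["event_fingerprint"])
        if key in seen:
            # Same id and same fingerprint: a natural duplicate, drop it.
            continue
        seen.add(key)
        kept.append(event)

    # Same id but different fingerprints: synthetic duplicates.
    synthetic_ids = sorted(
        event_id
        for event_id, fingerprints in fingerprints_by_id.items()
        if len(fingerprints) > 1
    )
    return kept, synthetic_ids
```

The early return is the point of the comment above: without fingerprints, the only safe option in this sketch is to pass the batch through untouched.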