Allow configuration of rule for when derived_tstamp should default to collector_tstamp.

Open colmsnowplow opened this issue 5 years ago • 4 comments
We have seen issues in the past where derived_tstamp can have a very wrong value due to clocks changing in between the two dvce tstamps, changing timezones, and now, more recently, simply erroneous tstamps that occur rarely.

Because BQ can use derived tstamp as a partition key, this data can fail at load. It's also an annoying problem for using or modeling the data (where derived is the best tstamp to use if correct).

A config that allowed for a simple rule to say when it should default to collector tstamp would work around the general case of this issue, I believe. Something along the lines of:

  • Compute the derived_tstamp.
  • If derived_tstamp is more than X old (e.g. 1 year), use collector_tstamp; otherwise use derived_tstamp.
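
The rule above could be sketched roughly as follows. This is only an illustration, not enrich code: the function names, the `max_age` parameter and the use of processing time (`now`) as the reference point for "X old" are all assumptions. The derived_tstamp formula is the usual collector_tstamp - (dvce_sent_tstamp - dvce_created_tstamp), as I understand it. Passing `max_age=None` switches the fallback off.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def derive_tstamp(collector: datetime,
                  dvce_created: Optional[datetime],
                  dvce_sent: Optional[datetime]) -> datetime:
    """Hypothetical helper: collector_tstamp minus the device-side send delay."""
    if dvce_created is None or dvce_sent is None:
        # Without both device timestamps there is nothing to correct.
        return collector
    return collector - (dvce_sent - dvce_created)


def choose_tstamp(collector: datetime,
                  dvce_created: Optional[datetime],
                  dvce_sent: Optional[datetime],
                  now: datetime,
                  max_age: Optional[timedelta] = timedelta(days=365)) -> datetime:
    """Hypothetical rule: fall back to collector_tstamp when derived is too old.

    max_age=None disables the fallback (assumed on/off toggle).
    """
    derived = derive_tstamp(collector, dvce_created, dvce_sent)
    if max_age is not None and now - derived > max_age:
        return collector
    return derived


if __name__ == "__main__":
    utc = timezone.utc
    collector = datetime(2020, 7, 30, 9, 0, 0, tzinfo=utc)
    now = datetime(2020, 7, 30, 10, 0, 0, tzinfo=utc)
    # Device clock was wildly wrong when the event was created.
    created = datetime(2015, 1, 1, 0, 0, 0, tzinfo=utc)
    sent = datetime(2020, 7, 30, 8, 59, 59, tzinfo=utc)
    print(choose_tstamp(collector, created, sent, now))
```

Because this sketch measures age against processing time, reprocessing old data would trigger the fallback unless `max_age` is raised or disabled.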

Note: these tstamps are generally only problematic in a significant way when they're some very incorrect date, like 1+ year ago. Just being somewhat incorrect in rare cases is fine and manageable.

colmsnowplow avatar Jul 30 '20 09:07 colmsnowplow

Currently there's no configuration option that can specify behavior for such generic fields as derived_tstamp. Maybe we should have one, maybe we should hardcode some reasonable timeout.

If we pick a certain timeout (discard period) we should have a clear explanation why we chose exactly that period, e.g.

  • 1 year because this is some cookie expiration period
  • 2 years because it breaks some loading
  • Since 2012 because Snowplow didn't exist before that etc

chuwy avatar Jul 30 '20 10:07 chuwy

So my only concern about doing things that way is that we'd like people to be able to reprocess data from, let's say, 3 years ago.

No matter what timeout we choose, we could imagine a scenario where that becomes an issue. So perhaps the ability to toggle that on/off would do if we can't specify the timeout in enrich config.

colmsnowplow avatar Jul 30 '20 10:07 colmsnowplow

That's a very good point.

chuwy avatar Jul 30 '20 10:07 chuwy

Or... Thinking another step forwards... I think that the problems only arise when there's a big distance between the two dvce tstamps. So maybe we could hardcode a max distance between those.
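
A rough sketch of that max-distance idea, for illustration only: `derive_with_max_gap` and the `max_gap` default of 1 year are hypothetical, not anything that exists in enrich. The check looks only at the gap between dvce_created_tstamp and dvce_sent_tstamp.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def derive_with_max_gap(collector: datetime,
                        dvce_created: Optional[datetime],
                        dvce_sent: Optional[datetime],
                        max_gap: timedelta = timedelta(days=365)) -> datetime:
    """Hypothetical rule: ignore the device timestamps when they are too far apart."""
    if dvce_created is None or dvce_sent is None:
        return collector
    gap = dvce_sent - dvce_created
    # abs() also catches a created tstamp that is far *after* the sent tstamp.
    if abs(gap) > max_gap:
        return collector
    return collector - gap


if __name__ == "__main__":
    utc = timezone.utc
    collector = datetime(2017, 3, 1, 12, 0, 0, tzinfo=utc)
    created = datetime(2017, 3, 1, 11, 59, 0, tzinfo=utc)
    sent = datetime(2017, 3, 1, 11, 59, 30, tzinfo=utc)
    print(derive_with_max_gap(collector, created, sent))
```

Since nothing here depends on the time of processing, reprocessing data from 3 years ago would give the same result as processing it at collection time.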

colmsnowplow avatar Jul 30 '20 10:07 colmsnowplow