snowplow-rdb-loader
snowplow-rdb-loader copied to clipboard
Use derived_tstamp as the primary tstamp in Redshift
Currently we use the collector_tstamp for the root_tstamp value.
It would be preferable to use the derived_tstamp once all our client side trackers support generating a dvce_sent_tstamp. (Because that point from an analytics perspective you're only interested in the derived_tstamp.
We need to figure out how we migrate from collector_tstamp -> derived_tstamp e.g. what happens for old users who have events without derived_tstamp values.
This impacts the table SORTKEY as well
I noticed sql-runner appears to still be using collector_tstamp instead of derived_tstamp as well, so you guys may want to update that when you get to this issue
We need to test whether this leads to tables that are more unsorted after inserting new events.
It'd be interesting to explore the impact of this patch: https://forums.aws.amazon.com/ann.jspa?annID=4157
Data loading enhancement. If you load your data in sort key order using a compound sort key with only one sort column, you might now reduce or even eliminate the need to vacuum as the COPY command automatically adds new rows in sort order to the table's sorted region
This should eliminate the need to vacuum if we continue to use the collector timestamp, and reduce (but not eliminate) this need if we were to switch to the derived timestamp.
Likely this is no longer a good idea, but moving over to review