snowplow-rdb-loader icon indicating copy to clipboard operation
snowplow-rdb-loader copied to clipboard

Use derived_tstamp as the primary tstamp in Redshift

Open yalisassoon opened this issue 9 years ago • 5 comments

Currently we use the collector_tstamp for the root_tstamp value. It would be preferable to use the derived_tstamp once all our client side trackers support generating a dvce_sent_tstamp. (Because that point from an analytics perspective you're only interested in the derived_tstamp.

We need to figure out how we migrate from collector_tstamp -> derived_tstamp e.g. what happens for old users who have events without derived_tstamp values.

yalisassoon avatar Feb 19 '16 14:02 yalisassoon

This impacts the table SORTKEY as well

yalisassoon avatar Feb 19 '16 14:02 yalisassoon

I noticed sql-runner appears to still be using collector_tstamp instead of derived_tstamp as well, so you guys may want to update that when you get to this issue

tdevitt avatar May 31 '16 21:05 tdevitt

We need to test whether this leads to tables that are more unsorted after inserting new events.

bogaert avatar Jul 05 '16 20:07 bogaert

It'd be interesting to explore the impact of this patch: https://forums.aws.amazon.com/ann.jspa?annID=4157

Data loading enhancement. If you load your data in sort key order using a compound sort key with only one sort column, you might now reduce or even eliminate the need to vacuum as the COPY command automatically adds new rows in sort order to the table's sorted region

This should eliminate the need to vacuum if we continue to use the collector timestamp, and reduce (but not eliminate) this need if we were to switch to the derived timestamp.

bogaert avatar Nov 21 '16 15:11 bogaert

Likely this is no longer a good idea, but moving over to review

alexanderdean avatar Apr 28 '21 16:04 alexanderdean