snowplow-rdb-loader
Explore using time-series tables in Redshift
e.g. for atomic.events.
This would make it easier to delete expired data: dropping a whole time-sliced table avoids having to run an expensive VACUUM operation after a large DELETE.
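To make the expiry idea concrete, here is a minimal sketch of how expired tables could be identified and dropped. It assumes monthly tables named `atomic.events_YYYYMM`; that naming scheme, and the `month_of` and `drop_expired_sql` helpers, are hypothetical and not something RDB Loader provides.

```python
from datetime import date


def month_of(table_name, prefix="atomic.events_"):
    """Return the first day of the month encoded in a hypothetical
    atomic.events_YYYYMM table name, or None if the name does not match."""
    if not table_name.startswith(prefix):
        return None
    suffix = table_name[len(prefix):]
    if len(suffix) != 6 or not suffix.isdigit():
        return None
    year, month = int(suffix[:4]), int(suffix[4:])
    if not 1 <= month <= 12:
        return None
    return date(year, month, 1)


def drop_expired_sql(table_names, today, keep_months):
    """Build DROP TABLE statements for monthly tables that are more than
    keep_months months older than the current month."""
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    statements = []
    for name in sorted(table_names):
        first_day = month_of(name)
        if first_day is None:
            continue  # not a time-series table, leave it alone
        if first_day.year * 12 + (first_day.month - 1) < cutoff:
            statements.append(f"DROP TABLE IF EXISTS {name};")
    return statements
```

Dropping a table removes its data in one step, so no rows are left behind for VACUUM to reclaim.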
This is super-interesting! Related tickets: snowplow/snowplow#2457, snowplow/snowplow#953
Some open questions:
- Do we just do atomic.events, or also all shredded tables, or some shredded tables?
- Day tables, week tables or month tables? Probably configurable
- Redshift has a limit of 9,900 tables per cluster (see Redshift limits), which we need to factor in
- Do we partition based on etl_tstamp or derived_tstamp? If the latter, then any given load could be writing into many tables, because a single load can contain derived timestamps from multiple days
- Note that Redshift doesn't have an equivalent of BigQuery's table wildcard functions, but we could add this as a pre-processor in a Snowplow SQL Analytics SDK
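The questions above can be explored with a small sketch. It assumes a hypothetical naming scheme of `atomic.events_<suffix>`, approximates a month as 30 days for the table-count estimate, and stands in for the wildcard pre-processor with a plain `UNION ALL` expansion. None of these helpers exist in Snowplow; they are illustrations only.

```python
import math
from collections import defaultdict
from datetime import datetime

# Approximate partition lengths in days (a month is taken as 30 days).
DAYS_PER_PARTITION = {"day": 1, "week": 7, "month": 30}


def partition_suffix(ts, granularity):
    """Map a timestamp to a table suffix for the chosen granularity."""
    if granularity == "day":
        return ts.strftime("%Y%m%d")
    if granularity == "week":
        iso = ts.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}w{iso[1]:02d}"
    if granularity == "month":
        return ts.strftime("%Y%m")
    raise ValueError(f"unknown granularity: {granularity}")


def group_by_partition(rows, key, granularity, base="atomic.events"):
    """Group rows by target table, partitioning on the given timestamp
    column (e.g. etl_tstamp or derived_tstamp)."""
    groups = defaultdict(list)
    for row in rows:
        table = f"{base}_{partition_suffix(row[key], granularity)}"
        groups[table].append(row)
    return dict(groups)


def estimated_table_count(retention_days, granularity, partitioned_tables):
    """Rough number of tables needed, to compare against the cluster limit."""
    per_table = math.ceil(retention_days / DAYS_PER_PARTITION[granularity])
    return per_table * partitioned_tables


def union_all_query(tables):
    """Expand a set of time-series tables into one UNION ALL query, roughly
    what a wildcard pre-processor might emit."""
    return "\nUNION ALL\n".join(f"SELECT * FROM {t}" for t in sorted(tables))
```

For example, a load whose rows share one `etl_tstamp` day but span three `derived_tstamp` days maps to one table under the first scheme and three under the second. Two years of day tables for `atomic.events` plus 13 shredded tables would come to 730 × 14 = 10,220 tables, which is over the limit noted above.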
Moving to RDB Loader repo