materialize
storage: Design doc for Source Versioning / Tables from Sources [ENG-TASK-15]
Motivation
Tips for reviewer
Checklist
- [ ] This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
- [ ] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
- [ ] If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the `release-blocker` label (example).
- [ ] This PR includes the following user-facing behavior changes:
Migrating a private discussion from Slack around backfilling behavior for source tables:
Something I was mulling over after that user thread: I expect users to ask us for a solution that allows them to not discard all state of the v0 source table in v1. For example, if they have a 7-day retention period in Kafka and create a source table v0, then after 7 days need to evolve the schema and create a source table v1, the blue/green process would cause all the green downstream objects to hydrate based on the new data only. Naively, I'd say this looks like a `UNION` between v0 and v1, but we might need to think about how to cleanly approach this (probably via dbt). We've discussed keeping historical state around for in-place schema changes (i.e., adding new columns, but possibly keeping around the data for old columns that were previously ingested even if they're dropped); here, though, we're just assuming users are okay bootstrapping their source and its dependencies from scratch.
cc @sdht0, who separately also brought this up in conversation with @rjobanp.
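To make the union idea concrete, here's a minimal sketch; the `orders_v0`/`orders_v1` tables and their columns are hypothetical stand-ins, assuming v1's schema change adds a `currency` column:

```sql
-- Hypothetical names throughout. v0 holds the history ingested under the
-- old schema; v1 ingests under the new schema, which adds `currency`.
-- Padding v0 with a NULL cast aligns the two column lists for UNION ALL.
CREATE VIEW orders_all AS
    SELECT id, amount, NULL::text AS currency FROM orders_v0
    UNION ALL
    SELECT id, amount, currency FROM orders_v1;
```

Downstream objects would then depend on `orders_all` rather than on either versioned table, which is the kind of wiring a tool like dbt could manage.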
> I expect users to ask us for a solution that allows them to not discard all state of the v0 source table in v1.
Thanks for including that here, @morsapaes! It's a great question and will certainly be important to make this feature usable in real-world scenarios.
My initial thought is that we should try to backfill the new v1 source table using the usual snapshot functionality present in our sources, though this would only be able to make data available at timestamps that are still preserved in the upstream system's replication log (e.g. the Postgres WAL, MySQL binlog, or Kafka topic). Hopefully this will be acceptable for most use cases, but it's worth thinking about whether there are better options (such as the union idea you've proposed).
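For concreteness, a hedged sketch of what that snapshot-based backfill could look like, assuming the `CREATE TABLE ... FROM SOURCE` syntax this design proposes (all object names here are hypothetical):

```sql
-- Hypothetical names throughout. Creating a second table from the same
-- source triggers a fresh snapshot, so purchases_v1 can only be backfilled
-- with data the upstream system still retains (e.g. rows within the Kafka
-- topic's retention window; anything already expired is gone).
CREATE TABLE purchases_v1
    FROM SOURCE kafka_src (REFERENCE "purchases");
```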
> I expect users to ask us for a solution that allows them to not discard all state of the v0 source table in v1. For example, if they have a 7-day retention period in Kafka and create a source table v0, then after 7 days need to evolve the schema and create a source table v1, the blue/green process would cause all the green downstream objects to hydrate based on the new data only.
The same problem exists with schema evolution in materialized views: a view might have retained data for multiple days while its inputs have been compacted away, so whatever solution we come up with should be compatible with (and ideally identical to) the solution to that problem.
New webhook design LGTM! 🙇🏽
This PR has gotten fairly big, and since we've agreed on all the open discussion points, I'm going to merge and apply any future updates to the doc in separate PRs.