Tech Spec for mitigation of record growth in nested tables
In incremental/dedup mode, nested tables can grow at a polynomial rate, which increases with each level of nesting and each subsequent run of normalization. Fixing deduplication of nested tables may be a bigger project, but to begin, we need to fix the bug that causes nested tables to grow so fast.
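To make the growth pattern concrete, here is a toy model (not the actual normalization code). It assumes each nesting level is appended from the level above it, filtered by `emitted_at >= max(emitted_at)` of what that level already holds, with no dedup on the nested tables. The `simulate` function, the `emitted_at` key, and the sample rows are all made up for illustration.

```python
import operator


def simulate(compare, levels=3, runs=3):
    """Return per-run row counts for each nesting level (toy model)."""
    # Two raw records that share one timestamp; no new data ever arrives.
    raw = [{"id": 1, "emitted_at": 100}, {"id": 2, "emitted_at": 100}]
    tables = [[] for _ in range(levels)]  # index 0 = top level, then nested
    history = []
    for _ in range(runs):
        upstream = raw
        for table in tables:
            if table:
                # Incremental filter: compare against the newest row we hold.
                cursor = max(r["emitted_at"] for r in table)
                new_rows = [r for r in upstream if compare(r["emitted_at"], cursor)]
            else:
                new_rows = list(upstream)  # first run: full load
            table.extend(new_rows)  # assumption: append only, no dedup
            upstream = table  # the next level reads from this one
        history.append([len(t) for t in tables])
    return history


print(simulate(operator.ge))  # [[2, 2, 2], [4, 6, 8], [6, 12, 20]]
print(simulate(operator.gt))  # [[2, 2, 2], [2, 2, 2], [2, 2, 2]]
```

Under these assumptions the top level grows linearly per run with `>=`, the first nested level grows quadratically (2, 6, 12), and the second nested level faster still (2, 8, 20), even though no new data arrived.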
A couple of approaches have been discussed so far:
- replace `>=` with `>` (see here for a discussion of why we are using `>=` instead of `>`). This may not be safe, because `airbyte_emitted_at` is generated by source connectors and there is no guarantee that it will be unique per job
- add another field to dedup on that is guaranteed to grow monotonically (e.g. job ID added by platform or destination connector)
- move adding of `airbyte_emitted_at` from source to platform or destination
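
A rough sketch of the trade-off between the first two approaches, under assumptions: `select_new` is a hypothetical helper, `job_id` is a made-up field name standing in for whatever monotonically growing value the platform or destination would add, and the rows are invented. It compares which rows each filter picks up when a later job emits a record with the same timestamp as one that is already synced.

```python
def select_new(rows, cursor, key, strict):
    """Return rows whose `key` is past the cursor (strictly or inclusively)."""
    if strict:
        return [r for r in rows if r[key] > cursor]
    return [r for r in rows if r[key] >= cursor]


# Row 1 is already in the destination (written by job 7).
# Row 2 arrives in job 8 but carries the same emitted_at value.
synced = {"id": 1, "emitted_at": 100, "job_id": 7}
late = {"id": 2, "emitted_at": 100, "job_id": 8}
raw = [synced, late]

# Cursors reflect what the destination already holds: emitted_at=100, job_id=7.
ge_emitted = select_new(raw, cursor=100, key="emitted_at", strict=False)
gt_emitted = select_new(raw, cursor=100, key="emitted_at", strict=True)
gt_job = select_new(raw, cursor=7, key="job_id", strict=True)

print([r["id"] for r in ge_emitted])  # [1, 2]: row 1 is appended again
print([r["id"] for r in gt_emitted])  # []: row 2 is silently skipped
print([r["id"] for r in gt_job])      # [2]: only the new row
```

In this sketch `>=` on the timestamp re-appends the synced row, `>` on the timestamp drops the late row, and a strict comparison on a per-job monotonic value picks up exactly the new row.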
Timeboxing this to 13 points
spec is out for review; linking for posterity https://docs.google.com/document/d/1Ep6_jojbi5FdHEI-c1W_DrOCylmQJFeU1xvKM6oK-rc/edit#heading=h.fa72b39y2m99