Tech Spec for mitigation of record growth in nested tables

Open grishick opened this issue 2 years ago • 2 comments

In incremental/dedup mode, nested tables can grow at a polynomial rate, and the growth gets worse with each level of nesting and each subsequent run of normalization. Fixing deduplication of nested tables may be a bigger project, but as a first step we need to fix the bug that causes nested tables to grow so fast.

A couple of approaches have been discussed so far:

  • replace >= with > (see the linked discussion of why we are using >= instead of >). This may not be safe, because airbyte_emitted_at is generated by source connectors and there is no guarantee that it will be unique per job, so a strict comparison could skip records that share a timestamp with rows already processed
  • add another field to dedup on that is guaranteed to grow monotonically (e.g. job ID added by platform or destination connector)
  • move adding of airbyte_emitted_at from source to platform or destination
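
To make the trade-off between the first two approaches concrete, here is a toy sketch in Python. It is not the normalization SQL: the Record type, the job_id field, and the select_* functions are all hypothetical names made up for illustration, and the model ignores nesting entirely (it only shows rows being re-appended on each run, which is linear growth; the polynomial growth described above comes from nesting). It assumes the target table has no dedup step and that new rows are picked by comparing against the max cursor value already in the target.

```python
from typing import NamedTuple


class Record(NamedTuple):
    emitted_at: int  # stand-in for airbyte_emitted_at
    job_id: int      # hypothetical monotonically growing field (e.g. job ID)
    payload: str


def select_inclusive(source, target):
    # ">=": rows that share the max timestamp are selected again on every run.
    if not target:
        return list(source)
    cursor = max(r.emitted_at for r in target)
    return [r for r in source if r.emitted_at >= cursor]


def select_strict(source, target):
    # ">": no re-selection, but rows with a timestamp equal to the cursor are skipped.
    if not target:
        return list(source)
    cursor = max(r.emitted_at for r in target)
    return [r for r in source if r.emitted_at > cursor]


def select_composite(source, target):
    # Strict comparison on (emitted_at, job_id); job_id breaks timestamp ties.
    if not target:
        return list(source)
    cursor = max((r.emitted_at, r.job_id) for r in target)
    return [r for r in source if (r.emitted_at, r.job_id) > cursor]


def simulate(select, batches):
    # Append each batch to the raw source, then append the selected rows
    # to the target without deduplicating. Returns target size after each run.
    source, target, sizes = [], [], []
    for batch in batches:
        source.extend(batch)
        target.extend(select(source, target))
        sizes.append(len(target))
    return sizes


# Job 1 emits two records; job 2 emits one record with the SAME timestamp;
# the third run has no new data.
batches = [
    [Record(100, 1, "a"), Record(100, 1, "b")],
    [Record(100, 2, "c")],
    [],
]

inclusive_sizes = simulate(select_inclusive, batches)
strict_sizes = simulate(select_strict, batches)
composite_sizes = simulate(select_composite, batches)
```

In this toy setup the inclusive comparison keeps re-adding the same rows on every run, the strict comparison stops the growth but never picks up record "c", and the composite cursor picks up each record exactly once. Whether a job ID is actually available and monotonic at this point in the pipeline is the open question the second bullet raises.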

grishick avatar Jan 25 '23 01:01 grishick

Timeboxing this to 13 points

grishick avatar Jan 25 '23 01:01 grishick

spec is out for review; linking for posterity https://docs.google.com/document/d/1Ep6_jojbi5FdHEI-c1W_DrOCylmQJFeU1xvKM6oK-rc/edit#heading=h.fa72b39y2m99

edgao avatar Jan 27 '23 19:01 edgao