
disable rows deduplication if Incremental is attached to a resource with `merge` write disposition

Open rudolfix opened this issue 11 months ago • 1 comments

Feature description

Incremental removes duplicate rows (only those with the same cursor field value; see the docs) based on a content hash or the primary key. This is not the expected behavior when a merge key is present: the destination should be allowed to merge changes coming from the new records. We want to disable this "deduplication" in Incremental and expose the destination to all rows from the data source.
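To make the problem concrete, here is a minimal, self-contained sketch of boundary deduplication. It is a toy model, not dlt's implementation: `row_key`, `filter_rows`, and the `seen_keys` set are hypothetical names, and the rule "an empty `primary_key` tuple disables deduplication" is modeled after the proposal below.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional, Sequence, Set

Row = Dict[str, Any]


def row_key(row: Row, primary_key: Optional[Sequence[str]]) -> str:
    """Identity of a row: its primary key values if given, else a content hash."""
    if primary_key:
        return json.dumps([row[k] for k in primary_key], default=str)
    payload = json.dumps(row, sort_keys=True, default=str).encode()
    return hashlib.sha256(payload).hexdigest()


def filter_rows(
    rows: List[Row],
    cursor_field: str,
    last_value: Any,
    seen_keys: Set[str],
    primary_key: Optional[Sequence[str]] = None,
) -> List[Row]:
    """Toy incremental filter (hypothetical, for illustration only).

    primary_key=None -> deduplicate by content hash
    primary_key=("id",) -> deduplicate by primary key
    primary_key=() -> deduplication disabled
    """
    dedup_enabled = primary_key is None or len(primary_key) > 0
    out: List[Row] = []
    for row in rows:
        value = row[cursor_field]
        if value < last_value:
            continue  # older than the last seen cursor value
        if dedup_enabled and value == last_value:
            # Only rows sitting exactly on the boundary are deduplicated.
            if row_key(row, primary_key) in seen_keys:
                continue
        out.append(row)
    return out


# Previous run ended at cursor value 5 and saw id=1 there.
previous = {"id": 1, "updated_at": 5, "status": "open"}
seen = {row_key(previous, ("id",))}

# The source now returns id=1 again with the same cursor value but new content.
incoming = [
    {"id": 1, "updated_at": 5, "status": "closed"},
    {"id": 2, "updated_at": 6, "status": "open"},
]

with_dedup = filter_rows(incoming, "updated_at", 5, seen, primary_key=("id",))
without_dedup = filter_rows(incoming, "updated_at", 5, set(), primary_key=())
```

In this model the changed `id=1` row is dropped when deduplicating by primary key, so a merging destination would never see the update; with `primary_key=()` both rows reach the destination.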

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

Please see https://github.com/dlt-hub/dlt/issues/971#issuecomment-1983417044

Proposed solution

    • [x] disable deduplication by setting the primary_key of Incremental to () when the resource has the "merge" write disposition. We already propagate the resource key to the Incremental key, so the change should be easy
    • [ ] keep explicitly set Incremental primary key (should already work like that)
    • [ ] observe apply_hints changes in the resource (_set_hints) and if primary
    • [ ] make sure that the with_hints marker, which is executed during resource execution, can apply the write disposition and disable deduplication
    • [x] also test for transformers that bind() the incremental late
    • [x] warn if the dedup state grows too big, i.e. > 200 entries
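The last checklist item could look roughly like the sketch below. It is only an illustration: `warn_if_dedup_state_large` and `DEDUP_STATE_WARN_THRESHOLD` are hypothetical names, and the threshold of 200 is taken from the checklist rather than from any confirmed dlt constant.

```python
import warnings
from typing import Collection

# Threshold from the checklist above; an assumption, not a confirmed dlt constant.
DEDUP_STATE_WARN_THRESHOLD = 200


def warn_if_dedup_state_large(
    unique_hashes: Collection[str],
    resource_name: str,
    threshold: int = DEDUP_STATE_WARN_THRESHOLD,
) -> bool:
    """Emit a warning when the stored dedup keys exceed the threshold.

    Returns True if a warning was emitted, False otherwise.
    """
    size = len(unique_hashes)
    if size > threshold:
        warnings.warn(
            f"Incremental dedup state for resource '{resource_name}' holds "
            f"{size} entries (more than {threshold}). Consider a more granular "
            "cursor field or disabling deduplication with primary_key=().",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

A large dedup state usually means many rows share the same boundary cursor value, so the warning points the user at the cursor granularity rather than failing the load.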

Related issues

No response

rudolfix, Mar 22 '24 10:03

Implementation strategy:

  1. read the write_disposition from the table_schema_template in DltResource._set_hints()
  2. read the write_disposition from the hints_template in DltResourceHints._set_hints()
  3. set the primary key of the incremental object to (), because an empty Sequence disables deduplication.
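The three steps above can be sketched with stand-in classes. Nothing here is dlt's actual code: `IncrementalStandIn` and `ResourceStandIn` are hypothetical, the hints template is modeled as a plain dict, and the rule that an explicitly set Incremental key is kept comes from the checklist in the issue description.

```python
from typing import Any, Dict, Optional, Sequence, Union

KeyHint = Union[str, Sequence[str]]


class IncrementalStandIn:
    """Hypothetical stand-in for the incremental object."""

    def __init__(self, cursor_path: str, primary_key: Optional[KeyHint] = None):
        self.cursor_path = cursor_path
        self.primary_key = primary_key
        # Remember whether the user set the key explicitly so it is never overridden.
        self.has_explicit_key = primary_key is not None


class ResourceStandIn:
    """Hypothetical stand-in for a resource that owns an incremental."""

    def __init__(self, incremental: IncrementalStandIn):
        self.incremental = incremental
        self.hints: Dict[str, Any] = {}

    def _set_hints(self, hints_template: Dict[str, Any]) -> None:
        # Steps 1 and 2: read the write disposition from the hints template.
        self.hints = dict(hints_template)
        write_disposition = hints_template.get("write_disposition")

        if self.incremental.has_explicit_key:
            return  # keep an explicitly set Incremental primary key

        if write_disposition == "merge":
            # Step 3: an empty sequence means "no deduplication".
            self.incremental.primary_key = ()
        else:
            # Otherwise propagate the resource key to the incremental.
            self.incremental.primary_key = hints_template.get("primary_key")


resource = ResourceStandIn(IncrementalStandIn("updated_at"))
resource._set_hints({"write_disposition": "merge", "primary_key": "id"})
merge_key = resource.incremental.primary_key  # () in this sketch
```

Because the decision lives in `_set_hints`, calling it again after the hints change (for example, switching from "merge" back to "append") re-evaluates the key in this sketch, which is the behavior the unchecked apply_hints item asks to verify.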

willi-mueller, Sep 27 '24 13:09