dlt
disable rows deduplication if Incremental is attached to a resource with `merge` write disposition
Feature description
Incremental removes duplicate rows (only those with the same cursor field value; see the docs) based on content hash or primary key. This is not the expected behavior when a merge key is present: the destination should be allowed to merge changes coming from the new records. We want to disable deduplication in Incremental and expose the destination to all rows from the data source.
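To illustrate the problem, here is a minimal, self-contained sketch of boundary deduplication. It is not dlt's actual implementation: `dedup_key` and `filter_new_rows` are hypothetical helpers, and the only behavior taken from this issue is that rows sharing the last cursor value are dropped by primary key or content hash, and that an empty primary key turns this off.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional, Sequence, Set

Row = Dict[str, Any]


def dedup_key(row: Row, primary_key: Optional[Sequence[str]]) -> str:
    """Key used to spot duplicates: primary key values if given, else a content hash."""
    payload: Any = [row[name] for name in primary_key] if primary_key else row
    encoded = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def filter_new_rows(
    rows: List[Row],
    cursor_field: str,
    last_value: Any,
    seen_keys: Set[str],
    primary_key: Optional[Sequence[str]] = None,
) -> List[Row]:
    """Hypothetical model: None -> content hash, () -> no dedup, names -> primary key."""
    dedup_disabled = primary_key is not None and len(primary_key) == 0
    kept: List[Row] = []
    for row in rows:
        value = row[cursor_field]
        if value < last_value:
            continue  # older than the stored cursor value
        if value == last_value and not dedup_disabled:
            if dedup_key(row, primary_key) in seen_keys:
                continue  # treated as a duplicate seen in the previous run
        kept.append(row)
    return kept


# Row 1 was loaded in the previous run; it changed but kept the same cursor value.
previous = {"id": 1, "updated_at": 10, "status": "open"}
seen = {dedup_key(previous, ("id",))}
incoming = [
    {"id": 1, "updated_at": 10, "status": "closed"},
    {"id": 2, "updated_at": 11, "status": "open"},
]

with_dedup = filter_new_rows(incoming, "updated_at", 10, seen, primary_key=("id",))
without_dedup = filter_new_rows(incoming, "updated_at", 10, seen, primary_key=())
```

In this sketch `with_dedup` loses the updated row with `id` 1, so a `merge` destination never sees the change, while `without_dedup` passes both rows through and leaves the merging to the destination.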
Are you a dlt user?
Yes, I'm already a dlt user.
Use case
Please see https://github.com/dlt-hub/dlt/issues/971#issuecomment-1983417044
Proposed solution
- [x] disable deduplication by setting `primary_key` of Incremental to `()` (disable dedup) when resource has "merge" write disposition. we already propagate resource key to Incremental key so change should be easy
- [ ] keep explicitly set Incremental primary key (should already work like that)
- [ ] observe `apply_hints` changes in the resource (`_set_hints`) and if primary
- [ ] make sure that using `with_hints` marker that is executed during resource execution is able to apply write disposition and disable the deduplication
- [x] also test for transformers that `bind()` the incremental late
- [x] warn if dedup state grows too big ie. > 200 entries
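The last item (warning when the dedup state grows beyond 200 entries) could look roughly like the sketch below. The function name, the message, and the use of Python's `warnings` module are assumptions made for illustration; only the threshold of 200 comes from this issue.

```python
import warnings
from typing import Sized

# Threshold taken from the checklist above; everything else here is hypothetical.
DEDUP_WARNING_THRESHOLD = 200


def warn_if_dedup_state_too_big(unique_hashes: Sized, resource_name: str) -> bool:
    """Emit a warning and return True when the stored dedup entries exceed the threshold."""
    count = len(unique_hashes)
    if count > DEDUP_WARNING_THRESHOLD:
        warnings.warn(
            f"Incremental dedup state for resource '{resource_name}' holds {count} entries "
            f"(more than {DEDUP_WARNING_THRESHOLD}). Consider disabling deduplication "
            "or using a cursor with finer granularity."
        )
        return True
    return False
```

A check like this would run whenever the state is updated, so a cursor with too coarse a granularity (many rows sharing the same value) is flagged early.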
Related issues
No response
Implementation strategy:
- read the write_disposition from the table_schema_template in DltResource._set_hints()
- read the write_disposition from the hints_template in DltResourceHints._set_hints()
- set the primary key of the incremental object to `()`, because an empty `Sequence` causes no deduplication.
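The strategy above, combined with the rule from the checklist that an explicitly set Incremental primary key is kept, can be sketched as a small resolution function. This is a hypothetical helper, not code from dlt: the function name and parameters are assumptions, and the real change would live in the `_set_hints()` methods mentioned above.

```python
from typing import Optional, Sequence, Tuple, Union

KeyHint = Union[str, Sequence[str]]


def _as_tuple(key: KeyHint) -> Tuple[str, ...]:
    """Normalize a single column name or a sequence of names to a tuple."""
    return (key,) if isinstance(key, str) else tuple(key)


def resolve_incremental_primary_key(
    write_disposition: str,
    resource_primary_key: Optional[KeyHint],
    incremental_primary_key: Optional[KeyHint] = None,
) -> Optional[Tuple[str, ...]]:
    """Pick the primary key the incremental should use for deduplication.

    Returns () to disable deduplication, or None to fall back to content hashing.
    """
    if incremental_primary_key is not None:
        # An explicitly set Incremental primary key always wins.
        return _as_tuple(incremental_primary_key)
    if write_disposition == "merge":
        # An empty sequence disables deduplication; the destination merges rows.
        return ()
    if resource_primary_key is not None:
        # Otherwise propagate the resource key, as is already done today.
        return _as_tuple(resource_primary_key)
    return None
```

Returning `()` rather than `None` is the important design choice in this sketch: per the issue, an empty sequence means "no deduplication", whereas no key at all falls back to content hashing.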