`scd2` merge strategy does not reinsert records
dlt version
0.5.2
Describe the problem
When using the scd2 merge strategy, records do not get reinserted after retirement:
- load 1: source snapshot contains record x ➜ record x is inserted in destination
- load 2: source snapshot does not contain record x ➜ record x is retired in destination ("valid to" field is filled with timestamp)
- load 3: source snapshot contains record x ➜ nothing happens (this is the bug)
The issue only occurs if record x is exactly the same. If the natural key is the same but one or more attributes differ, the record gets inserted as expected.
Expected behavior
Record x gets reinserted in load 3. The resulting destination table has two records for record x, with different validity windows.
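For illustration, the desired end state for natural key 1, assuming the default validity column names _dlt_valid_from and _dlt_valid_to (the timestamps are placeholders):
# desired rows for natural key 1 after load 3 (timestamps are placeholders)
expected_nk1_rows = [
    {"nk": 1, "c1": "foo", "c2": "foo",
     "_dlt_valid_from": "<load 1 ts>", "_dlt_valid_to": "<load 2 ts>"},  # retired window
    {"nk": 1, "c1": "foo", "c2": "foo",
     "_dlt_valid_from": "<load 3 ts>", "_dlt_valid_to": None},  # active again
]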
Steps to reproduce
import dlt
@dlt.resource(
    table_name="my_table",
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def r(data):
    yield data
pipe = dlt.pipeline(destination="postgres", dataset_name="scd2_reinsert_bug")
# load 1 — initial load
dim_snap = [
{"nk": 1, "c1": "foo", "c2": "foo"},
{"nk": 2, "c1": "bar", "c2": "bar"},
]
pipe.run(r(dim_snap))
# result: inserted 2 records (table size: 2 records)
# load 2 — delete natural key 1
dim_snap = [
{"nk": 2, "c1": "bar", "c2": "bar"},
]
pipe.run(r(dim_snap))
# result: retired record for natural key 1 (table size: 2 records)
# load 3 — reinsert natural key 1
dim_snap = [
{"nk": 1, "c1": "foo", "c2": "foo"},
{"nk": 2, "c1": "bar", "c2": "bar"},
]
pipe.run(r(dim_snap))
# result: nothing happened (table size: 2 records)
# desired result: reinsert record for natural key 1 (table size: 3 records)
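To verify the table contents after load 3, the pipeline's SQL client can be used (a quick check; _dlt_valid_from / _dlt_valid_to are dlt's default validity column names):
# inspect the destination table after load 3
with pipe.sql_client() as client:
    rows = client.execute_sql(
        "SELECT nk, c1, c2, _dlt_valid_from, _dlt_valid_to"
        " FROM my_table ORDER BY nk, _dlt_valid_from"
    )
    for row in rows:
        print(row)  # still only 2 rows; the retired nk=1 record was not reinserted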
Operating system
Windows
Runtime environment
Local
Python version
3.8
dlt data source
No response
dlt destination
Postgres
Other deployment details
No response
Additional information
No response
The challenge in fixing this is that (by default) the record hash gets stored in _dlt_id. Reinserting the same record (with the same hash) violates the uniqueness constraint on _dlt_id.
It would be best if the user could add an updated_at column or some kind of serial number tracking updates. (It could be used as the row version instead of computing a row hash, which costs a lot.)
Why is the coming back of _dlt_id a problem? All records with it are retired, so if you insert a new one that is not retired, there is no violation, I think.
> It would be best if the user could add an updated_at column or some kind of serial number tracking updates. (It could be used as the row version instead of computing a row hash, which costs a lot.)
I think the user can achieve this already with "bring your own hash". But we could make it more accessible by adding a surrogate_key option:
write_disposition={
    "disposition": "merge",
    "strategy": "scd2",
    "surrogate_key": ("customer_id", "updated_at"),
}
However, this will not solve the bug this issue is about. The bug concerns the case where every column value in a source record is identical to the corresponding value in an earlier source record, including updated_at or any other kind of update-tracking column.
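For reference, a minimal sketch of the existing "bring your own hash" route, assuming the scd2 strategy's row_version_column_name option is available in your dlt version and that the source supplies an updated_at value per record (both are assumptions, not confirmed above):
# sketch: use a source-supplied column as the row version instead of the
# computed row hash (assumes the "row_version_column_name" scd2 option;
# "updated_at" is a hypothetical source column)
import dlt

@dlt.resource(
    table_name="dim_customer",
    write_disposition={
        "disposition": "merge",
        "strategy": "scd2",
        "row_version_column_name": "updated_at",
    },
)
def dim_customer(data):
    yield data  # each record carries its own updated_at value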
> Why is the coming back of _dlt_id a problem? All records with it are retired, so if you insert a new one that is not retired, there is no violation, I think.
It violates the constraint because the validity columns ("valid from" / "valid to") are not part of the hash: the hash is calculated during normalization, while the validity columns are added during load. A retired record therefore has the same hash as a new active record if all column values in both records are the same.
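A minimal sketch of the collision (illustrative only, not dlt's actual hashing code):
import hashlib
import json

def row_hash(record: dict) -> str:
    # stand-in for the row hash computed during normalization:
    # only the source columns are hashed; the validity columns
    # do not exist yet at that point
    return hashlib.md5(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()

retired = {"nk": 1, "c1": "foo", "c2": "foo"}   # loaded in load 1, retired in load 2
reinsert = {"nk": 1, "c1": "foo", "c2": "foo"}  # identical snapshot row in load 3
assert row_hash(retired) == row_hash(reinsert)  # same _dlt_id -> uniqueness violation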