
`scd2` merge strategy does not reinsert records

Open jorritsandbrink opened this issue 1 year ago • 1 comment

dlt version

0.5.2

Describe the problem

When using the scd2 merge strategy, records do not get reinserted after retirement:

  • load 1: source snapshot contains record x ➜ record x is inserted in destination
  • load 2: source snapshot does not contain record x ➜ record x is retired in destination ("valid to" field is filled with timestamp)
  • load 3: source snapshot contains record x ➜ nothing happens (this is the bug)

The issue only occurs if record x is exactly the same as before. If the natural key is the same but one or more attributes differ, the record gets inserted as expected.

Expected behavior

Record x gets reinserted in load 3. The resulting destination table has two records for record x, with different validity windows.

Steps to reproduce

```python
import dlt

@dlt.resource(
    table_name="my_table",
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def r(data):
    yield data

pipe = dlt.pipeline(destination="postgres", dataset_name="scd2_reinsert_bug")

# load 1 — initial load
dim_snap = [
    {"nk": 1, "c1": "foo", "c2": "foo"},
    {"nk": 2, "c1": "bar", "c2": "bar"},
]
pipe.run(r(dim_snap))
# result: inserted 2 records (table size: 2 records)

# load 2 — delete natural key 1
dim_snap = [
    {"nk": 2, "c1": "bar", "c2": "bar"},
]
pipe.run(r(dim_snap))
# result: retired record for natural key 1 (table size: 2 records)

# load 3 — reinsert natural key 1
dim_snap = [
    {"nk": 1, "c1": "foo", "c2": "foo"},
    {"nk": 2, "c1": "bar", "c2": "bar"},
]
pipe.run(r(dim_snap))
# result: nothing happened (table size: 2 records)
# desired result: reinsert record for natural key 1 (table size: 3 records)
```

Operating system

Windows

Runtime environment

Local

Python version

3.8

dlt data source

No response

dlt destination

Postgres

Other deployment details

No response

Additional information

No response

jorritsandbrink avatar Aug 12 '24 08:08 jorritsandbrink

The challenge in fixing this is that (by default) the record hash gets stored in `_dlt_id`. Reinserting the same record (with the same hash) violates the uniqueness constraint on `_dlt_id`.
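The collision can be shown with a minimal sketch (not dlt's actual hashing code; function and column names here are illustrative): because only the business columns feed the hash, a retired row and its reinserted twin produce the same id.

```python
import hashlib
import json

def row_hash(record, exclude=("valid_from", "valid_to")):
    """Hash only the business columns, mimicking an id computed before
    validity columns exist (illustrative, not dlt's implementation)."""
    payload = {k: v for k, v in record.items() if k not in exclude}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

# a retired version of record x and its reinserted twin
retired = {"nk": 1, "c1": "foo", "c2": "foo", "valid_from": "2024-01-01", "valid_to": "2024-02-01"}
reinserted = {"nk": 1, "c1": "foo", "c2": "foo", "valid_from": "2024-03-01", "valid_to": None}

# identical business columns -> identical hash -> uniqueness conflict on the id
assert row_hash(retired) == row_hash(reinserted)
```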

jorritsandbrink avatar Aug 12 '24 08:08 jorritsandbrink

best if the user could add an `updated_at` column or some kind of serial number tracking updates (it could be used as a row version instead of computing a row hash, which costs a lot)

why is the coming back of a `_dlt_id` a problem? all records with it are retired, so if you insert a new one that is not retired, there is no violation I think

rudolfix avatar Aug 13 '24 17:08 rudolfix

> best if the user could add an `updated_at` column or some kind of serial number tracking updates (it could be used as a row version instead of computing a row hash, which costs a lot)

I think the user can already achieve this with "bring your own hash". But we could make it more accessible by adding a `surrogate_key` option:

    write_disposition={
        "disposition": "merge",
        "strategy": "scd2",
        "surrogate_key": ("customer_id", "updated_at")
    }

However, this will not solve the bug this issue is about: the case where every column in a source record is identical to every column in an earlier source record, including `updated_at` or any other kind of update tracking column.

> why is the coming back of a `_dlt_id` a problem? all records with it are retired, so if you insert a new one that is not retired, there is no violation I think

It violates the constraint because the validity columns ("valid from" / "valid to") are not part of the hash: the hash is calculated during normalization, while the validity columns are only added during load. A retired record therefore has the same hash as a new active record whenever all column values in both records are the same.
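For contrast, a hypothetical variant that did include the validity columns in the hash would give the reinserted record a fresh id and avoid the violation. This is a sketch only: as noted above, it is not feasible as-is in dlt, since the hash is computed during normalization, before the validity columns exist.

```python
import hashlib
import json

def versioned_hash(record):
    """Hypothetical variant: hash all columns, including the validity
    window, so a reinserted record gets a fresh id (sketch only)."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

retired = {"nk": 1, "c1": "foo", "valid_from": "2024-01-01", "valid_to": "2024-02-01"}
reinserted = {"nk": 1, "c1": "foo", "valid_from": "2024-03-01", "valid_to": None}

# different validity windows -> different ids -> no uniqueness violation
assert versioned_hash(retired) != versioned_hash(reinserted)
```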

jorritsandbrink avatar Aug 14 '24 08:08 jorritsandbrink