automate-dv
automate-dv copied to clipboard
[BUG] Duplicate records loaded for same hashdiffs
Describe the bug When loading a satellite incrementally, we can see duplicates come in if records with the same hashdiff are loaded twice.
Environment
dbt version: 1.1.0 dbtvault version: 0.8.3 Database/Platform: All
To Reproduce Steps to reproduce the behavior:
- Load a Satellite
- Load the same satellite again with the same records
- See duplicates
Expected behaviour Idempotent load, no duplicates
Additional context Failing test:
@fixture.satellite
Scenario: [SAT-IM-10] Load data into a non-existent satellite, where we have two records with different payloads and same keys
Given the SATELLITE table does not exist
And the RAW_STAGE table contains data
| CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_DOB | CUSTOMER_PHONE | LOAD_DATE | SOURCE |
| 1001 | Alice | 1997-04-24 | 17-214-233-1214 | 1993-01-01 | * |
| 1001 | Alice | 1997-04-24 | 17-214-233-1215 | 1993-01-01 | * |
And I stage the STG_CUSTOMER data
And I load the SATELLITE sat
And the RAW_STAGE table contains data
| CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_DOB | CUSTOMER_PHONE | LOAD_DATE | SOURCE |
| 1001 | Alice | 1997-04-24 | 17-214-233-1214 | 1993-01-01 | * |
| 1001 | Alice | 1997-04-24 | 17-214-233-1215 | 1993-01-01 | * |
And I stage the STG_CUSTOMER data
And I load the SATELLITE sat
Then the SATELLITE table should contain expected data
# We get 4 records here instead of 2
| CUSTOMER_PK | HASHDIFF | CUSTOMER_NAME | CUSTOMER_PHONE | CUSTOMER_DOB | EFFECTIVE_FROM | LOAD_DATE | SOURCE |
| md5('1001') | md5('1997-04-24\|\|1001\|\|ALICE\|\|17-214-233-1214') | Alice | 17-214-233-1214 | 1997-04-24 | 1993-01-01 | 1993-01-01 | * |
| md5('1001') | md5('1997-04-24\|\|1001\|\|ALICE\|\|17-214-233-1215') | Alice | 17-214-233-1215 | 1997-04-24 | 1993-01-01 | 1993-01-01 | * |
Why have we not seen this before? We didn't have test coverage for this specific situation, as I believe it is something to do with how hashdiffs are compared vs the latest records.