automate-dv icon indicating copy to clipboard operation
automate-dv copied to clipboard

[BUG] Duplicate records loaded for same hashdiffs

Open DVAlexHiggs opened this issue 2 years ago • 1 comments

Describe the bug When loading a satellite incrementally, we can see duplicates come in if records with the same hashdiff are loaded twice.

Environment

dbt version: 1.1.0 dbtvault version: 0.8.3 Database/Platform: All

To Reproduce Steps to reproduce the behavior:

  1. Load a Satellite
  2. Load the same satellite again with the same records
  3. See duplicates

Expected behaviour Idempotent load, no duplicates

Additional context Failing test:

@fixture.satellite
Scenario: [SAT-IM-10] Load data into a non-existent satellite, where we have two records with different payloads and same keys
  Given the SATELLITE table does not exist
  And the RAW_STAGE table contains data
    | CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_DOB | CUSTOMER_PHONE  | LOAD_DATE  | SOURCE |
    | 1001        | Alice         | 1997-04-24   | 17-214-233-1214 | 1993-01-01 | *      |
    | 1001        | Alice         | 1997-04-24   | 17-214-233-1215 | 1993-01-01 | *      |
  And I stage the STG_CUSTOMER data
  And I load the SATELLITE sat
  And the RAW_STAGE table contains data
    | CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_DOB | CUSTOMER_PHONE  | LOAD_DATE  | SOURCE |
    | 1001        | Alice         | 1997-04-24   | 17-214-233-1214 | 1993-01-01 | *      |
    | 1001        | Alice         | 1997-04-24   | 17-214-233-1215 | 1993-01-01 | *      |
  And I stage the STG_CUSTOMER data
  And I load the SATELLITE sat
  Then the SATELLITE table should contain expected data
    # We get 4 records here instead of 2
    | CUSTOMER_PK | HASHDIFF                                              | CUSTOMER_NAME | CUSTOMER_PHONE  | CUSTOMER_DOB | EFFECTIVE_FROM | LOAD_DATE  | SOURCE |
    | md5('1001') | md5('1997-04-24\|\|1001\|\|ALICE\|\|17-214-233-1214') | Alice         | 17-214-233-1214 | 1997-04-24   | 1993-01-01     | 1993-01-01 | *      |
    | md5('1001') | md5('1997-04-24\|\|1001\|\|ALICE\|\|17-214-233-1215') | Alice         | 17-214-233-1215 | 1997-04-24   | 1993-01-01     | 1993-01-01 | *      |

Why have we not seen this before? We didn't have test coverage for this specific situation, as I believe it is something to do with how hashdiffs are compared vs the latest records.

DVAlexHiggs avatar Jun 21 '22 09:06 DVAlexHiggs