meltano icon indicating copy to clipboard operation
meltano copied to clipboard

feature: New Load Method upsert-soft-delete

Open haleemur opened this issue 4 months ago • 0 comments

Feature scope

Other

Description

Description

Introduce a new load-method, called upsert-soft-delete.

This load method should work like an upsert and additionally soft delete records when a full-refresh is performed.

Scenario where this would be useful.

When records may be deleted from the source and the source does not implement soft-delete nor track deletes via any other mechanism.

Deletion Tracking is a limitation of incremental syncs. Our options are limited to:

  • Log Based Replication
  • Soft Delete in Source
  • Database Trigger based deletion tracking table

Each of these options have drawbacks, and may not even be possible in many source systems, as often the sources are not designed with data integration in mind.

In these cases data teams often resort to the following strategy

  • run incremental syncs at a high frequency (if possible, as in, there exists a monotic replication key and it can be used as a filter)
  • run full refresh at a lower frequency with overwrite

The deleted records get purged in the destination system when the full refresh is run with load method overwrite.

If a data team needs to track which records were deleted, then additional logic needs to be implemented in the downstream data pipepline.

Proposed Behaviour

When a full refresh is run and the load method is upsert-soft-delete, then the singer target should emit the following sql (in posgresql dialect)

-- first upsert
insert into my_final_table (id, col1, col2, col3)
select id, col1, col2, col3 from my_staging_table
on conflict (id) do update
set col1 = excluded.col1,
    col2 = excluded.col2,
    col3 = excluded.col3;

-- then soft delete 
update my_final_table 
set _sdc_deleted_at = current_timestamp
where _sdc_deleted_at is null
  and id not in (select id from my_staging_table);

This method of tracking soft delete would have an accuracy defined by how frequently the full refresh is performed. for instance, if the full refresh is performed every 24 hours, and the incremental refresh is performed every hour (except when the full refresh is done), then the deletion tracking timestamp may be off by as much as 24 hours.

haleemur avatar Oct 17 '24 06:10 haleemur