dlt icon indicating copy to clipboard operation
dlt copied to clipboard

convert sql merge jobs into MERGE sql statement

Open rudolfix opened this issue 1 year ago • 1 comments

Background dlt uses DELETE + INSERT to merge staging dataset into destination. It will also create temporary tables with _dlt_id of rows to be deleted and inserted. This may all not be necessary if MERGE sql is used. On supported database backend, MERGE looks even more standardized that various forms of DELETE that we had to hack.

Tasks

    • [ ] look at replace strategies and add merge strategies to config
    • [ ] pick a right merge job depending on merge strategy
    • [ ] make sure the deduplication of the staging datasets by primary key also takes place (hope MERGE will do it automatically)
    • [ ] which destinations to convert: duckdb (testing), bigquery, snowflake. postgres should work if duckdb works
    • [ ] tests and docs!

rudolfix avatar Mar 22 '24 09:03 rudolfix

See this draft PR for relevant discussions and code inspiration: https://github.com/dlt-hub/dlt/pull/1294

jorritsandbrink avatar May 02 '24 13:05 jorritsandbrink

My take, based on previous work in https://github.com/dlt-hub/dlt/pull/1294:

Add a new key-based merge strategy called upsert:

  • if key is not present in destination table ➜ insert record
  • if key is present in destination table ➜ update record

Source data can be both full or incremental extracts.

Constraints:

  • needs a primary_key
  • does not support merge_key (we need a one-to-one relationship between records in source and destination)

Furthermore, as with delete-insert:

  • do primary key-based deduplication
  • support dedup_sort and hard_delete column hints

Implementation:

  • do real updates as much as possible, i.e. no delete+insert mechanism
    • use the MERGE command to implement it for SQL destinations that support it
    • use INSERT INTO ... ON CONFLICT DO UPDATE on DuckDB (reference)
    • use the TableMerger API for delta table format on filesystem destination (implement in separate PR)
  • child tables still need deletes, to delete records for elements which have been removed from an updated list
  • hash primary key and store as _dlt_id in root table
    • use this as upsert key

jorritsandbrink avatar Jun 06 '24 21:06 jorritsandbrink

@jorritsandbrink my biggest question to the above: what you can do with it that you cannot do with the current merge? can you update from partial records? (on first look we'd need to add something on top to do that)

I had something more modest in mind:

  1. replace INSERT / DELETE with MERGE wherever we can and make it somehow optional - but this is just internal change without impact on the UX
  2. implement existing merge strategy for delta (using their TableMerger API whenever possible, INSERT DELETE otherwise)

rudolfix avatar Jun 10 '24 14:06 rudolfix

@rudolfix

"what you can do with it that you cannot do with the current merge?"

There's no extra functionality indeed. Clear semantics would be the main benefit. People are familiar with key-based "upsert", and know which behavior to expect.

"replace INSERT / DELETE with MERGE wherever we can and make it somehow optional - but this is just internal change without impact on the UX"

An issue I see here: the merge strategy is called delete-insert. It will be confusing if we mix MERGE behavior into it.

Also, what is the benefit? MERGE does not always outperform DELETE + INSERT. There's a lot of content online comparing the two approaches, for different database backends, and it doesn't seem either one of the approaches is clearly more performant.

It would be challenging to elegantly mix both approaches. I think it makes more sense to give the user an option:

  • delete-insert based on DELETE and INSERT SQL statements
    • works with merge_key and doesn't require primary_key
  • upsert based on MERGE statement
    • does not work with merge_key and requires primary_key

It's a leaky abstraction, but it enables the user to use the approach that is most performant for their workload.

What do you think?

jorritsandbrink avatar Jun 11 '24 10:06 jorritsandbrink

@jorritsandbrink yeah OK. I like the upsert idea.

  1. How are you planning to handle child tables? DELETE + INSERT? we may try to merge on root key id + unique column which is deterministic
  2. How long is it going to take?
  3. Do you plan to implement both strategies for delta?

rudolfix avatar Jun 11 '24 11:06 rudolfix

@rudolfix

  1. I would approach it like this:
  • use DELETE + INSERT as base implementation
  • use MERGE if WHEN NOT MATCHED BY SOURCE is supported (e.g. on mssql and databricks)

Note that the DELETE + INSERT approach only deletes updated records, so it differs from the delete-insert strategy, which deletes all records.

  1. We might be able to merge functionality for the new upsert strategy in one week. Perhaps a little longer if the different SQL destinations require much customization. I can use code from https://github.com/dlt-hub/dlt/pull/1294. Support for delta would be a separate PR.

  2. I'd say only upsert and scd2 at first and do delete-insert if/when requested in the community.

jorritsandbrink avatar Jun 11 '24 12:06 jorritsandbrink

@jorritsandbrink ad 1. how are you going to deal with deleted child records? ie. you have a list of strings in your main document and the list will get shorter. you need to be able to delete everything with list_idx > previous max list_idx

ad 2. OK! could we implement that for Snowflake first and try to merge it before other destinations? if that's adding too much work.

ad 3. Ok! I thing our initial merge strategy is the most powerful but I'll follow what you say. So we'll make upsert the default strategy for delta?

rudolfix avatar Jun 11 '24 14:06 rudolfix

@rudolfix

  1. I would do it based on root key and unique key like last time: https://github.com/dlt-hub/dlt/pull/1294/files#r1635024156
  2. I will do Snowflake first.
  3. Yes, upsert would be default for delta.

jorritsandbrink avatar Jun 11 '24 14:06 jorritsandbrink

@jorritsandbrink can we close it?

rudolfix avatar Jun 16 '24 21:06 rudolfix

@rudolfix I think we should close this ticket after #1466 is merged. I will create a new ticket for upsert support for destinations beyond postgres and snowflake.

jorritsandbrink avatar Jun 17 '24 08:06 jorritsandbrink