dlt
dlt copied to clipboard
convert sql merge jobs into MERGE sql statement
Background
dlt uses DELETE + INSERT to merge staging dataset into destination. It will also create temporary tables with _dlt_id of rows to be deleted and inserted. This may all not be necessary if MERGE sql is used. On supported database backend, MERGE looks even more standardized that various forms of DELETE that we had to hack.
Tasks
-
- [ ] look at replace strategies and add merge strategies to config
-
- [ ] pick a right merge job depending on merge strategy
-
- [ ] make sure the deduplication of the staging datasets by primary key also takes place (hope MERGE will do it automatically)
-
- [ ] which destinations to convert: duckdb (testing), bigquery, snowflake. postgres should work if duckdb works
-
- [ ] tests and docs!
See this draft PR for relevant discussions and code inspiration: https://github.com/dlt-hub/dlt/pull/1294
My take, based on previous work in https://github.com/dlt-hub/dlt/pull/1294:
Add a new key-based merge strategy called upsert:
- if key is not present in destination table ➜ insert record
- if key is present in destination table ➜ update record
Source data can be both full or incremental extracts.
Constraints:
- needs a
primary_key - does not support
merge_key(we need a one-to-one relationship between records in source and destination)
Furthermore, as with delete-insert:
- do primary key-based deduplication
- support
dedup_sortandhard_deletecolumn hints
Implementation:
- do real updates as much as possible, i.e. no delete+insert mechanism
- use the
MERGEcommand to implement it for SQL destinations that support it - use
INSERT INTO ... ON CONFLICT DO UPDATEon DuckDB (reference) - use the TableMerger API for
deltatable format onfilesystemdestination (implement in separate PR)
- use the
- child tables still need deletes, to delete records for elements which have been removed from an updated list
- hash primary key and store as
_dlt_idin root table- use this as upsert key
@jorritsandbrink my biggest question to the above: what you can do with it that you cannot do with the current merge? can you update from partial records? (on first look we'd need to add something on top to do that)
I had something more modest in mind:
- replace INSERT / DELETE with MERGE wherever we can and make it somehow optional - but this is just internal change without impact on the UX
- implement existing merge strategy for delta (using their TableMerger API whenever possible, INSERT DELETE otherwise)
@rudolfix
"what you can do with it that you cannot do with the current merge?"
There's no extra functionality indeed. Clear semantics would be the main benefit. People are familiar with key-based "upsert", and know which behavior to expect.
"replace INSERT / DELETE with MERGE wherever we can and make it somehow optional - but this is just internal change without impact on the UX"
An issue I see here: the merge strategy is called delete-insert. It will be confusing if we mix MERGE behavior into it.
Also, what is the benefit? MERGE does not always outperform DELETE + INSERT. There's a lot of content online comparing the two approaches, for different database backends, and it doesn't seem either one of the approaches is clearly more performant.
It would be challenging to elegantly mix both approaches. I think it makes more sense to give the user an option:
delete-insertbased onDELETEandINSERTSQL statements- works with
merge_keyand doesn't requireprimary_key
- works with
upsertbased onMERGEstatement- does not work with
merge_keyand requiresprimary_key
- does not work with
It's a leaky abstraction, but it enables the user to use the approach that is most performant for their workload.
What do you think?
@jorritsandbrink
yeah OK. I like the upsert idea.
- How are you planning to handle child tables? DELETE + INSERT? we may try to merge on root key id + unique column which is deterministic
- How long is it going to take?
- Do you plan to implement both strategies for delta?
@rudolfix
- I would approach it like this:
- use
DELETE+INSERTas base implementation - use
MERGEifWHEN NOT MATCHED BY SOURCEis supported (e.g. onmssqlanddatabricks)
Note that the DELETE + INSERT approach only deletes updated records, so it differs from the delete-insert strategy, which deletes all records.
-
We might be able to merge functionality for the new
upsertstrategy in one week. Perhaps a little longer if the different SQL destinations require much customization. I can use code from https://github.com/dlt-hub/dlt/pull/1294. Support fordeltawould be a separate PR. -
I'd say only
upsertandscd2at first and dodelete-insertif/when requested in the community.
@jorritsandbrink ad 1. how are you going to deal with deleted child records? ie. you have a list of strings in your main document and the list will get shorter. you need to be able to delete everything with list_idx > previous max list_idx
ad 2. OK! could we implement that for Snowflake first and try to merge it before other destinations? if that's adding too much work.
ad 3. Ok! I thing our initial merge strategy is the most powerful but I'll follow what you say. So we'll make upsert the default strategy for delta?
@rudolfix
- I would do it based on root key and unique key like last time: https://github.com/dlt-hub/dlt/pull/1294/files#r1635024156
- I will do Snowflake first.
- Yes,
upsertwould be default fordelta.
@jorritsandbrink can we close it?
@rudolfix I think we should close this ticket after #1466 is merged. I will create a new ticket for upsert support for destinations beyond postgres and snowflake.