How to do incremental update with new rows where update_key is not set
Suppose I have an Oracle table with two columns, ID (the primary key) and MDATUM (the update key).
| ID | MDATUM |
|---|---|
| 1 | NULL |
| 2 | NULL |
| 3 | 2024-09-18 02:00:00.000 |
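For reference, a minimal sketch of the starting state (the table and schema names are made up for illustration):

```sql
-- Hypothetical Oracle table used only for illustration; names are assumptions.
CREATE TABLE my_schema.my_table (
    id     NUMBER PRIMARY KEY,
    mdatum TIMESTAMP(3)  -- set only when a row is later updated
);

INSERT INTO my_schema.my_table (id, mdatum) VALUES (1, NULL);
INSERT INTO my_schema.my_table (id, mdatum) VALUES (2, NULL);
INSERT INTO my_schema.my_table (id, mdatum)
VALUES (3, TIMESTAMP '2024-09-18 02:00:00.000');
COMMIT;
```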
Now I upload this table to Snowflake.
Next, one row is added to the table (ID=4), and one row is updated (ID=2), which means it gets an MDATUM. The table now looks like this:
| ID | MDATUM | note |
|---|---|---|
| 1 | NULL | |
| 2 | 2024-09-18 03:00:00.000 | this row is updated |
| 3 | 2024-09-18 02:00:00.000 | |
| 4 | NULL | this row is new |
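In SQL terms, the change on the source side is roughly this (same hypothetical names as above):

```sql
-- ID=2 is modified, which sets its MDATUM
UPDATE my_schema.my_table
SET    mdatum = TIMESTAMP '2024-09-18 03:00:00.000'
WHERE  id = 2;

-- ID=4 is brand new, so its MDATUM is still NULL
INSERT INTO my_schema.my_table (id, mdatum) VALUES (4, NULL);
COMMIT;
```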
How can I get those two changed rows (the updated ID=2 and the new ID=4) into Snowflake using an incremental load?
When I load with
```
--mode incremental \
--primary-key 'ID' \
--update-key 'MDATUM' \
```
then I get this table:
| ID | MDATUM | note |
|---|---|---|
| 1 | NULL | |
| 2 | 2024-09-18 03:00:00.000 | correctly updated |
| 3 | 2024-09-18 02:00:00.000 | |
As you can see, ID=2 is updated as expected. The new row (ID=4), on the other hand, is not inserted into the target table.
I believe this is in line with your documentation.
However, from a practical standpoint, this is not helpful. My example above comes from one of our production tables, and the pattern is perfectly logical: when a new row is added, that is not an update, so the update_key stays empty; the update_key is only set when a row is later modified.
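As far as I understand, the incremental read effectively filters on the update_key, roughly like the sketch below (not Sling's actual SQL). A NULL MDATUM can never satisfy that comparison, so brand-new rows are silently skipped:

```sql
-- Rough sketch of what I assume the incremental filter does.
-- :max_mdatum is the max(MDATUM) already present in the Snowflake target.
SELECT id, mdatum
FROM   my_schema.my_table
WHERE  mdatum > :max_mdatum;  -- NULL > :max_mdatum is never TRUE, so ID=4 is excluded
```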
I therefore believe there should be another load strategy that loads "new records after max(update_key)" plus "new records with primary_key larger than max(primary_key)".
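Conceptually, such a strategy would select something like this (again only a sketch with assumed names, not a proposal for the exact SQL):

```sql
-- Rows updated since the last load OR rows that are entirely new.
-- :max_mdatum and :max_id are the current maxima in the Snowflake target.
SELECT id, mdatum
FROM   my_schema.my_table
WHERE  mdatum > :max_mdatum
   OR  id     > :max_id;
```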
A workaround is to run Sling twice:
First, I load using these parameters:
```
--mode incremental \
--primary-key 'ID' \
--update-key 'MDATUM' \
```
This gets me all rows where the update_key MDATUM is newer than max(MDATUM) in my target table, i.e., all updated rows.
Then I run Sling again and this time I use:
```
--mode incremental \
--primary-key 'ID' \
--update-key 'ID' \
```
This should get me all rows where the ID is higher than max(ID) in my target table, i.e., all new rows.
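Against the example above, the two passes would effectively select the following (sketch only, same hypothetical names):

```sql
-- Pass 1: update_key = MDATUM -> picks up the updated row ID=2
SELECT id, mdatum FROM my_schema.my_table
WHERE  mdatum > TIMESTAMP '2024-09-18 02:00:00.000';  -- max(MDATUM) in the target

-- Pass 2: update_key = ID -> picks up the new row ID=4
SELECT id, mdatum FROM my_schema.my_table
WHERE  id > 3;                                         -- max(ID) in the target
```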
I feel my use-case is quite plausible and should be quite widespread.
What are your thoughts on this? And do you see a problem with my workaround?
Would love to hear any feedback on whether using ID for both the primary key and the update key worked.
@justin-fundrise it works in practice. It will replace your existing records in place.