feat(duckdb): add microbatch support for incremental strategies
Added version-aware support for MERGE statements in the DuckDB adapter for DuckDB 1.4.0 and above. Introduced a `_supports_merge` property that checks and caches version compatibility, enabling the "merge" and "microbatch" incremental strategies alongside the existing "append" and "delete+insert". Updated the SQL macros to implement the microbatch and merge logic for efficient batch processing and upserts.
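For illustration, on DuckDB 1.4.0 and above the merge path produces an upsert of roughly this shape; the table and column names below are made up for the example, not taken from the actual macros:

```sql
-- Illustrative only: the kind of statement a merge-based upsert runs on DuckDB >= 1.4.0.
MERGE INTO events AS target
USING events__dbt_tmp AS source
    ON target.event_id = source.event_id
WHEN MATCHED THEN
    UPDATE SET event_time = source.event_time, value = source.value
WHEN NOT MATCHED THEN
    INSERT (event_id, event_time, value)
    VALUES (source.event_id, source.event_time, source.value);
```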
Ah @gfrmin thanks for this-- we also have #605 that adds support for the merge strategy and has a richer set of configs for it afaict.
I don't normally need to reconcile two different impls of the same feature; happy to discuss this in the dbt Slack or the DuckDB Discord
Oh wow, thanks for the tip! Will join the Slack to deduplicate efforts :)
@gfrmin something unhappy here in this latest run: https://github.com/duckdb/dbt-duckdb/actions/runs/18803285639/job/53661107934?pr=644
I updated how timestamps in the fixtures are defined to overcome 'timestamp is not UTC' errors
@gfrmin okay, 605 is in-- you should be good to rebase on main and then we can just focus on the microbatch stuff!
Hey! Great effort already. I have a couple of thoughts on this after trying it out on the ClickBench dataset and running into some big performance differences and out-of-memory issues.
The current microbatch implementation uses the new merge strategy, but falls back to delete+insert on DuckDB versions below 1.4.0. However, those strategies behave quite differently. The delete+insert path deletes with `where ({{ unique_key }}) in (select ({{ unique_key }}) from {{ source }})` on a non-composite key and then inserts, whereas the merge materialization matches rows against the full set of predicates (similar to composite keys with delete+insert).
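To spell out the difference, the non-composite-key delete+insert batch boils down to something like the following (table names here are illustrative), while merge instead matches each row against the full predicate set before updating or inserting:

```sql
-- Roughly what the single-key delete+insert path executes (illustrative names):
DELETE FROM events
WHERE (event_id) IN (SELECT (event_id) FROM events__dbt_tmp);

INSERT INTO events
SELECT * FROM events__dbt_tmp;
```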
From a performance perspective I think we should think about microbatching like this:
- batches get inserted by date part (hour/day/month/etc)
- row groups end up effectively clustered by this date part, without any explicit ordering, simply because batches are inserted in that order
- deleting existing row groups and replacing them should be fast, because the query planner can prune on something like `date_trunc('day', event_time)` based on the microbatching configuration (see the sketch below)
This has the added benefit that when ducklake is fully supported, you can do actual partition overwrites on the underlying folders.
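To make the pruning idea concrete, a daily microbatch could delete just its own slice before re-inserting it; the table, column, and date below are illustrative:

```sql
-- Delete only the rows that belong to this batch's day before re-inserting them.
DELETE FROM events
WHERE date_trunc('day', event_time) = DATE '2024-01-01';

-- An equivalent half-open range on event_time should prune at least as well:
-- DELETE FROM events
-- WHERE event_time >= TIMESTAMP '2024-01-01' AND event_time < TIMESTAMP '2024-01-02';
```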
I'd be happy to contribute the changes, but to conclude: I think we should infer the date part to delete on, and keep a separate microbatch materialization that can be adjusted when e.g. ducklake support arrives, rather than delegating to either merge or delete+insert now.
Additionally, I think the adapter's delete+insert strategy is now so close to dbt's default delete+insert strategy that we might as well fall back to that and have one less incremental strategy to maintain.
I spent a bit of time today trying to understand how the dbt microbatching process behaves on DuckDB with different configurations, but I'm honestly struggling to get it to work well. With limited memory (e.g. 8GB) I constantly get OOM errors, which would make the microbatch strategy unusable for me unless it runs on something like ducklake.
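In case it helps anyone reproducing this: the memory cap and thread count can be constrained directly with DuckDB settings. The values below just mirror the 8GB limit and the single-/multi-thread targets mentioned here; they aren't the exact profile from the repo:

```sql
-- Constrain DuckDB to reproduce the memory-limited runs (values illustrative).
SET memory_limit = '8GB';
SET threads = 1;  -- bump this up for the multi-thread target
```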
If anyone wants to replicate some of the tests I'm running, here's the repo I used to easily create a few tests based on clickbench data. These are the results on my end:
| model_name | target_name | model_status | model_runtime_s |
|---|---|---|---|
| table | incremental_multi_thread | success | 25.070116 |
| incremental__del_ins_date_partition | incremental_multi_thread | success | 81.30077 |
| incremental__del_ins_ukey_date | incremental_multi_thread | success | 87.08315 |
| incremental__del_ins_ukey | incremental_multi_thread | success | 93.9573 |
| microbatch_default | incremental_single_thread | success | 204.85132 |
| microbatch_ukey | incremental_multi_thread | partial success | 87.09769 |
| microbatch_default_concurrent | incremental_multi_thread | partial success | 93.76382 |
| microbatch_default | incremental_multi_thread | partial success | 95.302986 |
| microbatch_ukey_date_partition | incremental_multi_thread | partial success | 101.5172 |
| microbatch_ukey | incremental_single_thread | partial success | 220.9444 |
| microbatch_ukey_date_partition | incremental_single_thread | error | 1.297792 |
| microbatch_date_partition | incremental_multi_thread | error | 9.356611 |
| microbatch_date_partition | incremental_single_thread | error | 13.908075 |
| incremental__merge | incremental_multi_thread | error | 103.943726 |
| incremental__merge_update_columns | incremental_multi_thread | error | 109.29556 |
| incremental__merge_ukey_date | incremental_multi_thread | error | 114.55945 |
Hi @gfrmin, @jwills, sorry for the duplication, but unfortunately I can't push to this PR as a maintainer or to @gfrmin's branch, so I've created a separate PR, #681, to show my approach to microbatching with DuckDB. The changes are all in the microbatch.sql materialization. This prevents OOM errors, and I think it'll be more robust when support for ducklake is added in the future. Let me know what you think! 🙏