feat(duckdb): add microbatch support for incremental strategies

Open gfrmin opened this issue 2 months ago • 5 comments

Added version-aware support for MERGE statements in the DuckDB adapter for DuckDB versions 1.4.0 and above. Introduced a _supports_merge property to check and cache version compatibility, enabling the "merge" and "microbatch" incremental strategies alongside the existing "append" and "delete+insert" strategies. Updated the SQL macros to implement the microbatch and merge logic for efficient batch processing and upserts.
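For context, a minimal sketch of the kind of upsert this enables, assuming DuckDB's MERGE INTO follows the standard SQL form (the table, key, and column names are illustrative, not the adapter's actual macro output):

```sql
-- Illustrative only: upsert a batch into a hypothetical events table keyed by id.
MERGE INTO events AS tgt
USING events_batch AS src
    ON tgt.id = src.id
WHEN MATCHED THEN
    UPDATE SET value = src.value, event_time = src.event_time
WHEN NOT MATCHED THEN
    INSERT (id, value, event_time) VALUES (src.id, src.value, src.event_time);
```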

gfrmin · Oct 08 '25 13:10

Ah @gfrmin thanks for this-- we also have #605 that adds support for the merge strategy and has a richer set of configs for it afaict.

I don't normally need to reconcile two different impls of the same feature; happy to discuss this in the dbt Slack or the DuckDB Discord.

jwills · Oct 08 '25 16:10

Oh wow, thanks for the tip! Will join the Slack to deduplicate efforts :)

gfrmin · Oct 08 '25 16:10

@gfrmin something unhappy here in this latest run: https://github.com/duckdb/dbt-duckdb/actions/runs/18803285639/job/53661107934?pr=644

jwills · Oct 25 '25 18:10

I updated how timestamps in the fixtures are defined to overcome 'timestamp is not UTC' errors

gfrmin · Oct 29 '25 15:10

@gfrmin okay, #605 is in-- you should be good to rebase on main and then we can just focus on the microbatch stuff!

jwills · Oct 30 '25 18:10

Hey! Great effort already. I have a couple of thoughts on this after trying it out on the ClickBench dataset and seeing some big differences in performance as well as out-of-memory issues. The current microbatch implementation uses the new merge strategy, but falls back to delete+insert on DuckDB versions below 1.4.0. However, those strategies behave quite differently. The delete+insert strategy deletes with where ({{ unique_key }}) in (select ({{ unique_key }}) from {{ source }}) on a single, non-composite key and then inserts, whereas the merge strategy matches on the full set of predicates (similar to using composite keys with delete+insert).
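To make the contrast concrete, a rough sketch of that delete+insert pattern on a single key (hypothetical table and key names, not the adapter's generated SQL):

```sql
-- Illustrative delete+insert on a single unique key.
delete from events
where (id) in (select (id) from events_batch);

insert into events
select * from events_batch;
```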

From a performance perspective, I think we should approach microbatching like this:

  • batches get inserted by date part (hour/day/month/etc)
  • row groups end up organised by this date part, without any explicit ordering, simply because of when the rows were inserted
  • deleting existing row groups and replacing them should be fast, because the query planner can prune on something like date_trunc('day', event_time) based on the microbatching configuration

This has the added benefit that when ducklake is fully supported, you can do actual partition overwrites on the underlying folders.
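A minimal sketch of the date-part delete from the last bullet, assuming a daily batch size and an illustrative events table (the real predicate would be derived from the microbatch configuration):

```sql
-- Illustrative only: replace one day's worth of rows. The intent is that the
-- planner can prune row groups for this delete because rows arrived in date
-- order; an equivalent half-open range on event_time would prune as well.
delete from events
where date_trunc('day', event_time) = date '2024-01-01';

insert into events
select * from events_batch
where date_trunc('day', event_time) = date '2024-01-01';
```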

I'd be happy to contribute the changes, but to conclude: I think we should infer the date part to delete on, and have a separate microbatch materialization that can be adjusted when e.g. ducklake support arrives, instead of using either merge or delete+insert.

Additionally, I think the delete+insert strategy is now actually so close to the default dbt delete+insert strategy that we might as well fall back to that and have one fewer incremental strategy to maintain.

dumkydewilde · Dec 12 '25 11:12

I spent a bit of time today trying to understand how the dbt microbatching process behaves on DuckDB with different configurations, but I'm honestly struggling to get it to work in a decent way. With limited memory (e.g. 8 GB) I constantly get OOM errors, which would make the microbatch strategy unusable for me unless it were running on e.g. ducklake.
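(Side note for anyone replicating this: memory and concurrency constraints can be pinned per DuckDB connection; the values below simply mirror the limited-memory setup described above.)

```sql
-- Cap DuckDB's memory and thread count for a test run (values illustrative).
SET memory_limit = '8GB';
SET threads = 1;
```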

If anyone wants to replicate some of the tests I'm running, here's the repo I used to easily create a few tests based on ClickBench data. These are the results on my end:

| model_name | target_name | model_status | model_runtime_s |
| --- | --- | --- | --- |
| table | incremental_multi_thread | success | 25.070116 |
| incremental__del_ins_date_partition | incremental_multi_thread | success | 81.30077 |
| incremental__del_ins_ukey_date | incremental_multi_thread | success | 87.08315 |
| incremental__del_ins_ukey | incremental_multi_thread | success | 93.9573 |
| microbatch_default | incremental_single_thread | success | 204.85132 |
| microbatch_ukey | incremental_multi_thread | partial success | 87.09769 |
| microbatch_default_concurrent | incremental_multi_thread | partial success | 93.76382 |
| microbatch_default | incremental_multi_thread | partial success | 95.302986 |
| microbatch_ukey_date_partition | incremental_multi_thread | partial success | 101.5172 |
| microbatch_ukey | incremental_single_thread | partial success | 220.9444 |
| microbatch_ukey_date_partition | incremental_single_thread | error | 1.297792 |
| microbatch_date_partition | incremental_multi_thread | error | 9.356611 |
| microbatch_date_partition | incremental_single_thread | error | 13.908075 |
| incremental__merge | incremental_multi_thread | error | 103.943726 |
| incremental__merge_update_columns | incremental_multi_thread | error | 109.29556 |
| incremental__merge_ukey_date | incremental_multi_thread | error | 114.55945 |

dumkydewilde · Dec 19 '25 15:12

Hi @gfrmin, @jwills, sorry for the duplication, but I unfortunately cannot push to this PR as a maintainer or to @gfrmin's branch, so I've created a separate PR in #681 to show my approach for microbatching with DuckDB. The changes are all in the microbatch.sql materialization. This prevents the OOM errors, and I think it'll be more robust when adding support for ducklake in the future. Let me know what you think! 🙏

dumkydewilde · Dec 23 '25 14:12