dbt-duckdb
Incremental external models
Hi
Do you think it's feasible to add support for incremental materialisation to external locations? I.e., extending the external materialisation to support incremental semantics.
Namely, I was thinking of the simplest case: an "append" strategy implemented as just writing a new parquet file, with the model overall defined with a glob location.
Do you see any drawbacks to such an approach? I'm happy to submit a PR if you point me in the right direction in terms of what API you'd expect this to have. Should this be part of the external materialisation, of the incremental one, or a completely new one? A sketch of what I have in mind is below.
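Purely as an illustrative sketch (the `materialized='external'` + `incremental_strategy='append'` combination is the proposed behaviour, not something the adapter supports today, and the paths and column names are made up):

```sql
-- models/events_external.sql (hypothetical config for the proposed feature)
{{ config(
    materialized='external',
    incremental_strategy='append',
    location='s3://my-bucket/events/*.parquet'
) }}

select * from {{ source('raw', 'events') }}
{% if is_incremental() %}
  -- only pull rows newer than what the existing files already cover
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

Each incremental run would write its result set to a new parquet file under the glob, rather than rewriting the whole location.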
Thanks, Nuno
Hey Nuno! I don't know if that is feasible; as I understand it, dbt only supports incremental materialization against the table materialization type. The only way to do it, I think, would be to unify the external materialization into the table materialization (so that an external materialization would become just a special case of a table materialization), but I don't have a great sense for how complex that would be to do.
Hey @nfcampos, parquet doesn't (AFAIK) support any in-place updates that aren't a full overwrite, which means we can't directly append into parquet files. I'm unsure about CSVs; at a guess, inserts might be possible, but I doubt updates would be.
Generally, this smells a lot like Iceberg or something similar. I remember when @tomsej and I were initially discussing external materialisations (over in https://github.com/jwills/dbt-duckdb/issues/19), we made a decision not to make this adapter too file-management-y, which I personally think this pattern could quickly make it.
What I meant with parquet files was to add a new parquet file to the same directory, with the model defined using a glob; see the DuckDB docs on multi-file reads and globs: https://duckdb.org/docs/data/parquet#multi-file-reads-and-globs
But I agree that this walks in the direction of things like Iceberg etc. If only Iceberg could be used in a more serverless way, i.e. without a catalog server/db running somewhere... Actually, maybe it can, with a JDBC catalog backed by a SQLite db, but I haven't tried that 🤔 (the only mention of something like this I can find is https://github.com/apache/iceberg/issues/3245#issuecomment-1031730083)
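For reference, the DuckDB side of this pattern is straightforward. A minimal sketch, where the table name and file paths are made up:

```sql
-- write this run's batch as its own parquet file (paths are illustrative)
COPY (SELECT * FROM new_rows)
  TO 'data/my_model/batch_0042.parquet' (FORMAT PARQUET);

-- downstream reads then pick up every batch via a glob
SELECT * FROM read_parquet('data/my_model/*.parquet');
```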
Keep an eye on https://github.com/sutoiku/puffin re serverless iceberg 😉 I'm very excited about this.
On globs: I've used them and they work fine. That's something it would be cool to extend the materializations with. I will probably spike this out over the weekend and follow up here 👍
@jpmmcneill Ah did not know about Puffin, very interesting